# POLITECNICO DI TORINO Master of Science in Electronic Engineering ### Master Thesis # Hardware acceleration for post-quantum cryptography Supervisors Prof. Guido Masera Prof. Maurizio Martina Candidate Flavio Tanese December 2018 #### Abstract Communication security of today heavily relies on the assumption that some mathematical problems are extremely difficult to solve and thus breaking encryptions based on such problems requires a very long time. While such encryptions are secure now, the probable diffusion of quantum computers in the foreseeable future makes the initial assumption fall short: quantum computations are efficient at breaking the most widespread algorithms in use. Post-quantum cryptographic systems are based on problems that are not (or marginally) affected by the peculiarity of quantum computing: AES[3] and many other hashing functions fall in this category, with quantum operations just moving the problem from O(N) to $O(\sqrt{N})$ , with N being the number of operations needed to find a solution. This is effectively countered by using double the number of bits and squaring the complexity. Other proposals are based on variations of error-correcting codes used in data transmission, so that the data is encoded and errors are purposely introduced in the encrypted version. With no a priori knowledge on the location of such errors, reverseengineering the generation matrix becomes a very arduous task, making the system de facto equivalent to the prime-based asymmetric key system in use today but without the vulnerability to quantum attacks. This work is focused on a hardware implementation of such a system, for use in low power applications that are likely to generate the bulk of encrypted traffic in the near future. # Contents | 1 | Bas | ics of cryptography | 3 | |---|-----|---------------------------------------------------------|-----------| | | 1.1 | Symmetric and asymmetric ciphers | 4 | | | 1.2 | RSA | 6 | | | 1.3 | Shor's algorithm | 7 | | 2 | The | e LEDAcrypt cryptosystem | 8 | | | 2.1 | QC-LDPC codes | 8 | | | 2.2 | LEDAcrypt's keys | 9 | | | 2.3 | Encryption and decryption | 10 | | 3 | Har | dware implementation of LEDAcrypt decryption | <b>12</b> | | | 3.1 | Assumptions on memory | 13 | | | 3.2 | System parameters | 13 | | 4 | Key | reconstruction | 15 | | | 4.1 | Circulant block multiplication | 15 | | | | 4.1.1 Out-of-order result and modulo $p$ implementation | 17 | | | | 4.1.2 Result sorting | 19 | | | | 4.1.3 Modulo 2 on compressed matrices | 20 | | | 4.2 | Circulant block sum | 21 | | | | 4.2.1 Memory movement | 22 | | | 4.3 | Quasi-cyclic multiplication | 22 | | 5 | Vec | tor by matrix multiplication (and vice-versa) | 25 | | | 5.1 | Vector by circulant matrix | 25 | | | 5.2 | Circulant matrix by vector | 27 | | | 5.3 | $x \text{ by } \mathbf{L}^T$ | 28 | | | 5.4 | $\mathbf{H}^T$ by $s^T$ | 29 | | | 5.5 | $\mathbf{Q}^T$ by $\Sigma^T$ | 30 | | | 5.6 | $e$ by $\mathbf{H}^T$ | 30 | | 6 | $\mathbf{Err}$ | or update | 32 | |--------------|----------------|----------------------------------------|----| | | 6.1 | Peak search | 32 | | | 6.2 | Row extraction from compressed matrix | 33 | | | 6.3 | Vector plus compressed row | 33 | | 7 | Mai | in loop state machine | 35 | | | 7.1 | Design modularity and shared resources | 36 | | 8 | Con | aclusions | 38 | | $\mathbf{A}$ | Sou | rce code | 40 | | | A.1 | Key reconstruction | 40 | | | | A.1.1 Sorting | 43 | | | | A.1.2 Circulant sum | 47 | | | | A.1.3 Memory copy | 50 | | | A.2 | Vector by matrix | 52 | | | | A.2.1 Vector by circulant | 52 | | | | A.2.2 $x$ by $\mathbf{L}^T$ | 55 | | | | A.2.3 $\mathbf{H}^T$ by $s^T$ | 56 | | | | A.2.4 $\mathbf{Q}^T$ by $\Sigma^T$ | 58 | | | | A.2.5 $e$ by $\mathbf{H}^T$ | 60 | | | A.3 | Error update | 62 | | | A.4 | Loop control and message computation | 66 | | | A.5 | Top module | 71 | # Chapter 1 # Basics of cryptography Cryptography comes from two ancient Greek words that more or less translate to "hidden writing", originally with the objective of having a reliable way to deliver military orders through messengers without the enemy understanding intercepted ones[4]. While this specific application proved by far the biggest drive to cryptography up to recent times, this "hidden writing" capability is now heavily used by civilians too due to the vast amount of sensitive information that is transmitted through potentially unsecure channels. Traditionally, the agents in cryptography examples are called Alice (A, the sender), Bob (B, the recipient) and Eve (E, the eavesdropper). Alice and Bob use cryptography so they can communicate without Eve being able to understand the message, even though Eve might intercept the code (from here on, "code" is used to refer to the encrypted version of the message). Since it is assumed that Eve knows the code, Bob must have some information not contained in the code that can be used to get the message back from it: this information is known as the "key" (figure 1.1). In the oldest and simplest ciphers, the number of possible keys was usually quite small, so that Eve could simply try them all and see which key yielded a message that made sense. This was only effective as long as Eve did not have a clue about the mechanism of the cipher, so that Alice and Bob mostly relied on what is called "security by obscurity" and had to keep the cipher itself secret. Since a secret mechanism does not scale up well with many possible recipients and devising a new cipher for each recipient would be a daunting task (that still requires a secure channel anyway), modern ciphers are public, while relying on other features to protect messages. These features arise from particular mathematical properties and are meant to prevent anyone not having the key from decrypting the code, while the possible number of keys is so big that brute forcing (i.e. trying them all one by one) is pointless. Figure 1.1: Alice, Bob and Eve ### 1.1 Symmetric and asymmetric ciphers Ciphers in use today fall into two broad categories: symmetric and asymmetric ciphers. The former category is made up with all systems that require Alice and Bob to know the same key (hence "symmetric"), and Eve not to know the key: it requires a secure channel for sharing the key between Alice and Bob in the first place (figure 1.2). The latter is made up with systems that have Bob know a private key nobody else knows (hence "asymmetric"), and everyone know Bob's public key that is used to encrypt the message (figure 1.3). The system is then conceived in such a way that only Bob's private key can decrypt what was encrypted with the public key. Asymmetric ciphers, while not extremely complicated in their most basic form, are much more recent than symmetric ciphers: the earliest military implementation was devised in 1973, while the first civilian algorithm dates 1976. Symmetric ciphers are usually very simple and very fast to implement both in software and hardware. Unfortunately, they can only be used when Alice and Bob have a way to agree on a key without Eve intercepting it. Asymmetric ciphers let Alice and Bob communicate without any need to share their private keys, but are usually very slow and possibly quite complex. Neither of the categories can respond to the need for a massive amount of data to be transferred securely and quickly, but an asymmetric cipher can be used to send a symmetric key without Eve knowing it, and that key can then be used to encrypt and decrypt the data (figure 1.4). The Internet itself relies on such a method for its TLS (Transport Layer Security) protocol. While TLS does not mandate any particular algorithm, the most common choice is RSA (Rivest-Shamir-Adleman, its inventors) as Figure 1.2: Secure channel for key transmission Figure 1.3: Asymmetric key not requiring a secure channel Figure 1.4: Asymmetric cypher creating a secure channel for a symmetric key asymmetric cipher and AES (Advanced Encryption Standard) as symmetric cipher. ### 1.2 RSA RSA[5] is a relatively simple asymmetric cypher published in 1978, just two years after the first non-classified paper on such cryptosystems by Diffie and Hellman. The simplest explanation of RSA, while not really exact, is straightforward: choose two very big prime numbers, keep them secret and provide their product as public key. Factoring such a huge number into its two prime factors is extremely demanding in terms of computational power, and whenever processors become faster and more powerful it is a simple matter of choosing even bigger starting numbers, making the algorithm extremely scalable. The actual RSA key generation algorithm is as follows: - $\bullet$ Randomly pick two distinct prime numbers, p and q - Compute n as n = pq - Compute $\lambda(n)$ as the least common multiple of p-1 and q-1 - Randomly pick an integer e that is smaller than $\lambda(n)$ and coprime with it (no common divisors other than 1) - Compute d such that $de \pmod{\lambda(n)} = 1$ • Share n and e as public key, while keeping d (and technically n) as private key Encryption of a message m is straightforward, if computationally intensive, as $x = m^e \pmod{n}$ . Decryption is very similar in that $m = x^d \pmod{n}$ . ### 1.3 Shor's algorithm From the description of the RSA algorithm it can be noted that once an attacker is able to get p and q he can also easily compute d using the same procedure that is used for key generation. While it is not proved that computing d requires explicitly factoring n, no known method that exploits the availability of e has been published. It is thus paramount that getting p and q from n be extremely time consuming: classical algorithms for factorization require exponential time, and, while the shorter 1024-bit RSA keys might be breakable given enough time and resources, the longer keys up to 4096 bits are still impregnable to any foreseeable attack. In 1994 Peter Shor, at the time working at Bell Laboratories and now professor of mathematics at MIT, devised an algorithm[6] that can efficiently factor any number that is not an integer power of a prime number. Since a requirement of RSA is that p and q are prime and different, the condition for applying Shor's algorithm holds. While the inner workings of the algorithm are out of the scope of this thesis, the general idea of the algorithm is that through quantum operations it is possible to obtain the period of the function $f(x) = a^x \pmod{n}$ , which is in turn directly related to p and p. The algorithm requires $2\log_2(n)$ quantum bits to be effective, and while this amounts to several thousands qubits (a far stretch from the 50 qubits available to the most powerful devices at the beginning of 2018) the number is not inherently prohibitive assuming quantum computing will undergo a similar evolution as classical computing [7]. As a direct consequence of this perceived danger, researches have been devising alternative cyphering systems that are supposedly robust to attacks coming from future quantum computers. # Chapter 2 # The LEDAcrypt cryptosystem The LEDAcrypt cryptosystem, developed by Marco Baldi, Alessandro Barenghi, Franco Chiaraluce, Gerardo Pelosi and Paolo Santini, is actually not one cryptosystem but two. The first one, LEDAkem[1], is a Key Encapsulation Mechanism, while the second one, LEDApkc, is a Private Key Cryptosystem. They are however very similar in concept and implementation, so they will be treated together from here on. LEDAcrypt is built on the McEliece cryptosystem [8], that uses linear codes. The basic idea behind this cryptosystem is that decoding a generic error-correcting code without knowing the decoding function is NP-hard. This in turn requires being able to give a public key for anyone to encrypt a message, while the private key that decodes the message is kept secret and cannot be obtained from the public one. While the McEliece cryptosystem is quite robust, with no known attacks that cannot be neutralized by slight modification of the original system, it has almost never been used due to the sheer dimension of the keys it requires. A standard set of keys for a McEliece cryptosystem can be as big as 500 kb, which is an obvious setback if compared to RSA's 4 kb. Honouring the convention used by the authors of the cryptosystem in the original paper, in this thesis vectors are row vectors unless otherwise specified and transposed vectors are column vectors. ### 2.1 QC-LDPC codes LEDAcrypt uses QC-LDPC (Quasi-Cyclic Low-Density Parity-Check) codes, that are based on quasi-cyclic binary matrices (hence the name). Quasi-cyclic matrices are matrices having circulant blocks: each block can be completely described by its first row. With a block size $p \times p$ and a reasonable p value, this leads to keys more than 25,000 times smaller than they would be if they were not circulant. These quasi-cyclic blocks, however, are also extremely sparse and binary. This property means it is possible to write, for each circulant block, the position of set elements on the first row (knowing their value is one), while everything else is assumed to be zero. A typical block is thus described with a small number of integers and takes up only a few bytes. ### 2.2 LEDAcrypt's keys The particular code used by LEDAcrypt is made up with two matrices forming the private key, from here on called $\mathbf{H}$ and $\mathbf{Q}$ : $$\mathbf{H} = \begin{bmatrix} \mathbf{H_0} & | & \mathbf{H_1} & | & \cdots & | & \mathbf{H_{n_0-1}} \end{bmatrix}$$ (2.1) $$\mathbf{Q} = \begin{bmatrix} \mathbf{Q}_{0,0} & | & \mathbf{Q}_{0,1} & | & \cdots & | & \mathbf{Q}_{0,\mathbf{n}_0-1} \\ \mathbf{Q}_{1,0} & | & \mathbf{Q}_{1,1} & | & \cdots & | & \mathbf{Q}_{1,\mathbf{n}_0-1} \\ \vdots & | & \vdots & | & \ddots & | & \vdots \\ \mathbf{Q}_{\mathbf{n}_0-1,0} & | & \mathbf{Q}_{\mathbf{n}_0-1,1} & | & \cdots & | & \mathbf{Q}_{\mathbf{n}_0-1,\mathbf{n}_0-1} \end{bmatrix}$$ (2.2) Each block $\mathbf{H}_i$ in (2.1) and $\mathbf{Q}_{ij}$ in (2.2) has size $p \times p$ , with p prime: this makes the system immune to a particular type of attack and ensures invertibility of a matrix that will need inversion to compute the public key. Parameter $n_0$ is a small integer, that can be as small as 2. All blocks $\mathbf{H}_i$ of $\mathbf{H}$ have weight (number of set elements) $d_v$ , with a standard choice being 17, while blocks of $\mathbf{Q}$ have a weight according to the following map (which is, by the way, circulant as well): $$\mathbf{W} = \begin{bmatrix} m_0 & | & m_1 & | & \cdots & | & m_{n_0-1} \\ m_{n_0-1} & | & m_0 & | & \cdots & | & m_{n_0-2} \\ \vdots & | & \vdots & | & \ddots & | & \vdots \\ m_1 & | & m_2 & | & \cdots & | & m_0 \end{bmatrix}$$ where $m_i$ are again small integer values. From matrices $\mathbf{H}$ and $\mathbf{Q}$ a new matrix $\mathbf{L}$ is obtained as: $$\mathbf{L} = \mathbf{H}\mathbf{Q} = \begin{bmatrix} \mathbf{L_0} & | & \mathbf{L_1} & | & \cdots & | & \mathbf{L_{n_0-1}} \end{bmatrix}$$ Given a proper choice of parameters $d_v$ and $\underline{m} = [m_0, m_1, ..., m_{n_0-1}]$ the inventors of the cryptosystem have proven that $\mathbf{L}_{\mathbf{n_0}-\mathbf{1}}$ is invertible. This means any possible secret key satisfying the constraints on the parameters can be used to compute a corresponding public key M such that: $$\mathbf{M} = \mathbf{L}_{\mathbf{n_0}-1}^{-1}\mathbf{L} = \begin{bmatrix} \mathbf{M_0} & | & \mathbf{M_1} & | & \cdots & | & \mathbf{M_{n_0-2}} & | & \mathbf{I_p} \end{bmatrix} = \begin{bmatrix} \mathbf{M_l} | \mathbf{I_p} \end{bmatrix}$$ The generator matrix is then obtained as: $$G' = \begin{bmatrix} I_{p(n_0-1)} & | & M_l^T \end{bmatrix}$$ with $\mathbf{M_l^T}$ being the transpose of $\mathbf{M_l}$ . An important thing to notice is that $\mathbf{M}$ , albeit dense and thus not possible to compress as much as $\mathbf{H}$ and $\mathbf{Q}$ , is quasi-cyclic as well. This leads to a public key of size $p(n_0-1)$ bits, as the last p bits of $\mathbf{M}$ are known by construction and $\mathbf{G}'$ is obtained easily from $\mathbf{M_l}$ . ### 2.3 Encryption and decryption The ciphertext $\underline{x}$ of size $1 \times pn_0$ is obtained by multiplying a message $\underline{u}$ of size $1 \times p(n_0 - 1)$ by the generator matrix G' as follows: $$\underline{x} = \underline{u}\mathbf{G}' + \underline{e}$$ with $\underline{e}$ being a purposely introduced error having weight t which is low enough for the code to correct with a very high chance. This is necessary because the first $p(n_0 - 1)$ bits of $\underline{u}\mathbf{G}'$ correspond to $\underline{u}$ itself. The decryption algorithm used by LEDAcrypt is a custom bit-flipping algorithm that succeeds when the syndrome of the code is null (since the fundamental property of the syndrome in linear codes is that it is only null for valid codewords, this effectively amounts to having removed the error). The starting syndrome $\underline{s}$ is computed as $$\underline{s}^T = (\mathbf{HQ})\underline{x}^T$$ and updated with an iterating algorithm, while the error $\underline{e}$ is initialized to a zero vector. The main loop of decryption involves computing a vector $\underline{R}$ such that $$\underline{\Sigma}^{(l)} = \underline{s}^{(l-1)} \mathbf{H}$$ $$\underline{R}^{(l)} = \underline{\Sigma}^{(l)} \mathbf{Q}$$ with $\underline{\Sigma}$ and $\underline{R}$ being vectors of natural numbers, in contrast with every other vector and matrix which are binary. Figure 2.1: Encryption in LEDAcrypt Figure 2.2: Decryption in LEDAcrypt It is now necessary to find the positions in which $\underline{R}^{(l)}$ is maximum, here denoted as set $\mathfrak{J}^{(l)}$ . These positions are the ones that most likely correspond to wrong bits. Flipping bits is not done directly on the received code $\underline{x}$ , but rather the knowledge of $\mathbf{H}$ and $\mathbf{Q}$ allows for direct incremental updating of $\underline{e}$ and $\underline{s}$ . The new value of $\underline{e}$ is obtained as $$\underline{e}^{(l)} = \underline{e}^{(l-1)} + \sum_{v \in \mathfrak{J}^{(l)}} \underline{q}_v$$ with $\underline{q}_v$ being the $v^{\text{th}}$ row of $\mathbf{Q}$ and v being one of the indices corresponding to maximum $\underline{R}^{(l)}$ . Having now the updated error, the updated syndrome is found as $$\underline{s}^{(l)} = \underline{s}^{(l-1)} + \underline{e}^{(l)}\mathbf{H}^T$$ and the algorithm either terminates (due to a null syndrome or exceeding the number of permitted iterations) or starts a new cycle of the main loop. If the algorithm terminated due to null syndrome, the error is then known as the last $\underline{e}^{(l)}$ and it is then easy to get the message as the first part of the corrected code: $$\underline{u}\mathbf{G}' = \underline{x} + \underline{e}^{(l)}$$ $$\underline{u} = (\underline{x} + \underline{e}^{(l)})[0 : p(n0 - 1) - 1]$$ # Chapter 3 # Hardware implementation of LEDAcrypt decryption This thesis is aimed at obtaining a hardware implementation of the decryption system described in chapter 2. The chosen Hardware Description Language was VHDL (VHSIC Hardware Description Language, with VHSIC standing for Very High Speed Integrated Circuits), as the one that the author was most familiar with at the beginning of the work, although it must be noted that a more modern alternative exists in the form of SystemVerilog. The syntaxes of these two languages, while mostly presenting clear parallelisms, are quite different and have different tradeoffs: VHDL is a very mature language, quite limited in core features, very well supported by EDA (Electronic Design Automation) tools but quite pedantic, especially in terms of typing checks and process triggers; SystemVerilog is much more lenient but only a subset of the extensive standard is supported by EDA tools and the additional flexibility necessarily implies additional risk of inadvertently making a mistake that is interpreted as a legal construct. The reason for supporting the decryption specifically is that the operations involved in encrypting are quite cheap to perform on a general purpose processor, as it is a single pass operation consisting of copying a message, padding it with the result of a single vector-by-matrix multiplication and adding a few errors. Decrypting, on the other hand, features multiple vector-by-matrix multiplications, peak detection and vector sums in a loop which could keep the processor busy for longer than it is acceptable. Figure 3.1: Decoder and memory ### 3.1 Assumptions on memory An important detail is that the decryption operation is heavily limited by access to memory. This is due to the extreme reliance of the algorithm on linear algebra using very big matrices and vectors, that have to reside in some kind of RAM: while storing everything in flip-flops is theoretically possible, the parameters giving the least secure implementation would still result in 225,000 flip-flops as a conservative estimate. Because of the huge area it would require, parallel access to the entire vector is impossible: to avoid excessive restraint on the algorithm I assumed a memory that can read two values per cycle and write one is available (as in figure 3.1). This would of course be custom on-chip memory, and in case of an FPGA implementation an emulation could be achieved with parallel writing on multiple memory chips at the cost of increased storage occupation. A single read or write operation per cycle would take almost three times to complete decryption, making dedicated hardware somewhat redundant, and sharing memory with the main processor would likely defeat the purpose entirely by occupying the bus. ### 3.2 System parameters The implementation is completely controlled by parameters, meaning that any legal combination of block size, weights and number of blocks can be implemented by plugging in the desired values. Memory mapping is automatic and has no impact, but it can be easily tweaked too to fit into a larger module featuring secure communication with the processor if need be. Default parameters are as follows: | $n_0$ | p | $d_v$ | $\underline{m}$ | | | |-------|-------|-------|-----------------|--|--| | 2 | 27779 | 17 | [4, 3] | | | with $n_0$ being the number of circulant blocks in matrix $\mathbf{H}$ , $p \times p$ being the size of the circulant blocks, $d_v$ being the weight (number of set elements) of each circulant block of $\mathbf{H}$ and $\underline{m}$ being the first row of matrix $\mathbf{W}$ , that is circulant and contains the weights of the blocks of $\mathbf{Q}$ . The implementation assumes that at the start of decryption matrices $\mathbf{H}$ and $\mathbf{Q}$ , making up the secret key, are loaded into memory, and that the code to decrypt, $\underline{x}$ , is in memory as well. $\mathbf{H}$ is stored as follows: | H BASE ADDRESS + | 0 | 1 | <br>$d_v - 1$ | $d_v$ | <br>$n_0d_v-1$ | |------------------|---------|---------|-------------------|-------------|----------------------| | content | $H_0^T$ | $H_1^T$ | <br>$H_{d_v-1}^T$ | $H_{d_v}^T$ | <br>$H_{n_0d_v-1}^T$ | so that the set positions of the transpose of ${\bf H}$ are what is actually in memory. Similarly, ${\bf Q}$ is stored as: | Q BASE ADDRESS + | 0 | 1 | | $m_0 - 1$ | $m_0$ | • • • | k-1 | |------------------|-------------|-------------|-------|-----------------------|-------------------|-------|---------------| | content | $Q_{0,0}^T$ | $Q_{0,1}^T$ | • • • | $Q_{0,m_0-1}^T$ | $Q_{0,m_0}^T$ | | $Q_{0,k-1}^T$ | | Q BASE ADDRESS + | k | k+1 | | $k + m_{n_0 - 1} - 1$ | $k + m_{n_0 - 1}$ | | 2k - 1 | | content | $Q_{1,0}^T$ | $Q_{1,1}^T$ | • • • | $Q_{1,m_0-1}^T$ | $Q_{1,m_0}^T$ | | $Q_{1,k-1}^T$ | $$k = \sum_{i=0}^{n_0 - 1} m_i$$ with $\underline{m}$ here indicating the first row of $\mathbf{W}^T$ for brevity, as $\mathbf{Q}$ is also transposed. With the default parameters, this amounts to 48 16-bit words of storage (although 15 bits would suffice, if deviating from the standard of using powers of 2 is allowed). For ease of design it was assumed that the bits of the code $\underline{x}$ are accessible one by one by their index, although the design does not enforce that each bit is stored in a 1-bit memory location if a custom memory is not available. This does result in a substantial waste of space, though, and the assumption is as always that we have a custom memory in our chip: in this case this would allow us to have no waste while having access to single bits. # Chapter 4 # **Key reconstruction** The first step needed to decrypt the code is obtaining the syndrome. Since $$\underline{s}^T = \mathbf{L}\underline{x}^T \mapsto \underline{s} = \underline{x}\mathbf{L}^T$$ holds, the objective of this submodule is computing $\mathbf{L}^T$ as $$\mathbf{L} = \mathbf{H}\mathbf{Q} \mapsto \mathbf{L}^T = \mathbf{Q}^T \mathbf{H}^T$$ Handling traditional matrix multiplication, with **H** having size $27779 \times 55558$ and **Q** having size $55558 \times 55558$ at best, is out of question: such a calculation requires $8.5 \cdot 10^{13}$ multiplications and about as many sums and would take ages. The particular format of **H** and **Q**, however, allows for a very efficient implementation. ### 4.1 Circulant block multiplication A binary circulant matrix having only its first element set is the identity matrix, and multiplying any matrix by it results in the starting matrix. A binary circulant matrix having only its second element set circularly shifts all rows of the other operand right by one position, and so on. It follows that the multiplication between two binary circulant matrices, with one having a single set element in the first row, is another binary circulant matrix of the same size. We can then easily extend the result by expressing any binary circulant matrix as the sum of many having a single set element, and state that the product of any two binary circulant matrices is circulant: to ensure it is binary it is sufficient to perform all sums of partial products modulo 2. It is then possible to obtain the product of two circulant blocks ${\bf A}$ and ${\bf B}$ as follows: $$\mathbf{A} = \begin{bmatrix} a_0 & a_1 & \cdots & a_{m-1} \end{bmatrix}$$ $$\mathbf{B} = \begin{bmatrix} b_0 & b_1 & \cdots & b_{n-1} \end{bmatrix}$$ $$\mathbf{AB} = \begin{bmatrix} a_0 + b_0 & a_0 + b_1 & \cdots & a_1 + b_0 & \cdots & a_{m-1} + b_{n-1} \end{bmatrix}$$ with $a_i$ being the position of the $i^{th}$ set element of $\mathbf{A}$ and $b_j$ being the position of the $j^{th}$ set element of $\mathbf{B}$ . The result is the list of set positions in the product, with all sums being performed modulo p to take into account the rotation of set bits "out of the right margin and back into the left one". There is one more problem, however, that is the cancellation of terms: if both $\mathbf{A}$ and $\mathbf{B}$ have set positions [0,1] the product will have set positions [0,2], but this algorithm will output [0,1,1,2]. It is then necessary to eliminate duplicates that appear an even number of times: this implies the actual weight of the result is unknown. Given the added complexity of tracking weights has no real advantage in terms of memory usage, as the space reserved for each operation must be the maximum one possibly occupied by the result, it was chosen to simply fill the unused position with illegal values. The hardware implementation is as follows: - get $a_0$ and $b_0$ , then compute $a_0 + b_0$ and save the result modulo p in a temporary variable - get $b_1$ , compute $a_0+b_1$ , save the result modulo p in a temporary variable - continue until all combinations have been processed, we now have a temporary result in memory - sort the temporary result to have all set positions in order: this is done in place with an insertion sort algorithm[2] that does not require additional space in memory - go through the temporary result and copy all values that are different from the one immediately following them in the memory space for the real result: if two successive values are equal skip them both - if it was not skipped as a result of the previous value, copy the last value of the temporary result into the real result (the previous iteration requires a "next value" to compare to and can't be applied to the last element) - fill all remaining memory location assigned to the result with "invalid" flags, i.e. illegal values: in line with the software implementation, p was used as invalid flag (since the last legal position is p-1) # 4.1.1 Out-of-order result and modulo p implementation The implementation of anything having to do with division is usually very costly in terms of area and performance, either taking multiple cycles to compute a result or needing very big combinational networks to compute the result. The particular problem at hand does once again provide a way to reduce complexity, allowing for a short critical path using little area: $$a_i < p$$ $$b_j < p$$ $$a_i + b_j < 2p - 1$$ This means that there are only two possible solutions to $a_i + b_j \pmod{p}$ : either $a_i + b_j$ or $a_i + b_j - p$ . The second value is computed in the same cycle as the first one and a simple comparator then selects the proper result: this is easily done by checking the sign of $a_i + b_j - p$ , as that is the correct result if it is positive (a circuit performing the operation is shown in figure 4.1). The state machine (depicted in figure 4.2) controlling the actual operation is actually quite simple, depending mostly on parameters known at compile time: it sits in an idle state until a "start" signal is received, at which point two nested loops are performed to select all combinations of $a_i$ and $b_j$ . Values i and j are added as offsets to the base addresses of the two circulant blocks, provided by the parent module that controls the multiplication of $\mathbf{Q}^T$ by $\mathbf{H}^T$ . The address of the result $z_k$ is obtained by summing to the base address of the result block a counter k, incremented in the inner loop and reset when the block is idle. Once the result is ready in memory, the state machine raises a flag and stays in a "done" state until the "start" signal is deasserted: it then moves to an idle state and can perform another multiplication. The implementation computes one result per cycle and writes it immediately to memory, then it moves to the next value on the following cycle. If a real implementation suffers from critical path problems while trying to achieve this, since the circulant multiplication block has no feedback, it can be pipelined without side effects as long as the frequency of the memory can keep up with the frequency of the decoder itself, at which point the memory-bound nature of the problem requires reconsidering how data are stored. Storing data in multiple memories is possible, interleaving access to each of them and thus multiplying the effective maximum frequency of the decoder at the cost of more buses: this is easily done by using the least significant bits of the addresses as computed by the existing modules as inputs to a decoder sum and modulo Figure 4.1: Encryption in LEDAcrypt Figure 4.2: Multiplication between two circulant blocks for memory selection, thus cycling through all memories before coming back to the initial one. A future modification would however need a thorough investigation on the consequence of such a choice on the system at large, to ensure all modules do have sequential access to memory locations: if this assumption does not hold additional logic is required to slow down the decoder when there is the danger of multiple subsequent accesses to the same memory. ### 4.1.2 Result sorting The result as computed up to now is out-of-order, meaning that the list of positions of set bits is not increasing: this is not technically a problem in terms of end result, but efficient implementation of some operations require ensuring that the list is monotonically increasing. As such, a sorting step is needed: the insertion sort was chosen due to the algorithm simplicity (directly translating to hardware complexity and ease of implementation) and because it is an "in place" algorithm that does not require additional memory other than a temporary variable to swap adjacent values. The insertion sort is based on two nested loops, the outer one moving from the start of the list to its end and the inner one moving back until the correct position for the element pointed in the outer loop is found. The algorithm performs the following operations: • from the starting list two lists are built: the first, sorted, is initially made of the first element of the starting list; the second one is all the rest - the first element of the unordered list is compared to the last element of the ordered list - if the new element is bigger than the biggest element of the ordered list, it is appended at the end of the ordered list; if it is not, the biggest element is moved to the end of the list (now one position "right") and the new element is compared with the second biggest one - comparisons continue until the new element is bigger than the old one we are comparing it to or the beginning of the list is reached, then the new element is placed just after the one it was compared against - a new element is taken from the unordered list and the previous steps are repeated until all elements are moved to the sorted list The hardware implementation is extremely simple, consisting of a comparator and a register containing the value being inserted in the current iteration. The value from the sorted sub-list is directly taken from memory, and the smaller of the two is selected by a multiplexer and sent back to memory in the position right after the one in which the already-sorted element resides. A simple control unit takes care of selecting a new element to insert (outer loop, increasing a counter each time the previous element is inserted) and selecting the appropriate values to compare this element against (inner loop, decreasing a second counter that starts one off the current value of the first one). Future improvements to the sorting operation could come from the study of an ad-hoc algorithm tailored to the specific distribution of the out-of-order result, possibly taking advantage of the monotonic segments that are already present to implement a custom merge-sort. No additional investigation was done in this direction. ### 4.1.3 Modulo 2 on compressed matrices Given a circulant block stored in memory as defined in previous sections, any position containing a value n is present n times in the list of set positions. Due to the result being sorted, any position containing a value bigger than once will be present multiple times in adjacent positions in memory, thus allowing for a fast elimination of pairs in a single pass. The elimination of pairs results in positions appearing an odd number of times reduced to appearing only once and positions appearing an even number of times disappearing completely, thus getting a modulo 2 multiplication from the partial result over the natural numbers. The hardware implementation uses two counters to cycle through memory: the first one is used to access two adjacent memory cells to compare their content (an actual synthesis might prefer to have two separate counters offset by 1 and avoid the combinational logic needed to compute the increment), the second one points at the cell where the value is going to be copied. The algorithm is as follows: - Set i, j to 0 (i and j are offsets from the base of the list in memory) - Get the $i^{th}$ and the $(i+1)^{th}$ elements of the list, from here on a and b - If $a \neq b$ copy a in place of the j<sup>th</sup> element of the list, increase i and j; if a = b increase i twice - Repeat until *i* points either to the last element of the list or to the memory cell just after - If i points to the last element of the list copy it in place of the j<sup>th</sup> element of the list - $\bullet$ Fill the rest of the list with data recognizable as invalid, such as p ### 4.2 Circulant block sum Summing two circulant matrices stored in the format in use is simply done by concatenating the list of set positions and then taking the modulo 2 like it is done with the multiplication. A more efficient approach is however possible by merging the sorting and the modulo operation with the concatenation, in order to avoid doing these necessary steps later on. This is done with three different counters used to point an element of the first block, an element of the second and the cell where the result will be stored. The two operand positions are compared: they get discarded if they are the same (this ensures the result is modulo 2 without additional operations), otherwise the smaller one is copied in the result cell and the next position from its block is fetched for the next cycle. The precedence in copying the smaller position first results in the sum being ordered. The exact algorithm is as follows: • Set i, j, k to 0 (offsets from the base of the lists containing the set positions of the first operand, the second operand and the result) - Retrieve the positions pointed by i and j (from here on a and b) and compare them: if a = b increment i and j, if a < b copy a in the memory cell pointed by k and increment i and k, if a > b copy a in the memory cell pointed by k and increment j and k - Repeat until a and b are either invalid or all values have been processed - Fill all remaining positions of the result (if any) with invalid values ### 4.2.1 Memory movement The sum of two circulant blocks as shown above is not done in-place, as there is no way to ensure that no information that is still needed would not be overwritten by the ongoing operation. As such, it is needed to move the result from a temporary location to its final destination. The hardware performing this operation is extremely simple and uses a single counter as offset to two different base positions in memory, copying the values from the first into the second until done. ### 4.3 Quasi-cyclic multiplication The aforementioned submodules implement operations among circulant blocks, which are however not what we are intersted in per se: the objective of the key reconstruction operation is getting $\mathbf{L}^T$ , which is not a single circulant block. As such, we need to show that operations among quasi-cyclic matrices can be expressed as operations on their single circulant blocks. Given that $\mathbf{L}^T = \mathbf{Q}^T \mathbf{H}^T$ , the standard implementation of matrix multiplication would consist in computing the sum of the element-wise multiplication between the $i^{\text{th}}$ row of $\mathbf{Q}^T$ and the $j^{\text{th}}$ column of $\mathbf{H}^T$ to obtain each element $l_{i,j} \in \mathbf{L}^T$ : $$\mathbf{Q}^{T} = \begin{bmatrix} \mathbf{Q}_{0,0} & \mathbf{Q}_{0,1} & \cdots & \mathbf{Q}_{0,n_{0}-1} \\ \mathbf{Q}_{1,0} & \mathbf{Q}_{1,1} & \cdots & \mathbf{Q}_{1,n_{0}-1} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{Q}_{n_{0}-1,0} & \mathbf{Q}_{n_{0}-1,1} & \cdots & \mathbf{Q}_{n_{0}-1,n_{0}-1} \end{bmatrix} = \begin{bmatrix} q_{0,0} & q_{0,1} & \cdots & q_{0,n_{0}p-1} \\ q_{1,0} & q_{1,1} & \cdots & q_{1,n_{0}p-1} \\ \vdots & \vdots & \ddots & \vdots \\ q_{n_{0}p-1,0} & q_{n_{0}p-1,1} & \cdots & q_{n_{0}p-1,n_{0}p-1} \end{bmatrix}$$ $$\mathbf{H}^{T} = \begin{bmatrix} \mathbf{H}_{0} \\ \mathbf{H}_{1} \\ \vdots \\ \mathbf{H}_{n_{0}-1} \end{bmatrix} = \begin{bmatrix} h_{0,0} & h_{0,1} & \cdots & h_{0,p-1} \\ h_{1,0} & h_{1,1} & \cdots & h_{1,p-1} \\ \vdots & \vdots & \ddots & \vdots \\ h_{n_{0}p-1,0} & h_{n_{0}p-1,1} & \cdots & h_{n_{0}p-1,p-1} \end{bmatrix}$$ $$\mathbf{L}^{T} = \begin{bmatrix} \mathbf{L}_{0} \\ \mathbf{L}_{1} \\ \vdots \\ \mathbf{L}_{n_{0}-1} \end{bmatrix} = \begin{bmatrix} l_{0,0} & l_{0,1} & \cdots & l_{0,p-1} \\ l_{1,0} & l_{1,1} & \cdots & l_{1,p-1} \\ \vdots & \vdots & \ddots & \vdots \\ l_{n_{0}p-1,0} & l_{n_{0}p-1,1} & \cdots & l_{n_{0}p-1,p-1} \end{bmatrix}$$ $$l_{i,j} = \sum_{n=0}^{n_{0}p-1} q_{i,n} h_{n,j}$$ From the previous equation, simple algebra shows that: $$l_{i,j} = \sum_{m=0}^{n_0-1} \left( \sum_{n=mp}^{(m+1)p-1} q_{i,n} h_{n,j} \right)$$ While the latter equation is somewhat inelegant, it expresses an important property of the system at hand: it is possible to break up the sum to work with smaller vectors (in our case of size p) without affecting the final result. This is important because multiplying the circulant block $\mathbf{Q}_{x,m}$ by the circulant block $\mathbf{H}_m$ results in: $$\lambda_{i \pmod{p}, j} = \sum_{n=mp}^{(m+1)-1} (q_{i,n}h_{n,j}) \quad i \in [xp, (x+1)p - 1]$$ with $\lambda_{i \pmod{p}, j}$ being the element of $\mathbf{Q}_{x,m}\mathbf{H}_m$ in row $i \pmod{p}$ and column j. It is then possible to expand on this result with: $$\sum_{m=0}^{n_0-1} (\mathbf{Q}_{x,m} \mathbf{H}_m) = \sum_{m=0}^{n_0-1} \left( \sum_{n=mp}^{(m+1)p-1} q_{i,n} h_{n,j} \right) \quad i \in [xp, (x+1)p-1]$$ which is computing all $l_{i,j}$ : $i \in [xp, (x+1)p-1]$ in parallel, using the extremely efficient implementation allowed by the representation of circulant blocks. It can then be noted that: $$\sum_{m=0}^{n_0-1} (\mathbf{Q}_{x,m} \mathbf{H}_m) = \mathbf{L}_x$$ Iterating through $x \in [0, n_0 - 1]$ it is then possible to obtain the full $\mathbf{L}^T$ matrix only using multiplication and sum of circulant blocks and concatenating the results. The hardware implementation uses a request-based system in which each module implementing an operation over circulant blocks is inactive until Figure 4.3: Module computing $\mathbf{L}^T = \mathbf{Q}^T \mathbf{H}^T$ explicitly awoken by the control unit. Each module has moreover as inputs the base position in memory of the circulant blocks on which it will perform the operation (both operands and result, where applicable) and the maximum size of the inputs, needed because the weight of $\mathbf{Q}_{x,m}$ depends on x and m. The outputs to memory of each submodule are multiplexed by the control unit and forwarded up the hierarchy, while the "operation done" signals are used to move between states of a finite state machine without the need to know the exact duration of the operation beforehand. The state machine is as follows: - set i, j to 0 (i, j indexes of circulant blocks in $\mathbf{Q}^T$ and $\mathbf{H}^T$ ) - when instructed to start, multiply $\mathbf{Q}_{i,j}$ by $\mathbf{H}_j$ and put the result in a temporary location - sort the result in place - get the result in modulo 2 in place - if j = 0 copy the result into $\mathbf{L}_i$ , else sum the result to $\mathbf{L}_i$ and save the sum in a temporary location - copy the sum to $\mathbf{L}_i$ - repeat for all j, then set j = 0 and repeat for all i - signal that the operation is done # Chapter 5 # Vector by matrix multiplication (and vice-versa) After retrieving $\mathbf{L}^T$ , all information needed to compute the syndrome of the received code $\underline{x}$ is available. The syndrome is computed as $$s = x\mathbf{L}^T$$ The needed operation is the multiplication of a vector by a matrix, which is recurrent in the main decoding loop too: it is thus paramount to get a performant module that can be time multiplexed and that is flexible enough to be capable of handling different sizes of vectors and matrices. ### 5.1 Vector by circulant matrix As all matrices involved in the decoding operation are concatenations of circulant blocks, the most basic building block would be a module capable of multiplying a vector having length p by a circulant block or vice-versa: the proof that this is sufficient is deferred to individual sections below. For what concerns all operations in this section, it is assumed that $\underline{a}$ is a vector of length p stored in memory as a concatenation of individual values (that might or might not be binary) and $\mathbf{B}$ is a binary circulant block of weight w stored in the usual "set positions" format. Vector $\underline{c}^T$ is the result of $\underline{a}^T\mathbf{B}$ and has size p too. $$\underline{a}^T = \begin{bmatrix} a_0 & a_1 & \cdots & a_{p-1} \end{bmatrix}$$ $$\mathbf{B} = \begin{bmatrix} b_0 & b_1 & \cdots & b_{p-1} \\ b_{p-1} & b_0 & \cdots & b_{p-2} \\ \vdots & \vdots & \ddots & \vdots \\ b_1 & b_2 & \cdots & b_0 \end{bmatrix}$$ $$\underline{c}^T = \begin{bmatrix} c_0 & c_1 & \cdots & c_{p-1} \end{bmatrix}$$ It can be noted that $c_0$ is, trivially: $$c_0 = a_0 b_0 + a_1 b_{p-1} + \dots + a_{p-1} b_1$$ More interestingly, this same relation can be expressed as: $$c_0 = \sum_{n=0}^{p-1} a_n b_{p-n}$$ Similarly: $$c_1 = a_0 b_1 + a_1 b_0 + \dots + a_{p-1} b_2$$ $$c_1 = \sum_{n=0}^{p-1} a_n b_{p-n+1 \pmod{p}}$$ This is finally expanded into a single relation that states: $$c_i = \sum_{n=0}^{p-1} a_n b_{p-n+i \pmod{p}}$$ We thus have a universal analytic expression for the value of any $c_i$ given that **B** is circulant. Due to the sparsity of **B** (we remind that **B** is a block of $\mathbf{H}^T$ , $\mathbf{Q}^T$ or $\mathbf{L}^T$ ) it is possible to entirely avoid operations in which $b_m$ is not set, saving a lot of time. To further optimize the hardware implementation of the operation, due to the sum involving three operands (the accumulator, $a_n$ and $b_m$ ) while we only assumed two read ports were available, $b_m$ (actually m itself, given the way matrices are stored) is read first and stored in a register, and all operations involving that single $b_m$ are computed before moving to the next one. This is more efficient than reading $a_n$ and storing that, as p operations involve $b_m$ while only w operations involve $a_n$ : reading the operands the other way around results in p-w wasted cycles. A loop over n retrieves all $a_n$ and points the affected result $c_{m+n \pmod p}$ , then the next set m (easily found in the next position in memory) is retrieved and stored and the operation is repeated until the last m is reached, at the w<sup>th</sup> iteration. At that point the result is ready and the module signals that the operation is finished. ### 5.2 Circulant matrix by vector We now analyze the case in which $\underline{d} = \mathbf{B}\underline{a}$ : $$\underline{a} = \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_{p-1} \end{bmatrix}$$ $$\mathbf{B} = \begin{bmatrix} b_0 & b_1 & \cdots & b_{p-1} \\ b_{p-1} & b_0 & \cdots & b_{p-2} \\ \vdots & \vdots & \ddots & \vdots \\ b_1 & b_2 & \cdots & b_0 \end{bmatrix}$$ $$\underline{d} = \begin{bmatrix} d_0 \\ d_1 \\ \vdots \\ \vdots \end{bmatrix}$$ It can be trivially obtained that: $$d_0 = \sum_{n=0}^{p-1} b_n a_n$$ $$d_1 = \sum_{n=0}^{p-1} b_{n-1 \pmod{p}} a_n$$ And the result can be generalized to: $$d_i = \sum_{n=0}^{p-1} b_{n-i \pmod{p}} a_n$$ This result is very similar to what we obtained in the previous section. Indeed, we can write the two results such that: $$c_i = \sum_{n=0}^{p-1} a_n b_{(i-n) \pmod{p}}$$ $$d_i = \sum_{n=0}^{p-1} a_n b_{-(i-n) \pmod{p}}$$ The hardware that controls the two operations can thus be the same, fixing m (index of $b_m$ ) and sweeping through n to obtain i. The insertion of a simple control signal lets us select wether we want to perform $\underline{a}\mathbf{B}$ , in which case we compute the result address as $i = n + m \pmod{p}$ , or $\mathbf{B}\underline{a}$ , in which case the result address is $i = n - m \pmod{p}$ . The only modification needed to support both operations is thus using an adder-subtractor instead of a simple adder in the target address computation section: the entirety of the control finite state machine is shared. ## 5.3 $\underline{x}$ by $\mathbf{L}^T$ The starting syndrome of the code is: $$\underline{s} = \underline{x} \mathbf{L}^T$$ $$\underline{s} = \begin{bmatrix} s_0 & s_1 & \cdots & s_{p-1} \end{bmatrix}$$ We can write $\underline{x}$ as: $$\underline{x} = \begin{bmatrix} \underline{x}_0 & \underline{x}_1 & \cdots & \underline{x}_{n_0-1} \end{bmatrix} = \begin{bmatrix} x_0 & x_1 & \cdots & x_{p-1} & x_p & \cdots & x_{n_0p-1} \end{bmatrix}$$ $\underline{x}$ is thus split into $n_0$ p-length vectors, while $\mathbf{L}^T$ is by construction split in blocks already. $$\mathbf{L}^T = egin{bmatrix} \mathbf{L}_0 \ \mathbf{L}_1 \ dots \ \mathbf{L}_{n_0-1} \end{bmatrix}$$ By definition, $s_i$ is the sum of the element-wise multiplication between $\underline{x}$ and the $i^{\text{th}}$ column of $\mathbf{L}^T$ . We hereby define $\underline{s}_k$ as: $$\underline{s}_k = \underline{x}_k \mathbf{L}_k$$ Each of such $\underline{s}_k$ is thus a partial sum and we can get then $\underline{s}$ as: $$\underline{s} = \sum_{k=0}^{n_0 - 1} \underline{s}_k$$ It is thus possible to obtain $\underline{s}$ through multiplications of a vector by a circulant matrix, using the module we described in the previous section. Due to the particular implementation of the module, moreover, all multiplications behave as a "multiply and accumulate" operation, meaning there is no need to actually implement the sum thus saving area and execution time. The module performing this operation is thus simply a control unit that provides the base address in memory of the appropriate slice of $\underline{x}$ and of the proper block of $\mathbf{L}^T$ (the latter of which is somewhat complicated by the fact that different blocks of $\mathbf{L}^T$ have different weight, but is resolved with a simple look-up table). The unit then instructs the vector-by-circulant core to perform a vector by matrix multiplication, storing the result in the base address of $\underline{s}$ , and waits for the multiplication core to return to then provide new values for the base addresses of the operands. Once the $n_0^{\text{th}}$ multiplication has returned the control unit itself returns. ## 5.4 $\mathbf{H}^T$ by $\underline{s}^T$ In the main decoding loop the first operation is obtaining $\Sigma$ as: $$\left(\underline{\Sigma}^{(l)}\right)^T = \mathbf{H}^T \left(\underline{s}^{(l-1)}\right)^T$$ $$\left(\underline{\Sigma}^{(l)}\right)^T = egin{bmatrix} \sigma_0 \\ \sigma_1 \\ \vdots \\ \sigma_{n_0p-1} \end{bmatrix}$$ Given that $\mathbf{H}^T$ is: $$\mathbf{H}^T = egin{bmatrix} \mathbf{H}_0 \ \mathbf{H}_1 \ dots \ \mathbf{H}_{n_0-1} \end{bmatrix}$$ and $\underline{s}^T$ has p elements, each $\sigma_i$ can be obtained by multiplying a row of a single block by $\underline{s}^T$ . The result will then be not the sum of many terms like before, but the concatenation: this is simply done incrementing the base position of the matrix-by-vector result by p between operations. $$\left(\underline{\Sigma}^{(l)}\right)^{T} = \begin{bmatrix} \underline{\Sigma}_{0} \\ \underline{\Sigma}_{1} \\ \vdots \\ \underline{\Sigma}_{n_{0}-1} \end{bmatrix}$$ $$\underline{\Sigma}_{i} = \mathbf{H}_{i}\underline{s}^{(l-1)}$$ The hardware implementation is similar to the one used previously, but provides a new base address for the result of each multiplication instead of accumulating over the same one and instructs the multiplication core to perform a circulant-by-vector operation. # 5.5 $\mathbf{Q}^T$ by $\underline{\Sigma}^T$ After obtaining $\Sigma$ , $\underline{R}$ is obtained as: $$\underline{R}^{T(l)} = \mathbf{Q}^T \underline{\Sigma}^{T(l)}$$ $$\underline{R}^{T(l)} = \begin{bmatrix} r_0 \\ r_1 \\ \vdots \\ r_{n_0 p - 1} \end{bmatrix}$$ The operation is more complex as $\mathbf{Q}^T$ is square, thus both concatenation and sum will be needed: The implementation is more complicated than the other ones, but it keeps the same basic principle: two nested loops iterate over j and i providing the base addresses of $\underline{R}_i$ , $\mathbf{Q}_{i,j}$ and $\underline{\Sigma}_j$ , with the appropriate values for the base of circulant blocks provided by a look-up table ## 5.6 $\underline{e}$ by $\mathbf{H}^T$ The last operation in the decoding loop involves computing an increment vector for the syndrome, as $$\underline{\Delta s}^{(l)} = \underline{e}^{(l)} \mathbf{H}^T$$ Since $\underline{e}$ has the same size as $\underline{x}$ and $\mathbf{H}$ has the same size as $\mathbf{L}$ , this operation is exactly equivalent to the multiplication $\underline{s} = \underline{x}\mathbf{L}^T$ . Again, due to the multiplication really behaving as a "multiply and accumulate", the result $\underline{\Delta s}$ is directly summed to $\underline{s}$ with no overhead. # Chapter 6 # Error update Vector $\underline{R}$ contains the count of unsatisfied parity checks in which the corresponding bit of $\underline{x}$ is involved. To proceed with the algorithm, the bits which are most likely to be wrong need to be found. ### 6.1 Peak search Finding the peaks of $\underline{R}$ is done via a simple single-pass sequential algorithm, although it would be possible to parallelize the algorithm by replicating the hardware and adding logic to merge the results. Still, the time consumption of this step is sufficiently small with respect to the total required for the loop (dominated by vector-by-matrix multiplications) that this parallelization was deemed unnecessary for this experimental implementation. The hardware implementation consists of a temporary register containing the current max and an array of fixed, arbitrary size to contain the position of all values equal to the current max. While the size of the array can be changed, having it too small will impact performance and possibly the stability of the algorithm, while having it too big will result in a very high area footprint. The maximum is initialized to 0 and the array is initialized to all invalid positions, then each time a value equal to the maximum is found its position is appended to the array, if there is space left. If the array is full, no operation is performed and the algorithm continues normally. Each time a value greater than the maximum is found, the maximum is updated, the array is flushed and the first value of the array is set to the position of the new maximum. The algorithm then completes once the entirety of $\underline{R}$ has been walked through, and returns the array for usage in the next module. ### 6.2 Row extraction from compressed matrix The next step in the algorithm requires summing to the current error $\underline{e}$ the rows of $\mathbf{Q}^T$ having the index of the found maxima. As we do not have the matrix stored in a readily-available format for this operation (we only have the first row of each block, while we need individual row access), a relation between the set positions in the first row of each module and the set positions in an arbitrary row must be found. One complication is that any row $\underline{q}_k$ stretches over multiple blocks $\mathbf{Q}_{i,j}$ : $$\mathbf{Q}^T = egin{bmatrix} \mathbf{Q}_{0,0} & \mathbf{Q}_{0,1} & \cdots & \mathbf{Q}_{0,n_0-1} \ \mathbf{Q}_{1,0} & \mathbf{Q}_{1,1} & \cdots & \mathbf{Q}_{1,n_0-1} \ dots & dots & \ddots & dots \ \mathbf{Q}_{n_0-1,0} & \mathbf{Q}_{n_0-1,1} & \cdots & \mathbf{Q}_{n_0-1,n_0-1} \end{bmatrix} =$$ $$=\begin{bmatrix} q_{0,0} & q_{0,1} & \cdots & q_{0,p-1} & q_{0,p} & \cdots & q_{0,n_0p-1} \\ q_{1,0} & q_{1,1} & \cdots & q_{1,p-1} & q_{1,p} & \cdots & q_{1,n_0p-1} \\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ q_{p-1,0} & q_{p-1,1} & \cdots & q_{p-1,p-1} & q_{p-1,p} & \cdots & q_{p-1,n_0p-1} \\ q_{p,0} & q_{p,1} & \cdots & q_{p,p-1} & q_{p,p} & \cdots & q_{p,n_0p-1} \\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ q_{n_0p-1,0} & q_{n_0p-1,1} & \cdots & q_{n_0p-1,p-1} & q_{n_0p-1,p} & \cdots & q_{n_0p-1,n_0p-1} \end{bmatrix}$$ We can get the $l^{th}$ row of a circulant block **A**: $$\mathbf{A} \begin{bmatrix} a_0 & a_1 & \cdots & a_{p-1} \\ a_{p-1} & a_0 & \cdots & a_{p-2} \\ \vdots & \vdots & \ddots & \vdots \\ a_1 & a_2 & \cdots & a_0 \end{bmatrix}$$ $$\underline{a}_l^T = \begin{bmatrix} a_{p-l \pmod{p}} & a_{p-l+1 \pmod{p}} & a_{p-l+2 \pmod{p}} & \cdots & a_{2p-l-1} \pmod{p} \end{bmatrix}$$ This means it is possible to get any $\underline{q}_k$ as concatenation of rows of the appropriate blocks. The blocks involved are all blocks $\mathbf{Q}_{i,j}$ with i = floor(k/p), while l is obtained as $l = i \pmod{p}$ . ## 6.3 Vector plus compressed row Due to the blocks $\mathbf{Q}_{j,k}$ being stored in compressed format and the sparsity of the rows, it is convenient to handle $\underline{q}_k$ in $n_0$ chunks of length p and maintain the compressed format on the result, in order to have a list of set positions matching the positions that will need to be flipped in the corresponding chunks of $\underline{e}$ . This is easily done reading all l corresponding to set bits in the first row of the block and applying the operation we described in the previous section, with the result being the list of set positions for $\underline{q}_k$ . The actual sum consists in computing i and l in order to get the affected row of blocks and individual row offset, then iterating through the row of blocks one at a time to compute the list of set bits and flipping the corresponding bits in $\underline{e}$ . Once done with a row, the next k is fetched from the list of rows to be summed and the operation is repeated until either all rows have been summed or k is invalid, indicating that the number of peaks in $\underline{R}$ was smaller than the maximum supported by the decoder. The operation is done incrementally in place, so that no additional memory is needed to store intermediate results. # Chapter 7 # Main loop state machine The entirety of the decoder is controlled by the top-level state machine, calling the various functions as they are needed according to the algorithm. This state machine is the bus arbitrator that multiplexes the RAM signals coming from the various blocks and forwards them to the external pins. The state machine performs the following operations: - wait for the "start" signal - ask the key reconstruction module to retrieve $\mathbf{L}^T$ - ask the module performing the $\underline{x}\mathbf{L}^T$ operation to compute the syndrome s - $\bullet$ ask the module performing the $\mathbf{H}^T\underline{s}^T$ operation to compute $\underline{\Sigma}^T$ - ask the module performing the $\mathbf{Q}^T \underline{\Sigma}^T$ operation to compute $\underline{R}^T$ - $\bullet$ ask the peak finder module to compute the positions of the maxima of R - $\bullet$ ask the sum module to add all the rows of $\mathbf{Q}^T$ corresponding to the found maxima to $\underline{e}$ - ask the module performing the $\underline{e}\mathbf{H}^T$ operation to update $\underline{s}$ - clear all temporary results, including $\Sigma$ and R - check whether $\underline{s}$ is null or the iteration limit was reached: if one of the conditions holds compute the message, otherwise loop back to the step that computes $\underline{\Sigma}$ and increase the iteration counter • return the "done" signal; in case the iteration limit was reached, report a decoding failure As multiplication operations are implemented as "multiply and accumulate", results from previous iterations would add up continuously. While this is desirable for certain vectors ( $\underline{s}$ and $\underline{e}$ ), it is disruptive for all the others: as such a memory clear step is performed between iterations zeroing the memory sections containing $\underline{\Sigma}$ and $\underline{R}$ ### 7.1 Design modularity and shared resources Deep emphasis was put in the reutilization of the same basic blocks over and over in orther to save resources, wherever this was possible. The only common blocks that could potentially be shared and were not arranged to be are the mod p operators: this was a deliberate choice to ease code development and readability, while the module itself is reasonably simple so that the area overhead is not too high. In terms of actual implementation, this maps to more raw silicon needed for the gates but less routing and no multiplexing. Modularity was achieved through a "function call" architecture that was devised to make each module accept any data that fit with the template, that being either a vector of length p or a circulant block and its weight. All vectors and blocks are passed by reference as pointers to memory, so that actual data transfer between block is minimals. The algorithm implementing the decoder is not suited to pipelining, but this very drawback is what allows for the resource sharing as all operations performed at different points of the algorithm are ensured not to be called concurrently. Figure 7.1: Decryption module # Chapter 8 ## Conclusions The proposed implementation is but a first step in the study of the feasibility on silicon of cryptographic QC-LDPC codes. While all operations needed as basic blocks to decode the input are simple, well known and efficient, the architecture at large is very much experimental and might present severe bottlenecks in high-frequency operation, especially on the side of data transfers to memory which are paramount to the decoder. Future improvements are likely to come in the form of an additional memory management layer, translating requests to a complex memory structure able to maximize the data rate. This could be done in much the same way as was devised for hard disks with the RAID architecture[9], with multiple separated memory units having independent access that would thus be able to transfer, albeit at slow speed for each transaction, a massive amount of data per cycle. Additionally, while having memory internal to the decoding unit itself would be unfeasible for vectors (all of which have length at least p), the cost of storing internally the compressed $\mathbf{H}^T$ , $\mathbf{Q}^T$ and $\mathbf{L}^T$ matrices is quite low: this makes it possible to have a fast portion of memory that is expected to be accessed in a single cycle even at very high frequencies. Minor improvements in terms of resource sharing can be gained by unifying the control finite state machines performing the multiplications $\underline{s} = \underline{x} \mathbf{L}^T$ and $\underline{\Delta s} = \underline{e} \mathbf{H}^T$ and possibly sharing a single modulo p computation unit. In terms of actual algorithm parallelization and assuming the problem of the memory bottleneck as completely solved, the computation of the unordered result $\mathbf{L}^T = \mathbf{Q}^T \mathbf{H}^T$ can be performed in parallel by simple replication of the processing unit, with the limit being computing all its element in one single pass. Investing bigger area then currently allotted it would also be possible to use faster sorting algorithms like the merge sort[10], while elimination of adjacent doubles from a list to implement the modulo 2 and the sum of matrices could be done with a two-cycle operation operating first on even-odd pairs and then on odd-even ones. Still, all of this only results in speeding up the key reconstruction which happens only once. Parallelizing operations involving vectors is more challenging, due to the sheer size of the vectors themselves. Throughout chapter 5 it was shown that all operations on vectors can be reduced to operations on length p vectors, but p is very big nonetheless. Multiplications with circulant blocks are in essence circular shifts and sums: a system to implement shifts over sections of a vector (as opposed to the entirety of it) can be obtained by simply having multiple units performing the operation. Peak finding can be carried out on segments and the results merged. The row sum operation can be carried out in parallel for each row, although the benefit of doing so is likely minimal. # Appendix A ## Source code ### A.1 Key reconstruction #### Circulant multiplication ``` Flavio Tanese Author: -- Politecnico di Torino 2018 -- Multiply two sparse circulant binary matrices stored in a memory as the — positions of set bits in their first row, and return a "tentative" result - (which is unordered and not simplified modulo 2) in another location in — memory. - Controlling circuitry should take care of keeping all inputs in the - "function arguments" section constant until the module reported back, -- and of ensuring validity of such inputs (no overlapping memory ranges and -- The "start_i" signal should be kept high until "mult_done_o" goes high, — implementing a rudimentary handshake, but this is not strictly required if 15 - the control circuitry operates on the same clock. 16 17 library IEEE; 18 use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; 19 20 library work; 21 use work.system_params.all; 22 use work.matrix_types.all; 23 use work.matrix_mult_functions.all; 24 25 entity circulant_multiplication is 26 port ( 27 - control signals 28 clk_i: in std_logic; in std_logic; 29 rst_n_i: 30 start_i: in std_logic; 31 - function arguments in (not latched) -- number of elements in 1st matrix 32 a_limit_i: in natural; -- base address of 1st matrix 33 a_base_i: in addr; in natural; in addr; 34 b_limit_i: -- number of elements in 2nd matrix 35 -- base address of 2nd matrix b_base_i: z_base_i: in addr; -- base address of result matrix ``` ``` 37 -- data, addresses and controls to memory 38 a_i: in pos; -- operand from 1st matrix -- address for 1st matrix 39 a_addr_o: out addr; in pos; b_i: -- operand from 2nd matrix 40 out addr; -- address for 2nd matrix 41 b_addr_o: -- Ltr result 42 z_{-0}: out pos; z_addr_o: -- address for Ltr result out addr; 43 44 wr_o: out std_logic; -- write enable for memory 45 - report back once done -- number of elements in result matrix 46 z_limit_o: out natural; 47 mult\_done\_o: -- high when done out std_logic 48 end entity circulant_multiplication; 49 50 architecture rtl of circulant_multiplication is 51 52 53 - assume inputs are kept constant by higher level state machine so we do 54 -- not need to sample them on start 55 56 a_index: natural: signal 57 signal b_index: natural: 58 signal z_index: natural; 59 next_a_index: natural: signal 60 signal next_b_index: natural; 61 signal next_z_index: natural; 62 type state_t is (IDLE, BUSY, DONE); 63 64 signal state: state t: 65 signal next_state: state_t: 66 begin 67 68 <= a_i + b_i when a_i + b_i < P else 69 Z_O 70 (a_i + b_i) \mod P; 71 72 \leq a_base_i + a_index; a addr o b_addr_o 73 <= b_base_i + b_index;</pre> 74 z_addr_o z_base_i + z_index; 75 -- count number of elements of result matrix, this can be done with a 76 77 -- multiplier but we are not in a hurry and do not want a big footprint. 78 -- Using z_index we get that for free! 79 z_limit_o \ll z_index; 80 81 state_comb: process( 82 state, start_i, a_limit_i, b_limit_i, next_a_index, next_b_index 83 84 begin 85 case state is 86 when IDLE => if start_i = '1' then 87 88 next_state <= BUSY; 89 else 90 next_state <= IDLE; end if; 91 92 when BUSY => 93 - exit this state only once all combinations have been done if b_{index} \neq b_{limit_i} or a_{index} \neq a_{limit_i} then 94 95 next_state <= BUSY; 96 else 97 next_state <= DONE; end if; ``` ``` when DONE => 99 if start_i = '0' then 100 101 next_state <= IDLE; 102 next_state <= DONE; 103 104 end if; 105 106 next\_state \le IDLE; 107 end case; 108 end process state_comb; 109 state\_seq: \ \, \underline{process} \, \big( \, rst\_n\_i \,\, , \ \, clk\_i \, \big) 110 111 begin 112 if rst_n_i = '0' then state <= IDLE; 113 114 elsif rising_edge(clk_i) then 115 state <= next_state; 116 end if; 117 end process state_seq; 118 119 output_comb: process( 120 state, a_index, b_index, z_index 121 122 begin 123 case state is when IDLE => 124 125 next\_a\_index <= 0; 126 next_b_index = 0; 127 next_z_index <= 0; 128 'o'; wr_{-o} <= 129 mult\_done\_o ,0, <= 130 when BUSY => if a_index < a_limit_i then 131 next_a_index <= a_index + 1; next_b_index <= b_index; 132 133 134 else next_a_index <= 0; next_b_index <= b_index + 1; 135 136 end if; 137 138 n\,e\,x\,t\,{}_{\scriptscriptstyle -}z\,{}_{\scriptscriptstyle -}i\,n\,d\,e\,x <= z_index + 1; 139 wr_{-o} \leq = '1'; 'o'; 140 mult\_done\_o \leq= 141 when DONE => 142 next_a_index <= a_index; 143 next_b_index \leq = b_index: 144 <= z_i n d e x; \verb"next_z-index" <= '0'; <= '1'; 145 wr_{-0} 146 mult\_done\_o 147 when others => - behave like IDLE <= 0; 148 next_a_index 149 next_b_index \leq= 0; 150 next_z_index = 0; <= '0'; <= '0'; 151 wr_{-0} 152 mult\_done\_o 153 end case; 154 end process output_comb; 155 156 output_seq: process(clk_i, rst_n_i) 157 begin if rst_n_i = '0' then 158 a_index <= 0; b_index <= 0; 159 160 ``` ``` 161 z_index <= 0; elsif \ rising\_edge(clk\_i) \ then 162 163 a_index 164 b index 165 z_index <= next_z_index; 166 end if; 167 end process output_seq; 168 169 end architecture rtl; ``` #### A.1.1 Sorting ``` library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; library work; use work.system_params.all; use work.matrix_types.all; 7 use work.matrix_mult_functions.all; 8 entity sort is 10 port ( 11 - control logic 12 clk_i: in std_logic; in std_logic; in std_logic; 13 rst_n_i: 14 start_i: -- function arguments in (not latched) 15 number of elements in array to sort base addr of array to sort limit_i: in natural; 16 17 in addr; base_i: 18 -- data, addresses and controls to memory 19 a_i: in pos; -- element A from array 20 a_addr_o: out addr; -- address of element A -- element B from array 21 b_i: in pos; 22 b_addr_o: out addr; -- address of element B -- element Z to array 23 z_o: out pos; 24 z_addr_o: -- address of element Z out addr; 25 out std_logic; -- write enable for memory 26 -- report back once done 27 sort_done_o: out std_logic -- high when done 28 ); 29 end entity sort; 30 31 architecture rtl of sort is 32 33 - element A will be overwritten in the array, so we need to sample it in -- order to insert it in the right place later on 34 35 signal tmp: pos; 36 signal next_tmp: pos; 37 38 -- iterators signal i: signal j: 39 natural; 40 natural: 41 signal next_i: natural; 42 signal next_j: natural; 43 44 - state machine 45 type state_t is ( IDLE, PREPLINER, CHECKLINNER, LOOPLINNER, 46 47 LAST_INNER, SAVE_0, SAVE_1, DONE ``` ``` 48 49 signal state: state_t; 50 signal next_state: state_t; 51 52 begin 53 54 b\_addr\_o \ <= \ base\_i \ + \ to\_unsigned (j \ , \ address\_bits); 55 56 state_comb: process(state, start_i, next_j, next_i, a_i, b_i) 57 begin 58 case state is when IDLE => 59 if start_i = '1' then 60 next_state <= PREP_INNER; 61 62 else 63 next_state <= IDLE; end if; 64 65 when PREP_INNER => 66 next_state <= CHECK_INNER; when CHECK_INNER => 67 68 if j \neq 0 and b_i > tmp then 69 next_state <= LOOP_INNER; elsif j = 0 and b_i > tmp then 70 71 next_state <= LAST_INNER; 72 73 next_state \le SAVE_1; 74 end if; 75 76 77 when LOOP_INNER => if next_j \neq 0 and a_i > tmp then next_state <= LOOP_INNER; 78 elsif next_j = 0 and a_i > tmp then 79 next_state <= LAST_INNER; 80 81 next\_state \le SAVE_1; 82 end if; when LAST_INNER => 83 84 next_state \le SAVE_0; 85 when SAVE_0 => if i /= limit_i - 1 then 86 next_state <= PREP_INNER; 87 88 n ext_state \le DONE; 89 90 end if; 91 when SAVE_1 => 92 if i /= limit_i - 1 then 93 next_state <= PREP_INNER; 94 else 95 n ext_state \le DONE; 96 end if; 97 when DONE => if start_i = '0' then 98 next_state <= IDLE; 99 100 else 101 next_state <= DONE; end if; 102 103 when others => 104 next_state <= IDLE; 105 end case: 106 end process state_comb; 107 108 state_seq: process(rst_n_i, clk_i) 109 begin ``` ``` if rst_n_i = '0' then 110 111 state \quad <= \quad next\_state \; ; 112 elsif rising_edge(clk_i) then 113 state <= next_state; 114 end if; 115 end process state_seq; 116 117 output_comb: process(state, a_i, b_i, tmp, i, j) 118 begin 119 case state is 120 when IDLE => 121 -- internal signals 122 next\_tmp <= tmp; 123 next_i <= 1; 124 0; next_{-j} <= 125 -- outputs 126 a_addr_o base_i + to_unsigned(i, address_bits); <= 127 z_addr_o <= base_i + to\_unsigned(j + 1, address\_bits); 128 Z_O \leq = ,<sub>0</sub>,; 129 wr o <= 0; 130 sort_done_o <= 131 when PREP_INNER => 132 -- internal signals 133 next_tmp <=\quad a_-i\ ; 134 n\,e\,x\,t\,\_i <= i; i - 1; 135 n ext_{-j} <= 136 -- outputs a_addr_o base\_i \ + \ to\_unsigned (i \ , \ address\_bits ); 137 \leq = base\_i \ + \ to\_unsigned (j \ + \ 1, \ address\_bits); 138 z_addr_o <= 139 Z - O <= b_i; '0'; 140 wr_{-0} <= 141 sort_done_o <= 0; when CHECK_INNER => 142 143 - internal signals 144 next\_tmp \leq= tmp; 145 <= next i i; 146 n e x t_{-j} <= j ; 147 -- outputs 148 a_addr_o base_i + to_unsigned(i, address_bits); <= 149 z_addr_o <= base_i + to_unsigned(j + 1, address_bits); 150 Z_O <= b_i; wr_{-0} <= '0'; 152 sort_done_o <= '0'; 153 when LOOP_INNER => 154 -- internal signals 155 next\_tmp <= tmp; i ; 156 next_i <= 157 n e x t_{-j} <= j - 1; 158 outputs a\_addr\_o 159 <= base_i + to_unsigned(j - 1, address_bits); 160 z_addr_o base_i + to_unsigned(j + 1, address_bits); b_i; 161 \mathbf{z} <= 162 wr_o <= '1'; 163 '0'; sort_done_o <= when LAST_INNER => 164 165 -- internal signals 166 next_tmp <= tmp; next_i <= i ; next_{-j} 168 j; 169 -- outputs base_i + to_unsigned(i, address_bits); a_addr_o 170 <= 171 z_addr_o <= base_i + to_unsigned(j + 1, address_bits);</pre> ``` ``` 172 \leq = b_i; Z_O '1'; 173 wr_-o <= 174 sort_done_o <= 0; when SAVE_0 => 175 176 -- internal signals 177 next_tmp <= tmp; 178 next_i <= i + 1; n\,e\,x\,t_{\,-}j 179 j ; 180 -- outputs base_i + to_unsigned(i, address_bits); 181 a_addr_o <= base_i + to_unsigned(j, address_bits); 182 z_addr_o <= 183 <= \operatorname{tmp}; \mathbf{z} = \mathbf{0} 184 wr_{-o} <= '1'; 0; 185 sort_done_o <= 186 when SAVE_1 \Rightarrow 187 -- internal signals 188 <= \quad \operatorname{tmp}\,; next\_tmp 189 n\,e\,x\,t\,{}_-i <= i + 1; 190 next_{-j} <= j ; 191 --outputs base_i + to_unsigned(i, address_bits); 192 a_addr_o \leq = 193 z_addr_o <= base_i + to_unsigned(j + 1, address_bits); 194 Z_{-}O \leq = tmp; 195 wr_{-0} <= '1'; 196 sort_done_o <= '0'; 197 when DONE \Rightarrow 198 -- internal signals 199 next_tmp <= \quad \operatorname{tmp}\,; 200 next_i \leq = i; 201 next_{-j} <= j ; 202 -- outputs base_i + to_unsigned(i, address_bits); 203 a_addr_o 204 z_addr_o <= base_i + to\_unsigned(j + 1, address\_bits); 205 Z - O <= \operatorname{tmp}; 206 wr_o <= 0; '1'; 207 \verb|sort_done_o| <= 208 when others => -- behave like IDLE 209 - internal signals 210 next_tmp <= tmp; 211 n\,e\,x\,t\,\_i \leq= 1; 212 0; next_i <= 213 -- outputs 214 a_addr_o base_i + to_unsigned(i, address_bits); base\_i + to\_unsigned(j + 1, address\_bits); 215 z_addr_o <= 216 Z_{-}O \leq = b_i; 217 ,<sub>0</sub>; wr_{-0} <= 218 ,<sub>0</sub>,; sort\_done\_o <= 219 end case; 220 end process output_comb; 221 222 output_seq: process(rst_n_i, clk_i) 223 begin if rst_n_i = 0, then 224 225 <= (others => '0'); _{\mathrm{tmp}} <= 0; 226 i 227 <= 0; 228 elsif rising_edge(clk_i) then 229 tmp <= next_tmp; 230 i = next_i; 231 <= \quad n \, e \, x \, t \, \_j \ ; \quad end \quad if \ ; 232 end process output_seq; ``` ``` 234 end architecture rtl; ``` #### A.1.2 Circulant sum ``` library IEEE; use IEEE.std_logic_1164.all; 3 use IEEE.numeric_std.all; library work; 4 use work.system_params.all; 6 use work.matrix_types.all; use work.matrix_mult_functions.all; 9 entity circulant_sum is 10 port ( 11 control signals 12 in std_logic; clk_i: 13 in std_logic; rst_n_i: 14 start_i: in std_logic; -- function arguments in (not latched) 15 -- number of elements in 1st matrix 16 a_limit_i: in natural; -- base address of 1st matrix 17 a_base_i: in addr; 18 b_limit_i: i\, n natural; -- number of elements in 2nd matrix 19 b_base_i: in addr; -- base address of 2nd matrix 20 in addr; z_base_i: -- base address of result matrix 21 - data, addresses and controls to memory 22 a_i: in pos; 23 a_addr_o: out addr; 24 b_i: in pos; 25 b_addr_o: out addr; 26 z_o: out pos; 27 z_addr_o: out addr; 28 out std_logic; -- write enable for memory wr_{-o}: 29 - report back once done 30 z_limit_o: out natural; - number of remaining elements (minus one) 31 sum_done_o: out std_logic 32 33 end entity circulant_sum; 34 35 architecture rtl of circulant_sum is 36 to_unsigned(P, position_bits); 37 constant INVALID_POS: pos:= 38 39 natural: signal i: 40 signal j: natural; natural; 41 signal k : 42 signal n e x t_i: natural; 43 signal next_{-j}: natural; 44 signal next_k: natural; 45 46 i_not_done: std_logic; signal 47 signal j_not_done: std_logic; 48 signal neither_done: std_logic; 49 type state_t is (IDLE, COMPARE, SKIP, COPY_J, COPY_J, FILL_INVALID, DONE); 50 51 signal state: state_t; 52 signal next_state: state_t; 53 54 begin ``` ``` 55 <= '1' when i < a_limit_i and a_i /= INVALID_POS else 56 i\_not\_done 57 0; '1' when j < b_limit_i and b_i /= INVALID_POS else 58 i_not_done 0; 59 60 neither_done <= i_not_done and j_not_done;</pre> 61 62 a_addr_o = a_b a s e_i + i; 63 b_addr_o <= b_base_i + j; 64 z_addr_o <= z_base_i + k; 65 66 z_limit_o \leq = k; 67 68 state_comb: process( state\;,\;\; start\_i\;,\;\; i\_not\_done\;,\;\; j\_not\_done\;,\;\; neither\_done\;, 69 70 k, a_i, b_i 71 72 begin 73 case state is 74 when IDLE => 75 if start_i = '1' then 76 77 next_state <= COMPARE; 78 next\_state \le IDLE; 79 end if; \frac{\text{when COMPARE}}{} => 80 81 if neither_done = '1' and a_i = b_i then next_state <= SKIP; elsif neither_done = '1' and a_i < b_i then</pre> 82 83 next_state <= COPY_I; 84 elsif neither_done = '1' and a_i > b_i then 85 \mbox{next\_state} \quad <= \quad \mbox{COPY\_J} \, ; 86 elsif i_not_done = '1' and not j_not_done = '1' then 87 \begin{array}{lll} \texttt{next\_state} & <= & \texttt{COPY\_I}; \end{array} 88 elsif j_not_done = '1' and i_not_done = '0' then 89 next_state <= COPY_J; 90 elsif i_not_done = '0' and j_not_done = '0' -- cont. 91 92 and k < sum(M)*DV then 93 \label{eq:next_state} \begin{array}{ll} \texttt{next\_state} & <= & \texttt{FILL\_INVALID} \,; \end{array} 94 else 95 next_state <= DONE; end if; 96 97 when SKIP => 98 next_state <= COMPARE; 99 when COPY_I => next\_state <= COMPARE; 100 101 when COPY_J => 102 next_state <= COMPARE; when FILL_INVALID => 103 \mbox{next\_state} \quad <= \quad \mbox{COMPARE}; 105 when DONE => if start_i = '0' then 106 n ext_state \le IDLE; 107 108 else n ext_state <= DONE; 109 end if; 110 111 when others => \operatorname{next\_state} \ <= \ \operatorname{IDLE}; 112 113 end case; 114 end process state_comb; 115 116 state_seq: process(clk_i, rst_n_i) ``` ``` 117 begin if rst_n_i = '0' then 118 119 state <= IDLE; elsif rising_edge(clk_i) then 120 121 state \le next\_state; end if; 122 123 end process state_seq; 124 125 output_comb: process(state, i, j, k, a_i, b_i) 126 begin 127 case state is 128 when IDLE => -- internal signals 129 130 next_i = 0; n\,e\,x\,t\,{}_{\scriptscriptstyle -}j <= 0; 131 132 n\,e\,x\,t\,{}_{\text{-}}k <= 0; 133 -- outputs 134 Z = O <= INVALID_POS; <= '0'; <= '0'; 135 wr_{-o} sum_done_o <= 136 when COMPARE => 137 138 -- internal signals 139 n e x t_i <= i; 140 n e x t_{-j} <= j; 141 next_k <= k; 142 -- outputs 143 <= INVALID_POS; wr_o <= '0'; sum_done_o <= '0'; 144 145 146 when SKIP => -- internal signals 147 <= i + 1; \\ <= j + 1; 148 n\,e\,x\,t\,{}_{\scriptscriptstyle -}i 149 next_{-j} 150 next_k = k; 151 -- outputs <= INVALID_POS; 152 \mathbf{z} <=~~,0~; 153 wr_{-o} 154 sum_done_o <= '0'; when COPY_I => 155 156 -- internal signals 157 next_i <= i + 1; <= \quad j \ ; 158 n e x t_{-j} 159 n\,e\,x\,t\,{}_{\scriptscriptstyle\perp}k <= k + 1; -- outputs 160 161 Z_{-}O \langle = a_i; 162 <= '1'; wr_{-o} 163 sum_done_o <= '0'; 164 when COPY_J => 165 -- internal signals 166 <= i; <= j + 1; next_i 167 n e x t_{-j} 168 <= k + 1; next k 169 -- outputs 170 Z_{-}O <= b_i; <= '1'; 171 wr_o sum_done_o <= '0'; 172 173 when FILL_INVALID => 174 -- internal signals 175 n\,e\,x\,t\,\lrcorner\,i \qquad \quad <= \quad i\ ; <= j; <= k + 1; 176 next_i 177 next_k -- outputs ``` ``` <= INVALID_POS; 179 Z_O 180 wr_{-0} <= '1'; 181 sum\_done\_o <= '0'; when DONE => 182 183 -- internal signals <= i; <= j; 184 next_i 185 next_i 186 next_k <= k + 1; 187 -- outputs <= INVALID_POS; 188 Z_{-}O 189 wr_{-0} <= '0'; '1'; 190 sum_done_o <= 191 when others => -- internal signals 192 193 next_i = 0; 194 n e x t_{-j} \leq= 0; <= 0; 195 next_k 196 -- outputs 197 <= INVALID_POS; ,0; ,0; 198 <= wr o 199 sum_done_o <= 200 end case; 201 end process output_comb; 202 203 output_seq: process(clk_i, rst_n_i) 204 if rst_n_i = '0' then 205 206 i 207 208 209 elsif rising_edge(clk_i) then 210 211 212 k \leq next_k; 213 end if; 214 end process output_seq; 215 216 end architecture rtl; ``` ### A.1.3 Memory copy ``` library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; library work; 5 use work.matrix_types.all; 6 entity mem_copy is port ( 9 - control signals in std_logic; in std_logic; 10 clk_i: 11 rst_n_i: 12 in std_logic; start_i: -- function arguments in (not latched) 13 number of elements in source arraybase addr of source array limit_i: 14 in natural; in addr; 15 a_base_i: 16 in addr; -- base addr of destination array z_base_i: 17 -- data, addresses and controls to memory in pos; 18 -- element A from source array ``` ``` 19 a_addr_o: out addr; -- address of element A 20 -- element Z to destination array z_{-o}: out pos; 21 z_addr_o: out addr; -\!\!- address of element Z out std_logic; -- write enable for memory 22 wr_o: 23 -- report back once done 24 copy_done_o: out std_logic -- high when done 25 26 end entity mem_copy; 27 28 architecture rtl of mem_copy is 29 signal i: signal next_i: 30 natural; 31 natural; 32 type state_t is (IDLE, COPY, DONE); 33 signal state: state_t; signal next_state: state_t; 34 35 36 37 begin 38 <=\ a_{-}b\,a\,s\,e_{-}i\ +\ i\ ; a_addr_o 39 <= z_base_i + i; <= a_i; 40 z_addr_o 41 Z_{-}O 42 43 state_comb: process(state, start_i, i, limit_i) 44 begin 45 case state is 46 when IDLE => if start_i = '1' then 47 48 next_state <= COPY; 49 else 50 \mathtt{next\_state} \ \ <= \ \ \mathrm{IDLE}\,; end if; 51 when COPY => 52 53 if i = limit_i - 1 then 54 next_state <= DONE; 55 56 next_state <= COPY; end if; 57 58 when DONE => if start_i = '0' then 59 next_state <= IDLE; 60 61 next_state <= DONE; 62 end if; 63 when others => 64 65 next_state <= IDLE; 66 end case; 67 end process state_comb; 68 69 state_seq: process(clk_i, rst_n_i) 70 begin 71 if rst_n_i = '0' then 72 state <= IDLE; \begin{array}{ll} \textbf{elsif} & \textbf{rising\_edge} \, (\, \textbf{clk\_i} \,) & \textbf{then} \end{array} 73 74 state \le next\_state; 75 end if; 76 77 end process state_seq; 78 output_comb: process(state, i) 79 begin case state is ``` ``` when IDLE => 81 82 -- internal signals 83 n e x t_i = 0; 84 -- outputs <= '0'; 85 86 copy_done_o <= '0'; 87 when COPY => 88 -- internal signals 89 next_i = + 1; 90 -- outputs <= '1'; 91 wr_{-o} 92 copy_done_o <= '0'; 93 when DONE \Rightarrow 94 -- internal signals 95 n ext_i = 0; 96 -- outputs 97 wr_o 98 \verb"copy-done-o" <= "1"; 99 when others => 100 -- internal signals n e x t_{-i} <= i; 101 102 -- outputs <= '0'; 103 wr_{-0} 104 copy_done_o <= '0'; 105 end case; 106 end process output_comb; 107 108 {\tt output\_seq:} \ \ \, {\tt process} \, (\, {\tt clk\_i} \,\, , \,\, \, {\tt rst\_n\_i} \, ) 109 110 if rst_n_i = '0' then i <= 0; 111 elsif rising_edge(clk_i) then 112 113 i <= next_i; end if; 114 115 end process output_seq; 116 end architecture rtl; ``` ## A.2 Vector by matrix ### A.2.1 Vector by circulant ``` library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; library work; use work.system_params.all; 6 use work.matrix_types.all; entity vector_by_circulant is 9 port ( -- control signals 10 in std_logic; clk_i: 11 in std_logic; 12 rst_n_i: in std_logic; in std_logic; 13 start_i: 14 binary_i: -- function arguments in (not latched) ``` ``` 16 a_base_i: in addr; 17 b_base_i: in addr; 18 b_weight: in natural; 19 z_base_i: in addr: std_logic; -- '0' -> z=aB; '1' -> z=Ba 20 in op: 21 -- data, addresses and controls to memory 22 a_i: in pos; 23 a_addr_o: out addr; 24 b_i: in pos; 25 b_addr_o: out addr; 26 z_o: out pos; 27 z_{\,-}a\,d\,d\,r_{\,-}o: out addr; 28 wr_o: out std_logic; 29 done_o: out std_logic 30 31 end entity vector_by_circulant; 32 33 architecture arch of vector_by_circulant is 34 type state_t is (IDLE, GET_M, INNER_LOOP, DONE); 35 36 signal state: state_t; 37 signal next_state: state_t; 38 39 signal m_index_c, m_index_r: natural; 40 signal m\_c , \ m\_r : pos; 41 signal n_-c , n_-r : addr; 42 signal i_c , i_r: addr; 43 44 begin 45 46 comb: process(state, start_i, binary_i, 47 m\_index\_r\;,\;\; m\_c\;,\;\; m\_r\;,\;\; n\_c\;,\;\; n\_r\;,\;\; i\_c\;,\;\; i\_r\;, op, b_weight, a_base_i, b_base_i, z_base_i, 48 49 a_i , b_i) 50 begin 51 state; next state <= 52 wr_{-o} <= 'o'; 0; 53 done_o <= 54 m_index_c m\_index\_r\;; <= to\_unsigned (0\,,\ address\_bits\,);\\ 55 a_addr_o \leq= 56 b_addr_o \leq= to_unsigned(0, address_bits); 57 to_unsigned(0, position_bits); Z - O \leq = 58 z_addr_o to_unsigned(0, address_bits); <= 59 <= m c m_r; 60 n_c <= n_r; 61 i_c <= i_r; case state is 62 63 when IDLE => 64 m_index_c <= 0; 65 m_{-}c <= \quad to\_unsigned \, (\, 0 \, , \ position\_bits \, ) \, ; <= to_unsigned(0, address_bits); <= to_unsigned(0, address_bits);</pre> 66 n_{-}c 67 i c if start_i = '1' then 68 69 next_state <= GET.M; end if; 70 71 when GET.M => 72 m_index_c m_index_r + 1; \leq = 73 74 b_addr_o <= b_base_i + m_index_r; <= b_i; 75 76 n_c <= to_unsigned(0, address_bits); if op = '0' then i_c <= to_unsigned(to_integer(m_c), address_bits);</pre> ``` ``` 78 else 79 <= \quad to\_unsigned \, (\,\,to\_integer \, (P-m\_c-1) \,, \quad -- \quad cont \,. i _ c 80 address_bits); end if; 81 if b_i = P then 82 83 next_state <= DONE; 84 85 next_state <= INNER_LOOP; 86 end if; 87 when INNER_LOOP => 88 wr_o <= '1'; n_c <= n_r + 1; if i_r = P - 1 then 89 90 91 i_c <= to_unsigned(0, address_bits);</pre> 92 else 93 i_c <= i_r + 1; 94 end if; 95 a_addr_o <=\ a_{-}b\,a\,s\,e_{-}i\ +\ n_{-}c\;; b_addr_o <= z_base_i + i_c; z_addr_o <= z_base_i + i_c; if binary_i = '0' then 96 97 98 99 Z_{-}O = a_i + b_i; 100 else 101 = (0 \Rightarrow a_i(0) \text{ xor } b_i(0), \text{ others } \Rightarrow '0'); 102 end if; 103 if n_r = P-1 then 104 if m_index_r = b_weight then 105 next_state <= DONE; 106 next_state <= GET_M; 107 108 end if; 109 end if; when DONE => 110 done_o <= '1'; if start_i = '0' then 111 112 113 next_state <= IDLE; end if; 114 115 when others => next_state <= IDLE; 116 117 end case; 118 end process comb; 119 120 seq: process(rst_n_i, clk_i) 121 begin if rst_n_i = 0, then 122 123 <= IDLE; state 124 m_index_r = 0; 125 m_r <= to_unsigned(0, position_bits); 126 to_unsigned(0, address_bits); n_r \leq = 127 i_r <= to_unsigned(0, address_bits);</pre> 128 elsif rising_edge(clk_i) then 129 state <= next state:</pre> 130 m_index_r \leq m_index_c; 131 m_r = m_c; <= n_c; 132 n_r 133 i_r <= i_c; 134 end if; 135 end process seq; 136 137 end architecture; ``` ### A.2.2 $\underline{x}$ by $\mathbf{L}^T$ ``` library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; 4 library work; 5 use work.system_params.all; use work.matrix_types.all; use work.matrix_mult_functions.all; 8 use work.memory_map.all; \begin{array}{ll} \textbf{entity} & \textbf{x\_by\_Ltr} & \textbf{is} \\ \end{array} 10 11 port ( 12 control signals 13 clk_{-}i: in std_logic; 14 std_logic; rst_n_i: in in std_logic; 15 start_i: -- controls to multiplication core 16 17 vector_base_o: out addr; 18 block_base_o: out addr: 19 result_base_o: out addr; 20 start_mul_o: out std_logic; mul\_done\_i: 21 in std_logic; 22 - report back 23 done_o: out std_logic ); 24 25 end entity x_by_Ltr; 26 27 architecture rtl of x_by_Ltr is 28 29 type state_t is (IDLE, BUSY, DONE); 30 signal state: state_t; signal next_state: state_t; 31 32 33 natural; signal block_count_c , block_count_r: 34 35 signal vector_base_c , vector_base_r: addr: 36 signal block_base_c , block_base_r: addr; 37 38 39 vector_base_o <= vector_base_r;</pre> 40 41 block_base_o <= block_base_r;</pre> 42 result_base_o \leq S_BASE_ADDR; 43 44 comb: process (state, start_i, 45 block_count_r, block_base_r, vector_base_r, 46 mul_done_i) 47 begin '0'; 48 start_mul_o \leq = 49 next_state <= state; block_count_c 50 block_count_r; <= 51 block_base_c <= block_base_r;</pre> 52 vector_base_c <= vector_base_r; 53 done_o '0'; <= 54 case state is when IDLE => 55 56 block_count_c \ll 0; block_base_c <= LTR_BASE_ADDR; vector_base_c <= X_BASE_ADDR; if start_i = '1' then</pre> 57 58 59 ``` ``` 60 next_state <= BUSY; end if; 61 62 when BUSY => start_mul_o <= '1'; 63 if mul_done_i = '1' then 64 block_count_c <= block_count_r + 1;</pre> 65 -- M*DV is the maximum number of elements in a block of Ltr 66 67 block_base_c <= block_base_r + sum(M)*DV; vector_base_c <= vector_base_r + P; start_mul_o <= '0';</pre> 68 69 70 if block_count_r = N0-1 then 71 next_state \le DONE; 72 end if; 73 end if; 74 when DONE => done_o <= '1'; 75 76 if start_i = 0, then 77 next_state <= IDLE; 78 end if; 79 when others => 80 next_state <= IDLE; 81 end case; 82 end process comb; 83 84 seq: process(clk_i, rst_n_i) 85 begin 86 if rst_n_i = '0' then 87 IDLE; state <= 88 block_count_r \leq= 0; <= to_unsigned(0, address_bits);</pre> 89 block_base_r 90 vector_base_r <= to_unsigned(0, address_bits);</pre> 91 elsif rising_edge(clk_i) then 92 state <= next_state;</pre> 93 block_count_r <= block_count_c;</pre> 94 block_base_r \leq= block_base_c; 95 vector_base_r <= vector_base_c;</pre> end if; 96 97 end process seq; 98 99 end architecture rtl; ``` ## **A.2.3** $\mathbf{H}^T$ by $\underline{s}^T$ ``` library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; 3 4 library work; use work.system_params.all; 6 use work.matrix_types.all; use work.matrix_mult_functions.all; use work.memory_map.all; 9 entity Htr_by_str is 11 port ( 12 -- control signals 13 clk_i: in std_logic; in std_logic; in std_logic; 14 rst_n_i: 15 start_i: - controls to multiplication core ``` ``` 17 vector_base_o: out addr; 18 block_base_o: out addr; 19 result_base_o: out addr; 20 start_mul_o: out std_logic; 21 mul_done_i: in std_logic; 22 - report back 23 done_o: out std_logic 24 25 end entity Htr_by_str; 26 27 architecture rtl of Htr_by_str is 28 29 type state_t is (IDLE, BUSY, DONE); 30 signal state: state_t; 31 signal next_state: state_t; 32 33 signal block_count_c, block_count_r: natural; 34 vector_base_c , vector_base_r: block_base_c , block_base_r: 35 signal 36 addr: signal 37 signal result_base_c , result_base_r: addr; 38 39 begin 40 41 vector_base_o vector_base_r; <= 42 block_base_o <= block_base_r; 43 result_base_o <= result_base_r; 44 comb: process(state, start_i, 45 46 block_count_r, block_base_r, vector_base_r, result_base_r, 47 mul_done_i) 48 begin 49 '0'; start_mul_o <= 50 next_state \leq = state; 51 block_count_c \leq= block_count_r; block_base_r; 52 block_base_c <= 53 vector_base_c <= vector_base_r; 54 result_base_c result_base_r; \leq= 55 done_o '0'; <= 56 case state is 57 when IDLE => block\_count\_c 58 0: \leq = 59 block_base_c \leq= HTR_BASE_ADDR; S_BASE_ADDR: 60 vector_base_c <= <= SIGMA_BASE_ADDR; 61 result_base_c if start_i = '1' then 62 63 next_state <= BUSY; 64 end if; when BUSY => 65 start_mul_o <= '1'; if mul_done_i = '1' then 66 67 68 block_count_c <= block_count_r + 1; <= block_base_r + DV;</pre> 69 block_base_c 70 vector_base_c vector_base_r + P; \leq= result_base_c 71 result_base_r + P; <= 72 start_mul_o \leq = '0'; 73 if block_count_r = N0-1 then 74 75 \label{eq:control_next_state} \begin{array}{ll} \text{next\_state} & <= & DONE; \end{array} end if; 76 77 end if: when DONE => done_o <= '1'; ``` ``` if start_i = '0' then 80 next_state <= IDLE; 81 end if; 82 when others => next_state \le IDLE; 83 84 end case; 85 end process comb; 86 87 seq: process(clk_i, rst_n_i) 88 begin 89 if rst_n_i = '0' then 90 IDLE; state <= 91 block_count_r \leq= 0; block_base_r <= HTR_BASE_ADDR; 92 <= \quad S\_BASE\_ADDR; 93 vector\_base\_r 94 result_base_r \leq= SIGMA_BASE_ADDR; 95 elsif rising_edge(clk_i) then 96 state = next_state; 97 block\_count\_r \leq= block_count_c; 98 block_base_r \leq = block_base_c; 99 vector_base_r \leq= vector_base_c; 100 result_base_r <= result_base_c; end if; 101 102 end process seq; 103 104 end architecture rtl; ``` ## **A.2.4** $\mathbf{Q}^T$ by $\underline{\Sigma}^T$ ``` library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; library work; use work.system_params.all; 6 use work.matrix_types.all; use work.matrix_mult_functions.all; use work.memory_map.all; 10 entity Qtr_by_sigmatr is port ( 11 12 control signals 13 clk_i: in std_logic; std_logic; 14 rst_n_i: in 15 start_i: in std_logic; 16 - controls to multiplication core 17 vector_base_o: out addr; 18 block_base_o: out addr; 19 block_weight_o: out natural; 20 result_base_o: out addr: 21 start_mul_o: out std_logic; 22 mul_done_i: in std_logic; 23 — report back 24 done_{-o}: out std_logic 25 ); end entity Qtr_by_sigmatr; 26 27 28 architecture rtl of Qtr_by_sigmatr is 29 type state_t is (IDLE, BUSY, DONE); ``` ``` 31 signal state: state_t; 32 signal next_state: state_t; 33 34 block_row_c, block_row_r: signal natural: 35 signal block_col_c , block_col_r: natural; 36 37 signal Qtr_block_base_map: 38 addr_matrix (N0-1 downto 0, N0-1 downto 0) get_Qtr_block_base; 39 signal Qtr_block_limit_map: 40 n_matrix (N0-1 downto 0, N0-1 downto 0) get_block_limits; 41 42 vector_base_c , vector_base_r: block_base_c , block_base_r: addr; signal 43 signal addr: result_base_c , result_base_r: 44 signal addr; 45 46 begin 47 48 vector_base_o <= vector_base_r; 49 block_base_o block_base_r; <= 50 block_weight_o <= Qtr_block_limit_map(block_row_r, block_col_r); 51 result_base_o <= result_base_r; 52 53 comb: process (state, start_i, block_base_r , vector_base_r , result_base_r , block_row_r , block_col_r , 54 55 56 mul_done_i) 57 begin 58 '0'; start mul o <= 59 next_state <= state; block_row_c <= block_row_r;</pre> 60 block_col_c <= block_col_r; 61 62 block_base_c \leq= block_base_r; 63 vector_base_c <= vector_base_r;</pre> result_base_c 64 <= result_base_r;</pre> 65 done_o <= '0'; 66 case state is when IDLE => 67 block_row_c 68 <= 0; 69 block_col_c \leq= 0: <= QTR_BASE_ADDR; 70 block_base_c 71 vector_base_c \leq = SIGMA_BASE_ADDR; 72 result_base_c <= R if start_i = '1' then</pre> <= R_BASE_ADDR; 73 74 next_state <= BUSY; 75 end if; 76 when BUSY => 1 BUSY => start_mul_o <= '1'; if mul_done_i = '1' then start_mul_o <= '0';</pre> 77 78 79 if block_col_r /= N0-1 then 80 81 block_col_c \leq block_col_r + 1; <= QTR_BASE_ADDR + -- cont. 82 block base c Qtr\_block\_base\_map (\,block\_row\_r \,, \ block\_col\_r \,); 83 vector_base_c <= vector_base_r + P; result_base_c <= result_base_r;</pre> 84 85 86 else if block_row_r /= N0-1 then 87 88 block_row_c \leq block_row_r + 1; 89 block_col_c <= 0; 90 block_base_c <= QTR_BASE_ADDR + -- cont. 91 Qtr_block_base_map(block_row_r + 1, -- cont. block_col_r); ``` ``` 93 vector_base_c <= SIGMA_BASE_ADDR; 94 result\_base\_c <= result_base_r + P; 95 next\_state <= DONE; 96 end if; 97 98 end if; end if; 99 100 when DONE => 101 done_o <= '1'; if start_i = '0', then 102 103 next_state <= IDLE; end if; 104 105 when others => next_state <= IDLE; 106 end case; 107 108 end process comb; 109 110 seq: process(clk_i, rst_n_i) 111 begin 112 if rst_n_i = '0' then IDLE; 113 state <= 114 block_row_r 0; \leq = block\_col\_r 0; 115 \leq = 116 block_base_r <= HTR_BASE_ADDR; 117 vector_base_r <= S_BASE_ADDR; 118 result_base_r <= SIGMA_BASE_ADDR; 119 elsif rising_edge(clk_i) then 120 state <= next_state; 121 block_base_r \leq= block_base_c; 122 vector_base_r vector_base_c; \leq = result_base_r 123 <= result_base_c; 124 block_row_r \leq= block_row_c; 125 block_col_r <= block_col_c; 126 end if; 127 end process seq; 128 129 end architecture rtl; ``` ### $\mathbf{A.2.5} \quad \underline{e} \ \mathbf{by} \ \mathbf{H}^T$ ``` library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; library work; 4 use work.system_params.all; 6 use work.matrix_types.all; use work.matrix_mult_functions.all; use work.memory_map.all; entity e_by_Htr is 11 port ( 12 control signals 13 clk_i: std_logic; in 14 std_logic; rst_n_i: in 15 start_i: in std_logic; 16 - controls to multiplication core 17 vector_base_o: out addr; 18 block_base_o: out addr; 19 result_base_o: out addr; ``` ``` out std_logic; 20 start_mul_o: 21 mul_done_i: in std_logic; 22 - report back 23 done_o: out std_logic 24 25 end entity e_by_Htr; 26 27 architecture rtl of e_by_Htr is 28 type state_t is (IDLE, BUSY, DONE); 29 30 signal state: state_t; 31 signal next_state: state_t; 32 33 signal block_count_c , block_count_r : natural; 34 vector_base_c , vector_base_r: 35 signal addr; 36 signal block_base_c , block_base_r : addr; 37 38 begin 39 <= vector_base_r; <= block_base_r;</pre> vector_base_o 40 41 block_base_o <= S_BASE_ADDR; result\_base\_o 42 43 44 comb: process(state, start_i, 45 block_count_r, block_base_r, vector_base_r, 46 mul_done_i) 47 begin <= '0'; 48 start_mul_o 49 next_state \leq state; block\_count\_c <= block_count_r; <= block_base_r;</pre> 50 51 block_base_c 52 vector_base_c <= vector_base_r; <= '0'; 53 done_{-}o 54 case state is when IDLE => 55 56 block_count_c \ll 0; block_base_c <= HTR_BASE_ADDR; vector_base_c <= E_BASE_ADDR; if start_i = '1' then 57 58 59 60 next_state <= BUSY; end if; 61 62 when BUSY => start_mul_o <= '1'; if mul_done_i = '1' then 63 64 block_count_c <= block_count_r + 1; 65 <= block_base_r + DV;</pre> 66 block_base_c vector_base_c <= vector_base_r + P; start_mul_o <= '0';</pre> 67 68 \ \ if \ \ block\_count\_r \ = \ N0-1 \ \ then 69 70 next_state <= DONE; 71 end if; 72 end if; 73 when DONE => done_o <= '1'; if start_i = '0' then</pre> 74 75 76 next_state <= IDLE; 77 78 end if; when others => 79 next\_state \le IDLE; 80 end case; end process comb; ``` ``` 82 83 seq: process(clk_i, rst_n_i) 84 begin if rst_n_i = '0' then 85 IDLE; 86 state <= 87 block_count_r 0; <= to_unsigned(0, address_bits); 88 block_base_r <= 89 vector_base_r <= to_unsigned(0, address_bits);</pre> 90 elsif rising_edge(clk_i) then 91 state <= next\_state; 92 block_count_r block_count_c; \leq= 93 block_base_r <= block_base_c; 94 vector_base_r \leq = vector_base_c; 95 end if; 96 end process seq; 97 98 end architecture rtl; ``` ### A.3 Error update #### Peak search ``` library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; library work; use work.system_params.all; 6 use work.matrix_types.all; use work.matrix_mult_functions.all; use work.memory_map.all; 9 10 entity find_max is 11 port ( 12 control signals clk_i: in std_logic; 13 14 r\,s\,t\,{}_{\scriptscriptstyle -}\,n\,{}_{\scriptscriptstyle -}\,i : in std_logic; std_logic; 15 start_i: in 16 -- to memory 17 a_i: in pos; a_addr_o: out addr; 18 19 b_i: in pos; 20 b_addr_o: out addr; 21 z_o: out pos; 22 z_addr_o: out addr; 23 wr_o: out std_logic; 24 -- report back when done 25 max_idx_o: out n_array(15 downto 0); 26 out std_logic done_o: 27 ); 28 end entity find_max; 29 30 architecture rtl of find_max is 31 type state_t is (IDLE, BUSY, DONE); 32 33 signal state: state_t; 34 signal next_state: state_t; 35 ``` ``` 36 signal a_index_c , a_index_r: natural; 37 38 signal max_val_c, max_val_r: pos; n_array(15 downto 0); 39 max_indexes_c, max_indexes_r: signal 40 41 signal i_c , i_r: natural; 42 INVALID: 43 constant natural:= N0*P; 44 45 begin 46 <= R_BASE_ADDR + a_index_r; <= to_unsigned(0, address_bits);</pre> 47 a_addr_o 48 b_addr_o 49 to_unsigned(0, position_bits); Z_{-}O <= <= to_unsigned(0, address_bits); 50 z_addr_o 51 wr_o <= '0'; 52 max_idx_o <= max_indexes_r; 53 54 seq: process(clk_i, rst_n_i) 55 _{\rm begin} if rst_n_i = '0' then 56 57 state < = IDLE; 58 = 0; a_index_r 59 max_val_r <= to_unsigned(0, position_bits);</pre> 60 max\_indexes\_r <= (others => invalid); 61 <= 0; elsif rising_edge(clk_i) then 62 state <= next_state; 63 64 a_index_r \leq= a_index_c; max_val_r 65 \leq \max_{v \in C} \max_{v \in C} c 66 max\_indexes\_r <= max_indexes_c; 67 i_r <= i_c; end if; 68 end process seq; 69 70 comb: process( 71 \mathtt{state} \;,\;\; \mathtt{start\_i} \;, 72 a\_index\_r\ ,\ a\_i\ , 73 max_val_r, max_indexes_r, i_r) 74 begin 75 next\_state <= state; 'o'; 76 done_o \leq= 77 = 0; a_index_c 78 max_val_c \leq \max_{val_r}; 79 max_indexes_c <= max_indexes_r;</pre> <= i_r; 80 i_c 81 case state is 82 when IDLE => 83 max_val_c <= to_unsigned(0, position_bits);</pre> max_indexes_c <= (others => INVALID); 84 85 i _ c <= 0; if start_i = '1' then 86 next_state <= BUSY; 87 end if; 88 89 when BUSY => 90 a\_ind\,e\,x\_c \quad <= \quad a\_ind\,e\,x\_r \; + \; 1; 91 if a_i = max_val_r then 92 max_indexes_c(i_r) \le a_index_r; 93 -- if we have too many equal maxes we'll flip just some 94 if i_r /= 15 then 95 i_c <= i_r + 1; end if; 96 elsif a_i > max_val_r then ``` ``` 98 99 100 end if; 101 \label{eq:continuous_r} \begin{array}{ll} \textbf{if} & \texttt{a\_index\_r} &= & \texttt{N0*P-1} & \textbf{then} \\ \end{array} 102 a_index_c <= 0; next_state <= DONE; 103 104 105 end if; 106 when DONE => done_o <= '1'; if start_i = '0' then 107 108 \label{eq:next_state} \begin{array}{lll} \text{next\_state} & <= & \text{IDLE} \,; \end{array} 109 110 end if; 111 end case; 112 end process comb; 113 114 end architecture rtl; ``` #### Vector plus rows ``` library IEEE; use IEEE.std_logic_1164.all; 2 3 use IEEE.numeric_std.all; library work; 5 use work.system_params.all; 6 use work.matrix_types.all; 7 use work.memory_map.all; 8 use work.matrix_mult_functions.all; 10 entity vector_plus_rows is 11 port ( 12 -- control signals clk_{-i}: in std_logic; 13 14 rst_n_i: in std_logic; 15 start_i: in std_logic; -- row indexes 16 17 row_i dx_i: in n_array(15 downto 0); -- to memory 18 19 a_i: in pos; 20 out addr; a_addr_o: 21 b_i: in pos; 22 b_addr_o: out addr; 23 z_o: out pos; 24 z_addr_o: out addr; 25 out std_logic; wr_{-o}: 26 -- report back when done 27 done_o: out std_logic 28 ); 29 end entity vector_plus_rows; 30 31 architecture rtl of vector_plus_rows is 32 33 type state_t is (IDLE, RELATIVE_ROW, SUM_ROW, DONE); signal state: state_t; signal next_state: state_t; 34 state_t; 35 36 37 signal cur_idx_c: natural; signal cur_idx_r: natural; 38 39 ``` ``` 40 signal block_row_c: natural; 41 signal block_row_r: natural; 42 signal block\_col\_c: natural; block\_col\_r: 43 natural; signal 44 signal rel_row_c: natural; 45 signal rel_row_r: natural; 46 47 signal i_c: natural; 48 signal natural; i_r: 49 50 Qtr\_block\_base: addr; signal 51 signal Qtr_block_count: natural; 53 constant Qtr_block_limit_map: n_{-}matrix := get_block_limits; 54 Qtr_block_base_map: addr_matrix := get_Qtr_block_base; constant 55 56 constant INVALID: natural := N0*P; 57 58 begin 59 Qtr\_block\_count <= Qtr\_block\_limit\_map(block\_row\_r, block\_col\_r); 60 Qtr_block_base <= Qtr_block_base_map(block_row_r, block_col_r); 61 62 63 a_addr_o = E_BASE\_ADDR + block\_col_r*P + ((b_i + rel_row_r) mod P); 64 b_addr_o <= QTR_BASE_ADDR + Qtr_block_base + i_r; (a_i + 1) \mod 2; 65 Z_{-}O <= z_addr_o 66 E_BASE_ADDR + block_col_r*P + ((b_i + rel_row_r) mod P); \leq= 67 seq: process(clk_i, rst_n_i) 68 69 begin 70 if rst_n_i = '0' then 71 state \leftarrow IDLE; 72 cur_idx_r \leq= 0; 73 block_row_r <= 0; 74 block_col_r <= 0; 75 \texttt{rel\_row\_r} \quad <= \quad 0\,; 76 i_r <= 0; 77 elsif rising_edge(clk_i) then 78 state <= next_state;</pre> 79 cur_idx_r \leq= cur_idx_c; 80 block_row_r <= block_row_c; 81 block_col_r <= block_col_c; 82 rel_row_r <= rel_row_c; <= \quad i\, {}_-c \ ; 83 i r end if; 84 end process seq; 85 86 comb: process( 87 state, start_i, 88 block\_row\_r \;, \; block\_col\_r \;, \; rel\_row\_r \;, 89 row_idx_i, cur_idx_r, i_r, 90 Qtr_block_count) begin 91 92 \verb"next_state" <= \verb"state"; 93 cur_idx_c <= cur_idx_r; 94 block_row_c <= block_row_r; 95 block_col_c <= block_col_r; 96 rel_row_c <= rel_row_r; 97 i_c \leq = 0; 98 wr_o '0'; 'o'; 99 done_o <= 100 case (state) is 101 when IDLE => ``` ``` 102 cur_idx_c \ll 0; block_row_c \le 0; 103 block_col_c <= 0; if start_i = '1' then 104 105 \mbox{next\_state} \ \ <= \ \ \mbox{RELATIVE\_ROW}; 106 107 end if; when RELATIVE_ROW => 108 109 if row_idx_i(cur_idx_r) /= INVALID then block_row_c <= row_idx_i(cur_idx_r) / P; rel_row_c <= row_idx_i(cur_idx_r) mod P; 110 111 next_state <= SUMLROW; 112 113 next_state <= DONE; 114 end if; 115 \frac{\text{when SUMLROW}}{} = > 116 <= '1'; 117 wr_{-0} if i_r /= Qtr_block_count then 118 119 i _ c <= i_r + 1; 120 121 if block_col_r \neq N0-1 then b \hspace{.05cm} l \hspace{.05cm} c \hspace{.05cm} k \hspace{.05cm} \_ \hspace{.05cm} c \hspace{.05cm} l \hspace{.05cm} \_ \hspace{.05cm} c \hspace{.05cm} l \hspace{.05cm} \_ \hspace{.05cm} c \hspace{.05cm} l \hspace{.05cm} \_ \hspace{.05cm} r \hspace{.05cm} + \hspace{.05cm} 1; 122 123 if \ cur\_idx\_r = 15 \ then 124 125 next_state \le DONE; 126 else \begin{array}{lll} \mathtt{cur\_idx\_c} & <= & \mathtt{cur\_idx\_r} + 1; \\ \mathtt{block\_col\_c} & <= & 0; \end{array} 127 128 \begin{array}{lll} \text{next\_state} & <= & \begin{array}{ll} \text{RELATIVE\_ROW}; \end{array} 129 130 end if: end if; 131 end if; 132 133 when DONE => <= '1'; 134 done_o if start_i = '0' then 135 136 next_state <= IDLE; 137 138 end case; 139 end process comb; 140 141 end architecture rtl; ``` ### A.4 Loop control and message computation #### Zero syndrome detection ``` library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; library work; use work.system_params.all; use work.matrix_types.all; use work.matrix_mult_functions.all; use work.memory_map.all; entity null_syndrome is port( — control signals ``` ``` 13 clk_i: in std_logic; 14 r\,s\,t _ n _ i : in std_logic; 15 start_i: i\, n std_logic; 16 -- to memory 17 a_i: in pos; 18 a_addr_o: out addr; 19 b_i: in pos; out addr; 20 b_addr_o: 21 z_o: out pos; 22 z_addr_o: out addr; 23 out std_logic; wr_{-o}: -- report back when done 24 25 null_syn_o: out std_logic; 26 done_o: out std_logic 27 28 end entity null_syndrome; 29 30 architecture rtl of null_syndrome is 31 32 type state_t is (IDLE, BUSY, DONE); 33 signal state: state_t; 34 signal next_state: state_t; 35 36 signal bad_syn_c: std_logic; 37 signal bad_syn_r: std_logic; 38 39 signal i_c: natural; 40 signal i_r: natural; 41 42 begin 43 44 a_addr_o <= S_BASE\_ADDR + i_r; S_BASE_ADDR + i_r + 1; 45 b_addr_o \leq = 46 to_unsigned(0, position_bits); Z_{-}O \leq = 47 z_addr_o <= S_BASE_ADDR; 48 <= '0'; wr o 49 null_syn_o <= not bad_syn_r;</pre> 50 51 seq: process(clk_i, rst_n_i) 52 begin 53 if rst_n_i = '0' then 54 <= IDLE; state <= '0'; <= 0; 55 bad_syn_r 56 i_r 57 elsif rising_edge(clk_i) then 58 state <= next_state;</pre> <= bad_syn_c; <= i_c; 59 bad\_syn\_r 60 i_r end if; 61 62 end process seq; comb: process(state, start_i, a_i, b_i, i_r, bad_syn_c, bad_syn_r) 63 64 begin 65 next_state <= state; 66 bad_syn_c bad_syn_r; \leq= 67 i_c <= i_r; 68 done_o \leq= '0'; 69 case state is when IDLE => 70 71 bad_syn_c <= '0'; 72 i_c = 0; 73 if start_i = '1' then next_state <= BUSY; ``` ``` end if; 76 \frac{\mathbf{when}}{\mathbf{BUSY}} \Longrightarrow i_c <= i_r + 2; if a_i /= 0 then 77 78 bad_syn_c <= '1'; 79 end if; if b_i /= 0 and i_r /= P-1 then bad_syn_c <= '1'; 80 81 82 83 end if; if bad_syn_c = '1' or i_r = P-1 then next_state <= DONE;</pre> 84 85 end if; 86 \underline{\text{when}} \ \ DONE => 87 <= '1'; 88 done_{-}o if start_i = '0' then 89 next_state <= IDLE; 90 91 92 end case; 93 end process comb; 94 95 end architecture rtl; ``` #### Intermediate result clear ``` library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; library work; 4 use work.system_params.all; use work.matrix_types.all; use work.matrix_mult_functions.all; use work.memory_map.all; 9 entity clear_temp is 11 \operatorname{port}\left( \right. 12 -- control signals 13 clk_{-i}: in std_logic; in std_logic; in std_logic; 14 rst_n_i: 15 start_i: 16 -- to memory 17 a_i: in pos; 18 a_addr_o: out addr; in pos; 19 b_i: b_addr_o: 20 out addr; 21 z_o: out pos; 22 out addr; z_addr_o: 23 wr_{-o}: out std_logic; 24 -- report back when done 25 done_o: out std_logic 26 27 end entity clear_temp; 28 29 architecture rtl of clear_temp is 30 31 type state_t is ( 32 IDLE, CLEAR_MUL_RES, CLEAR_SUM_TMP, CLEAR_SIGMA, 33 CLEAR_R, DONE 34 35 signal state: state_t; ``` ``` 36 signal next_state: state_t; 37 38 signal i_c: signal i_r: integer; 39 integer; 40 41 begin 42 <= HTR_BASE_ADDR; 43 a_addr_o <= HTR_BASE_ADDR; <= to_unsigned(0, position_bits);</pre> 44 b_addr_o 45 46 47 seq: process(clk_i, rst_n_i) 48 49 if rst_n_i = '0' then state <= IDLE; i_r <= 0; 50 51 elsif rising_edge(clk_i) then 52 state <= next_state; i_r <= i_c; 53 54 end if; 55 56 end process seq; 57 comb: process(state, start_i, i_r) 58 begin 59 i _ c <= i_r; 60 done_o <= '0'; case state is 61 62 when IDLE => <= 0; <= HTR_BASE_ADDR; <= '0'; 63 i_c 64 z_addr_o 65 wr_o if start_i = '1' then 66 next_state <= CLEAR_MUL_RES; 67 end if; 68 69 when CLEAR_MUL_RES => 70 i_c <= i_r + 1; 71 72 73 if i_r = DV*max(M) - 1 then i_c <= 0; next_state <= CLEAR_SUM_TMP; 74 75 76 end if; 77 when CLEAR_SUM_TMP => = i_r + 1; 78 79 80 81 if i_r = DV*sum(M) - 1 then i_c <= 0; next_state <= CLEAR_SIGMA; 82 83 end if; 84 when CLEAR_SIGMA => 85 86 87 88 89 if i_r = N0*P - 1 then 90 = 0; i_c 91 next\_state <= CLEAR\_R; 92 end if; 93 \begin{array}{ll} \textbf{when} & \textbf{CLEAR\_R} \implies \\ \end{array} 94 95 96 if i_r = N0*P - 1 then ``` ``` 98 <= 0; 99 next_state \le DONE; 100 end if; when DONE => 101 102 i _ c <= 0; 103 <= HTR_BASE_ADDR; z_addr_o <= '0'; <= '1'; 104 wr_o 105 done_o 106 if start_i = '0' then next_state \le IDLE; 107 108 109 end case; 110 end process comb; 111 112 end architecture rtl; ``` ## Message computation ``` library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; library work; use work.system_params.all; use work.matrix_types.all; use work.matrix_mult_functions.all; use work.memory_map.all; 9 10 entity comp_message is port ( 11 - control signals 12 clk_i: 13 in std_logic; std_logic; std_logic; 14 rst_n_i: in 15 start_i: in 16 -- to memory 17 a_i: in pos; 18 a_addr_o: out addr; 19 in pos; b_i: 20 b_addr_o: out addr; 21 z_o: out pos; 22 z_addr_o: out addr; 23 wr_{-o}: out std_logic; 24 -- report back when done 25 \mathtt{done\_o}: out std_logic 26 27 end entity comp_message; 28 29 architecture rtl of comp_message is 30 type state_t is (IDLE, BUSY, DONE); 31 32 signal state: state_t; 33 signal next_state: state_t; 34 35 signal i_c , i_r: integer; 36 37 begin 38 <= X_BASE_ADDR + i_r; <= E_BASE_ADDR + i_r;</pre> 39 a_addr_o 40 b_addr_o 41 <= \quad a\_i \quad xor \quad b\_i \; ; Z_{-}O ``` ``` = U_BASE_ADDR + i_r; 42 z_addr_o 43 44 seq: process(clk_i, rst_n_i) 45 begin if rst_n_i = '0' then 46 47 state <= IDLE; \textcolor{red}{\textbf{elsif}} \hspace{0.2cm} \textbf{rising\_edge} \hspace{0.1cm} (\hspace{0.1cm} \textbf{clk\_i}\hspace{0.1cm}) \hspace{0.2cm} \textcolor{red}{\textbf{then}} 48 49 state <= next_state; 50 end if; 51 end process seq; 52 comb: process(state, start_i, i_r) 53 54 next_state <= state; 55 i _ c <= 0; 'o'; 56 <= wr_{-o} {\tt done\_o} 57 \leq= 'o'; case (state) is 58 59 when IDLE => if start_i = '1' then 60 61 next_state <= BUSY; end if; 62 63 when BUSY => <= i_r + 1; <= '1'; 64 i _ c 65 66 if i_r = N0*(P-1) - 1 then n ext_state \le DONE; 67 end if; when DONE => 69 done_o <= '1'; if start_i = '0' then 70 71 72 next_state <= IDLE; 73 end if; 74 end case; 75 end process comb; 76 end architecture rtl; ``` ## A.5 Top module ``` library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; library work; use work.system_params.all; use work.matrix_types.all; use work.matrix_mult_functions.all; entity top is 10 port ( 11 control signals clk_{-}i: 12 in std_logic; 13 in std_logic; rst_n_i: in std_logic; 14 start_i: 15 -- to memory 16 a_i: in pos; 17 a_addr_o: out addr; 18 b_i: out addr; 19 b_addr_o: ``` ``` 20 z_o: out pos; 21 z\_addr\_o: \quad \begin{array}{c} out \ addr; \end{array} 22 wr_{-o}: out std_logic; 23 -- report back when done 24 failure_o: out std_logic; 25 done_o: out std_logic 26 27 end entity top; 28 29 architecture rtl of top is 30 31 component key_reconstruction is 32 port ( 33 -- control signals clk_i: in std_logic; in std_logic; 34 35 rst_n_i: 36 in std_logic; start_i: -- data, addresses and controls to memory 37 38 in pos; a_i: out addr; 39 a_addr_o: 40 b_i: in pos; 41 b_addr_o: out addr; out pos; 42 z_o: 43 z_addr_o: out addr; 44 wr_o: out std_logic; -- report back once done 45 46 key_rec_done_o: out std_logic 47 ); end component key_reconstruction; 48 49 signal kr_start: std_logic; kr_a_addr: 50 signal addr; 51 signal kr_b_addr: addr; 52 signal kr_z: pos; 53 kr_z_addr: addr; signal signal kr_wr: signal kr_done: 54 std_logic; 55 std_logic; 56 57 component vector_by_circulant is 58 port ( 59 - control signals in std_logic; in std_logic; 60 clk_i: 61 rst_n_i: 62 start_i: in std_logic; binary_i: in std_logic; — function arguments in (not latched) 63 64 a_base_i: in addr; 65 in addr; in integer; 66 b_base_i: 67 b_weight: in addr; 68 z_base_i: in std_logic; -- '0' -> z=aB; '1' -> z=Ba 69 - data, addresses and controls to memory 70 in pos; 71 a_i: 72 a_addr_o: out addr; in pos; out addr; 73 b_i: 74 b_addr_o: 75 z_o: out pos; 76 z_addr_o: out addr; 77 78 out std_logic; wr_{-0}: done_o: out std_logic 79 ); end component vector_by_circulant; 80 signal vc_start: std_logic; ``` ``` 82 signal vc_binary: std_logic; 83 signal vc\_vec\_base: addr; 84 signal vc_blk_base: addr; 85 vc_blk_weight: natural; signal 86 signal vc_res_base: addr; 87 signal vc_op: std_logic; vc_aaddr: 88 signal addr: 89 signal vc_b_addr: addr; 90 vc_z: signal pos; 91 signal vc_z_addr: addr; 92 std_logic; signal vc_-wr: 93 signal vc_done: std_logic; 94 95 component x_by_Ltr is 96 port ( 97 control signals std_logic; 98 clk_i: in 99 r\,s\,t _ n _ i : in std_logic; 100 std_logic; start_i: in -- controls to multiplication core 101 102 vector_base_o: out addr; 103 block_base_o: out addr; result_base_o: out addr; 104 105 start_mul_o: out std_logic; 106 mul_done_i: in std_logic; 107 -- report back done_o: 108 out std_logic 109 ); end component x_by_Ltr; 110 111 signal xl_start: std_logic; xl_vec_base: addr; 112 \operatorname{signal} 113 \operatorname{signal} xl_blk_base: addr; 114 signal xl_res_base: addr; xl_start_mul: 115 signal std_logic; 116 signal xl\_done: std_logic; 117 118 component Htr_by_str is 119 port ( 120 control signals 121 clk_i: _{\rm in} std_logic; 122 rst_n_i: std_logic; in 123 start_i: in std_logic; 124 - controls to multiplication core 125 vector_base_o: out addr; 126 block_base_o: out addr; result_base_o: 127 out addr; 128 start_mul_o: out std_logic; 129 mul_done_i: in std_logic; 130 - report back {\tt done\_o:} 131 out std_logic 132 ); 133 end component Htr_by_str; std_logic; 134 signal hs_start: 135 hs_vec_base: addr; signal 136 hs_blk_base: addr: signal 137 signal hs_res_base: addr; 138 signal hs_start_mul: std_logic; 139 hs_done: signal std_logic; 140 141 component Qtr_by_sigmatr is 142 port ( control signals ``` ``` in std_logic; 144 clk_i: 145 rst_n_i: in std_logic; 146 start_i: in std_logic; 147 -- controls to multiplication core 148 vector_base_o: out addr; 149 block_base_o: out addr; 150 block_weight_o: out natural; 151 result_base_o: out addr; 152 start_mul_o: out std_logic; 153 mul_done_i: in std_logic; 154 - report back done_o: 155 out std_logic 156 ); 157 end component Qtr_by_sigmatr; std\_logic; 158 signal qs_start: 159 signal qs_vec_base: addr; 160 qs_blk_base: addr; signal 161 signal qs_blk_weight: natural; 162 qs_res_base: signal addr; 163 qs\_start\_mul: std_logic; signal 164 signal qs_done: std_logic; 165 component find_max is 166 167 port ( 168 control signals 169 clk_i: in std_logic; rst_n_i: 170 std_logic; in 171 start_i: in std_logic; 172 -- to memory 173 a_i: in pos: 174 a_addr_o: out addr; pos; 175 b_i: in 176 b_addr_o: out addr; 177 z_{-0}: out pos; 178 z_addr_o: out addr; out std_logic; 179 wr_o: 180 -- report back when done 181 max_idx_o: out n_array(15 downto 0); 182 out std_logic done_o: 183 ); 184 end component find_max; 185 fm_start: std_logic; signal 186 fm_aaddr: addr; signal 187 \operatorname{signal} fm_b_addr: addr; 188 signal fm_z: pos; 189 signal fm_z_addr: addr; 190 fm_wr: \operatorname{std\_logic}; signal 191 signal fm_max_idx: n_array(15 downto 0); 192 fm_done: std_logic; signal 193 signal max_idx_c: n_array(15 downto 0); 194 max_idx_r: n_array(15 downto 0); signal 195 196 component vector_plus_rows is 197 port ( 198 control signals 199 clk_i: in std_logic; 200 rst_n_i: in std_logic; 201 start_i: in std_logic; 202 -- row indexes 203 row_idx_i: in n_array(15 downto 0); 204 -- to memory a_i: pos; ``` ``` 206 a_addr_o: out addr; 207 b_i: in pos; 208 b_addr_o: out addr; 209 z_o: out pos; z_addr_o: 210 out addr; 211 out std_logic; wr_o: 212 -- report back when done 213 done_o: out std_logic 214 ); 215 end component vector_plus_rows; 216 signal vr_start: std_logic; addr; 217 signal vr_a_addr: 218 signal vr_b_addr: addr: 219 vr_{-z}: signal pos; 220 signal vr_z_addr: addr; 221 signal vr_-wr: std_logic; 222 signal vr_done: std_logic; 223 224 component e_by_Htr is 225 port ( 226 -- control signals 227 clk_i: in std_logic; in std_logic; 228 rst_n_i: 229 start_i: in std_logic; 230 -- controls to multiplication core 231 vector_base_o: out addr; 232 block_base_o: out addr; 233 result_base_o: out addr; 234 start_mul_o: out std_logic; 235 mul_done_i: in std_logic; 236 -- report back 237 done_o: out std_logic 238 ); 239 end component e_by_Htr; 240 signal eh_start: std_logic; addr; 241 eh_vec_base: signal 242 signal eh_blk_base: addr: 243 eh_res_base: addr; signal 244 eh_start_mul: std_logic; signal 245 signal eh_done: std_logic; 246 247 component null_syndrome is 248 port ( 249 - control signals 250 clk_i: in std_logic; 251 in std_logic; rst_n_i: 252 start_i: in std_logic; 253 -- to memory 254 in pos; a_i: 255 a_addr_o: out addr; 256 b_i: in pos; out addr; 257 b_addr_o: 258 z_o: out pos; 259 z_addr_o: out addr; 260 out std_logic; wr_o: 261 - report back when done 262 null_syn_o: out std_logic; 263 out std_logic done_o: 264 265 end component null_syndrome; 266 std_logic; signal ns_start: 267 signal ns_a_addr: addr; ``` ``` 268 signal ns_b_addr: addr; 269 pos; signal ns_{-z}: 270 signal ns_z_addr: addr; 271 std_logic; signal ns_wr: ns_null_syn: 272 signal std_logic; 273 signal ns_done: std_logic; 274 275 component clear_temp is 276 port ( 277 control signals 278 clk_i: in std_logic; 279 rst_n_i: in std_logic; 280 start_i: in std_logic; 281 -- to memory 282 a_i: in pos; 283 a_addr_o: out addr; 284 b_i: in pos; 285 b_addr_o: out addr; 286 z_o: out pos; z_addr_o: 287 out addr; 288 wr_o: out std_logic; 289 - report back when done 290 done_o: out std_logic 291 ); 292 end component clear_temp; 293 signal ct_start: std_logic; 294 signal \operatorname{ct\_a\_addr}: addr; 295 \operatorname{signal} ct_b_addr: addr; 296 signal ct_z: pos; 297 ct_z_addr: signal addr; 298 signal ct_wr: std_logic; 299 \operatorname{signal} ct_null_syn: std_logic; 300 std_logic; signal ct_done: 301 302 component comp_message is 303 port ( 304 - control signals in std_logic; in std_logic; 305 clk_i: 306 rst_n_i: 307 start_i: in std_logic; 308 -- to memory 309 a_i: in pos; 310 a_addr_o: out addr; 311 b_i: in pos; b_addr_o: 312 out addr; 313 z_{-0}: out pos; z_addr_o: out addr; 314 315 wr_o: out std_logic; 316 - report back when done 317 done_o: out std_logic 318 ); end component comp_message; 319 320 std_logic; signal cm_start: 321 signal cm_aaddr: addr; 322 cm_b_addr: signal addr; 323 signal \mathrm{cm}_{-}\mathrm{z}: pos; 324 cm_z_addr: signal addr; 325 cm_wr: std_logic; signal 326 signal cm_done: std_logic; 327 328 type state_t is ( 329 IDLE, ``` ``` 330 COMPUTELTR, COMPUTE_INITIAL_SYNDROME, 331 332 COMPUTE_SIGMA, 333 COMPUTE_R, FIND_B, 334 335 COMPUTE ERROR, 336 COMPUTE SYNDROME, 337 CHECK_SYNDROME, 338 CLEAR_TEMP_AND_LOOP, 339 COMPUTE MESSAGE, 340 CLEAR_TEMP_AND_RETURN, 341 DONE 342 ); 343 signal state: state_t; 344 signal next_state: state_t; 345 346 signal l_c: integer; 347 signal l_r: integer; 348 349 begin 350 351 KR: key_reconstruction 352 port map ( 353 clk_i \Rightarrow clk_i, 354 rst_n_i rst_n_i, => 355 start_{-}i => kr_start , 356 a_{-}i a_i , => 357 a_addr_o kr_a_addr, => 358 b_i => b_i , 359 b_addr_o kr_b_addr, => 360 Z_{-}O => kr_{-}z , 361 z_addr_o => kr_z_addr , 362 wr_o => kr_wr, key_rec_done_o \implies kr_done 363 364 ); 365 366 VC: vector_by_circulant port map ( 367 clk_i , clk_i 368 => 369 rst_n_i \Rightarrow rst_n_i, 370 start_i => vc_start , binary\_i 371 => vc_binary, 372 a\_b\,a\,s\,e\_i => vc_vec_base, 373 b_base_i => vc_blk_base, 374 vc_blk_weight, b_weight => 375 z_base_i => vc_res_base, 376 => vc_op , ^{\mathrm{op}} 377 a_{-}i => a_i , 378 a_addr_o => vc_a_addr, 379 b_i => b₋i, 380 b_addr_o => vc_b_addr, vc_{\,-}z\ , 381 => Z 0 z_addr_o \Rightarrow vc_z_addr, 382 383 wr_o => vc_wr, 384 \Rightarrow vc_done done_o 385 ); 386 387 XL: x_by_Ltr 388 port map ( \Longrightarrow clk_i, 389 clk_i 390 rst_n_i \Rightarrow rst_n_i, 391 start_i \Rightarrow xl_start, ``` ``` vector_base_o \Rightarrow xl_vec_base, 392 393 block_base_o xl_blk_base, => 394 result\_base\_o => xl_res_base, 395 start_mul_o xl_start_mul, => 396 mul_done_i => vc_done, 397 done_o xl_done 398 ); 399 HS: Htr_by_str 400 401 port map ( 402 clk_i clk_i, 403 rst_n_i => rst_n_i , 404 s\,t\,a\,r\,t\,\_\,i => hs_start, hs_vec_base, vector_base_o 405 => block_base_o 406 hs_blk_base, => 407 result\_base\_o => hs_res_base, 408 hs_start_mul, start_mul_o => 409 mul_done_i => vc_done, 410 done_o hs\_done 411 ); 412 413 QS: Qtr_by_sigmatr 414 port map ( 415 c\,l\,k\,\_\,i clk_i , 416 rst_n_i => rst_n_i, 417 start_i => qs_start , 418 vector_base_o qs_vec_base, => 419 block_base_o qs_blk_base, => 420 block_weight_o => qs_blk_weight, 421 result_base_o qs_res_base, => start_mul_o qs_start_mul, 422 => 423 mul_done_i => vc_done, 424 qs_done done_o 425 ); 426 427 FM: find_max 428 port map ( 429 clk_i => clk_i, 430 rst_n_i => rst_n_i, 431 start_i => fm_start, 432 a_i => a_i , 433 a_addr_o => fm_a_addr, 434 b_{-i} b_{-i}, => 435 b_addr_o fm_b_addr, => 436 Z_O => fm_z, 437 z_addr_o fm_z_addr, => 438 fm_-wr , wr_o => 439 max_idx_o => fm_max_idx, 440 fm_done done_o 441 ); 442 443 VR: vector_plus_rows 444 port map ( 445 clk_i clk_i, 446 rst_n_i rst_n_i , => 447 start_i => vr_start , 448 row_idx_i => max_idx_r, 449 a_i => a₋i , 450 a_addr_o vr_a_addr, b_{-i} , 451 b_i => b\_addr\_o 452 vr_b_addr , => Z_{-}O vr_{-z} , ``` ``` z_addr_o 454 => vr_z_addr, 455 wr_{-o} \Rightarrow vr_-wr, 456 \mathtt{done\_o} => vr_done 457 ); 458 459 EH: e_by_Htr 460 port map ( => clk_i, 461 clk_i 462 rst_n_i \Rightarrow rst_n_i, 463 start_i => eh_start, 464 vector_base_o => eh_vec_base, eh_blk_base, 465 block_base_o => result\_base\_o 466 => eh_res_base, 467 start_mul_o eh_start_mul, => 468 mul_done_i => vc_done, 469 done_o => eh_done 470 ); 471 472 NS: null_syndrome 473 port map ( => clk_i, clk_i 474 475 rst_n_i => rst_n_i, 476 => ns_start , start_i 477 a_i => a_i , 478 a_addr_o => ns_a_addr, 479 b_i => \quad b_-i \ , 480 b_addr_o \Rightarrow ns_b_addr, 481 {\rm n\,s\,{\scriptscriptstyle -}z} , => Z_O 482 z_addr_o ns_z_addr, => 483 wr_o => ns_wr, ns_null_syn, 484 null_syn_o => 485 done_o => ns\_done 486 ); 487 488 CT: clear_temp 489 port map ( 490 clk_i \Rightarrow clk_i, 491 rst_n_i rst_n_i, => \Rightarrow ct_start, 492 start_i => a_i , 493 a_i 494 a_addr_o => ct_a_addr, 495 b_i => b<sub>-</sub>i, 496 b_addr_o ct_b_addr, => {\rm ct}_{\,-}{\rm z}\ , 497 Z_O => 498 z_addr_o ct_z_addr , => 499 wr_o ct_wr, => 500 done\_o \Rightarrow ct_done 501 ); 502 503 CM: comp\_message 504 port map ( => clk_i, clk_i 505 506 rst_n_i \Rightarrow rst_n_i, 507 start_i cm_start, => a_{-}i , 508 a_i => 509 a_addr_o => cm_a_addr, 510 b_i , b_i => 511 b_addr_o \Longrightarrow cm_b_addr, 512 Z_{-}O => cm_z, z_addr_o 513 => cm_z_addr, 514 wr_{-0} => cm_wr, 515 _{\rm done\_o} => cm\_done ``` ``` 516 ); 517 518 seq: process(clk_i, rst_n_i) 519 begin if rst_n_i = 0, then 520 <= \quad \mathrm{IDLE}\,; 521 state 522 \langle = (others \Rightarrow 0); max_idx_r 523 = 0; l_r 524 elsif rising_edge(clk_i) then 525 state <= next_state; 526 \leq \max_{i} \operatorname{id} x_{i} c; max_idx_r 527 <= \ l_-c \ ; l_r end if; 528 529 end process seq; -- TODO: put everything in sensitivity list 530 531 comb: process(all) 532 state, start_i, 533 kr\_a\_addr\;,\;\;kr\_b\_addr\;,\;\;kr\_z\;,\;\;kr\_z\_addr\;,\;\;kr\_wr\;,\;\;kr\_done\;, \begin{array}{l} vc\_a\_addr\;,\;\;vc\_b\_addr\;,\;\;vc\_z\;,\;\;vc\_z\_addr\;,\;\;vc\_wr\;,\;\;\\ xl\_start\_mul\;,\;\;xl\_vec\_base\;,\;\;xl\_blk\_base\;,\;\;xl\_res\_base\;,\;\;xl\_done\;, \end{array} 534 535 536 hs\_start\_mul\;,\;\; hs\_vec\_base\;,\;\; hs\_blk\_base\;,\;\; hs\_res\_base\;,\;\; hs\_done\;, 537 538 vr_a_addr, vr_b_addr, vr_z, vr_z_addr, vr_wr, vr_done, eh_start_mul, eh_vec_base, eh_blk_base, eh_res_base, eh_done) 539 540 541 begin 542 next\_state \leq state; a_addr_o <= \quad {\tt to\_unsigned} \; (0 \; , \;\; {\tt address\_bits} \; ) \, ; 543 544 b_addr_o <= to_unsigned(0, address_bits); to_unsigned(0, position_bits); 545 Z_{-}O <= 546 z_addr_o <= to_unsigned(0, address_bits); 547 wr_o <= '0'; '0'; 548 failure_o <= 'o'; 549 done_o \leq = 550 kr_start <= '0'; ,0 ; 551 vc start <= '0'; 552 vc_binarv <= vc_vec_base to_unsigned(0, address_bits); 553 <= 554 vc_blk_base to_unsigned(0, address_bits); \leq = 555 vc_blk_weight <= 556 vc_res_base <= to_unsigned(0, address_bits); 557 vc_op <= '0'; 558 xl_start '0'; <= 'o'; 559 hs start <= 'o'; 560 qs_start <= '0'; 561 fm_start <= 562 max_idx_c <= max_idx_r; 563 vr_start <= '0'; '0'; 564 eh_start <= 'o'; 565 ns_start <= 566 '0'; ct_start ,0; 567 cm start <= 568 l_c l_r; 569 case state is 570 when IDLE => 571 l_{-c} <= 0; if start_i = '1' then 572 \label{eq:compute_transform} \begin{array}{lll} \text{next\_state} & <= & \text{COMPUTELTR}; \end{array} 573 574 end if; when COMPUTELTR => 575 576 a_addr_o \leq kr_a_addr; b_addr_o \leq kr_b_addr; ``` ``` <= k \, r_- z \; ; 578 Z_O 579 z_addr_o <= kr_z_addr; 580 wr_o <= kr_wr; '1'; 581 kr_start \leq = if kr_done = '1' then 582 next_state <= COMPUTE_INITIAL_SYNDROME;</pre> 583 end if: 584 when COMPUTE_INITIAL_SYNDROME => 585 586 a_addr_o \leq vc_a addr; 587 b_addr_o <= vc_b_addr; 588 Z_O <= vc_z; <=\ v\,c_{\,-}z_{\,-}a\,d\,d\,r\;; 589 z_addr_o 590 wr_{-o} <= vc_wr; 591 vc\_start <= xl_start_mul; '1'; 592 vc_binary <= 593 vc\_vec\_base \leq= xl_vec_base; 594 vc_blk_base <= xl_blk_base; 595 vc_blk_weight \leq= sum(M)*DV; 596 vc_res_base xl_res_base; \leq= 597 vc_op <= 'o'; <= '1'; 598 xl_start if xl_done = '1' then 599 next\_state <= COMPUTE\_SIGMA; 600 601 end if; 602 when COMPUTE_SIGMA => 603 a_addr_o <= vc_aaddr; 604 b_addr_o vc_b_addr; <= 605 <= \quad v\,c_{\,-}z\;; Z_{-}O z_addr_o 606 \leq= vc_z_addr; 607 wr_o <= vc_wr; 608 vc\_start \leq = hs_start_mul; 609 vc_binary \leq= '0'; 610 vc_vec_base <= hs_vec_base; vc\_blk\_base 611 \leq= hs_blk_base; 612 vc_blk_weight \leq= DV; 613 vc_res_base <= hs_res_base; 614 vc_op \leq = '1'; 615 '1'; hs_start <= if hs_done = '1' then 616 \mbox{next\_state} \quad <= \quad \mbox{COMPUTE\_R}; \\ 617 618 end if; when COMPUTER => 619 620 a_addr_o \leq vc_aaddr; b_addr_o 621 <= vc_b_addr; 622 Z_{-}O <= vc_z; z_addr_o 623 <= vc_z_addr; 624 <= vc_-wr; wr_{-o} 625 vc\_start \leq= qs_start_mul; 626 'o'; vc_binary \leq= 627 vc\_vec\_base \leq = qs_vec_base; 628 vc\_blk\_base \leq= qs_blk_base; qs_blk_weight + 1; 629 vc_blk_weight <= 630 vc_res_base \leq = qs_res_base; 631 '1'; vc_op \leq= <= '1'; 632 qs_start if qs_done = '1' then 633 634 next_state <= FIND_B; end if; 635 636 when FIND_B => 637 a_addr_o \leq fm_aaddr; \leq fm_b_addr; 638 b_addr_o Z_{-}O <= fm_z; ``` ``` 640 z_addr_o \leq fm_z_addr; 641 wr_{-0} \leq = \operatorname{fm_-wr}; 642 fm_start <= '1'; if fm_done = '1' then 643 max_idx_c <= fm_max_idx; next_state <= COMPUTEERROR;</pre> 644 645 end if; 646 when COMPUTE_ERROR => 647 <= vr_a_addr; <= vr_b_addr; 648 a_addr_o 649 b_addr_o 650 Z_{-}O = vr_z; <= vr_z_addr; 651 z_addr_o 652 wr_{-0} <= vr_wr; <= '1'; 653 vr_start if vr_done = '1' then 654 next_state <= COMPUTESYNDROME; 655 end if; 656 657 when COMPUTESYNDROME => 658 a_addr_o \leq vc_aaddr; = vc_b_addr; b_addr_o 659 660 Z_O <= vc_z; 661 z_addr_o \leq vc_z_addr; <= vc_-wr; 662 wr_o 663 vc\_start <= eh_start_mul; '1'; 664 vc_binary <= 665 vc\_vec\_base <= eh_vec_base;</pre> 666 vc\_blk\_base <= eh_blk_base;</pre> 667 vc_blk_weight \leq = sum(M)*DV; 668 vc_res_base \leq = eh_res_base; 'o'; 669 vc_op \leq = <= '1'; 670 eh_start if eh_done = '1' then 671 next_state <= CHECK_SYNDROME; 672 end if; 673 674 when CHECK_SYNDROME => a_addr_o = ns_aaddr; 675 b_addr_o 676 \leq ns_b_addr; 677 Z_{-}O \leq n s_z; 678 z_addr_o \leq ns_zaddr; 679 wr_{-o} <= ns_wr; 680 ns_start <= '1'; if ns_null_syn = '1' or l_r >= 20 then 681 682 next_state <= COMPUTE_MESSAGE; 683 684 685 next_state <= CLEAR_TEMP_AND_LOOP; 686 end if; 687 end if; when CLEAR_TEMP_AND_LOOP => 688 a_addr_o <= ct_aaddr; 689 690 b_addr_o \leq= ct_b_addr; 691 <= \ c\,t_{\,-}z\;; Z O 692 z_addr_o <= ct_z_addr; 693 wr_{-O} <= ct_-wr; '1'; 694 ct_start <= if ct_done = '1' then 695 696 l _ c <= l_r + 1; next_state <= COMPUTE.SIGMA; 697 698 end if; when COMPUTE_MESSAGE => 699 <= cm_a_addr; 700 a_addr_o b_addr_o \leq cm_b_addr; ``` ``` 702 Z_{-}O <= cm_z; 703 704 705 if cm_done = '1' then 706 707 next_state <= CLEAR_TEMP_AND_RETURN; end if; when CLEAR_TEMP_AND_RETURN => 708 709 710 a_addr_o = ct_aaddr; 711 b\_addr\_o <= \quad ct\_b\_addr\;; 712 Z_{-}O <= ct_z; 713 z_addr_o <= ct_z_addr; 714 <= ct_wr; <= '1'; wr_{-o} 715 ct_start if ct_done = '1' then 716 next_state <= DONE; 717 718 end if; \underline{\text{when DONE}} = > 719 done_o <= '1'; if l_r >= 20 then 720 721 failure_o <= '1'; 722 end if; 723 if start_i = '0' then 724 725 next_state <= IDLE; end if; 726 727 728 null; 729 end case; 730 end process comb; 731 732 end architecture rtl; ``` ## Bibliography - [1] Marco Baldi, Alessandro Barenghi, Franco Chiaraluce, Gerardo Pelosi, and Paolo Santini, "LEDAkem: a post-quantum key encapsulation mechanism based on QC-LDPC codes", arXiv:1801.08867v1 [cs.CR] 26 Jan 2018 - [2] https://en.wikipedia.org/wiki/Insertion\_sort, rev. 11 May 2018 - [3] James F. Kurose, Keith W. Ross, "Computer Networking: a top-down approach", Pearson 2013 - [4] Michele Elia, "An Introduction to Classic Cryptography", Aracne Editrice 2018 - [5] R. Rivest, A. Shamir, L. Adleman, "A Method for Obtaining Digital Signatures and Public-Key Cryptosystems", Communications of the ACM, February 1978, pp. 120-126 - [6] Peter Shor, "Polynomial-Time Algorithms for Prime Factorization and Discrete Logarithms on a Quantum Computer", arXiv:quant-ph/9508027v2 25 Jan 1996 - [7] Gordon E. Moore, "Cramming more components onto integrated circuits", Electronics, Volume 38, Number 8, April 19, 1965 - [8] Robert J. McEliece, "A Public-Key Cryptosystem Based On Algebraic Coding Theory", DSN Progress Report 42-44, January-February 1978, pp. 114-116 - [9] David A. Patterson, Garth Gibson, Randy H. Katz, "A Case for Redundant Arrays of Inexpensive Disks (RAID)", ACM 1988 - [10] https://en.wikipedia.org/wiki/Merge\_sort, rev. 25 November 2018