Simone Pistilli
Case study of implementing a variable precision floating-point multiplier using HLS.
Supervisor: Massimo Poncino. Politecnico di Torino, Corso di laurea magistrale in Ingegneria Informatica (Computer Engineering), 2023
Abstract: |
High-precision floating-point computing improves numerical stability and is widely useful in scientific computing. The VRP (VaRiable Precision) unit is a RISC-V accelerator designed to speed up this type of operation, supporting up to 512 bits of mantissa and 18 bits of exponent. This work addresses the accelerator's major bottleneck: the floating-point multiplier, which is much slower than the other hardware operators. The product of the two mantissas is currently computed by iterating over a single 64-bit multiplier, which takes up to 64 clock cycles, whereas the other units require at most 4 clock cycles. This architecture also limits throughput to 1/64, since the inputs must be held stable during the computation. The goal of this research is to improve the mantissa multiplication so as to minimize the latency gap between the multiplier and the other units. To this end, the work compares fixed-point mantissa multiplication algorithms through High-Level Synthesis (HLS) hardware implementations. Two algorithms are explored: Comba and Karatsuba-Comba. The first admits different implementations that vary the number of 64-bit multipliers, with a throughput ranging from 1/64 to 1/15. The second uses a pipeline that can theoretically raise throughput to 1/3, but at a much higher area cost. Unfortunately, because of timing constraints and some limitations of Vivado HLS, the implementation of this algorithm has some architectural defects. The implementation of the Comba algorithm behaves as expected and can be integrated into the VRP unit. Measured on the Veloce emulator, the floating-point multiplication of the VRP shows improved timing, with the mean latency improving to as little as 15.10 and the throughput reaching up to 0.07 (~1/15).
The Karatsuba-Comba implementation does not yield a valid multiplier design, but a correct implementation would outperform the Comba multiplier in throughput, reaching up to 1/3. In summary, the Comba multiplier is the more reasonable choice for hardware multiplication with limited area, while Karatsuba-Comba is better suited to high-precision mantissas (>=512 bits) when area cost is not a major concern. |
---|---|
Supervisors: | Massimo Poncino |
Academic year: | 2022/23 |
Publication type: | Electronic |
Number of Pages: | 123 |
Additional Information: | Thesis under embargo. Full text not available |
Subjects: | |
Degree programme: | Corso di laurea magistrale in Ingegneria Informatica (Computer Engineering) |
Degree class: | New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING |
Collaborating companies: | CEA - LIST |
URI: | http://webthesis.biblio.polito.it/id/eprint/27669 |
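
The abstract names two multiplication schemes, Comba and Karatsuba-Comba, without showing them. Since the full text is not available, the following is only a minimal software sketch of the textbook forms of these algorithms on 64-bit limbs. It is not the thesis's HLS code. All names (`to_limbs`, `from_limbs`, `comba_multiply`, `karatsuba_comba`) are hypothetical, and the one-level Karatsuba split is an assumption about how the two schemes might be combined.

```python
LIMB_BITS = 64
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(value, n_limbs):
    """Split a non-negative integer into n_limbs little-endian 64-bit limbs."""
    return [(value >> (LIMB_BITS * i)) & LIMB_MASK for i in range(n_limbs)]


def from_limbs(limbs):
    """Recombine little-endian 64-bit limbs into a single integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def comba_multiply(a, b):
    """Product-scanning (Comba) multiplication of two equal-length limb lists.

    Each output limb (column k) is obtained by summing every partial
    product a[i] * b[k - i] into one accumulator, then carrying the
    upper bits of the accumulator into the next column.
    """
    n = len(a)
    assert len(b) == n
    result = [0] * (2 * n)
    acc = 0  # column accumulator; unbounded here, a wide register in hardware
    for k in range(2 * n - 1):
        lo = max(0, k - n + 1)
        hi = min(k, n - 1)
        for i in range(lo, hi + 1):
            acc += a[i] * b[k - i]
        result[k] = acc & LIMB_MASK
        acc >>= LIMB_BITS
    result[2 * n - 1] = acc  # remaining carry is the most significant limb
    return result


def karatsuba_comba(x, y, n_limbs):
    """One Karatsuba level on top of Comba sub-products (assumed structure).

    Requires an even n_limbs and operands below 2**(64 * n_limbs).
    Uses three half-size multiplications instead of four.
    """
    assert n_limbs % 2 == 0
    half = n_limbs // 2
    half_bits = half * LIMB_BITS
    half_mask = (1 << half_bits) - 1
    x0, x1 = x & half_mask, x >> half_bits
    y0, y1 = y & half_mask, y >> half_bits

    def mul(u, v, limbs):
        return from_limbs(comba_multiply(to_limbs(u, limbs), to_limbs(v, limbs)))

    z0 = mul(x0, y0, half)
    z2 = mul(x1, y1, half)
    # The sums can be one bit wider than a half operand, hence half + 1 limbs.
    z1 = mul(x0 + x1, y0 + y1, half + 1) - z0 - z2
    return z0 + (z1 << half_bits) + (z2 << (2 * half_bits))
```

For a 512-bit mantissa split into 8 limbs, `comba_multiply` performs 8 x 8 = 64 limb products arranged in 15 columns. These counts appear consistent with the 1/64 and 1/15 throughput figures quoted in the abstract, although the mapping of products and columns to clock cycles is an inference rather than something the record states.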