
Weights Compression for Efficient Convolutional Neural Networks Acceleration on FPGA

Giovanni Cascone


Supervisors: Luciano Lavagno, Giovanni Brignone, Roberto Bosio, Teodoro Urso. Politecnico di Torino, Master's degree programme in Mechatronic Engineering (Ingegneria Meccatronica), 2025

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.
Download (2MB)
Abstract:

Convolutional Neural Networks (CNNs) have significantly advanced image recognition and computer vision. Their growing size and complexity are driven by the need for higher accuracy and the ability to tackle more complex tasks. Larger networks can learn richer, more abstract features at multiple levels, enabling them to recognize not only basic patterns (e.g., edges and textures) but also more complex structures such as objects and faces, even under challenging conditions like varying lighting or cluttered environments. CNNs are scalable, but larger networks demand more memory and compute capacity. This often becomes a problem on FPGAs, which have stringent resource constraints, and this work tackles those constraints to enable effective implementation.

This thesis addresses the challenge of deploying large Convolutional Neural Networks, such as MobileNet or ResNet-50, on FPGAs. To overcome the limited on-chip memory capacity, the network weights (which account for the majority of the memory footprint) are first compressed offline using entropy-based techniques and stored in external DDR. The compressed weights are then transferred to the FPGA's on-chip BRAM and decompressed in hardware by a dedicated decompressor (implemented in Vitis HLS) before being fed directly to the convolutional layers for computation.

This approach allows larger neural networks to fit within FPGAs that would otherwise support only smaller models. By reducing the memory footprint of the weights and the required bandwidth between the FPGA and external memory, the method significantly improves system scalability. However, the decompression process introduces challenges, particularly in terms of throughput, which can become a bottleneck for real-time applications.

Various compression techniques from the literature were explored, including pruning, weight clustering, low-rank factorization, arithmetic coding, and run-length encoding (RLE), weighing their pros and cons. Ultimately, encoding-based methods were chosen because they provide decent compression ratios while remaining lossless, ensuring no accuracy loss after decompression. Specifically, Gzip compression was used, which employs the Deflate algorithm, combining LZ77 and dynamic Huffman coding, to balance efficiency and feasibility.

The approach was initially tested on smaller networks such as ResNet-8, where both compression and decompression were evaluated. This was done for a single layer, but it can be extended to all layers with appropriate parallelization. For larger networks, only the compression of already quantized weights (fixed-point, 8-bit) was tested: the weights were compressed layer by layer (considering only the convolutional layers), and a weighted average of the per-layer compression factors was computed to obtain the overall compression ratio across the entire network. This resulted in an average total compression of approximately 30-35%. Future work could focus on improving throughput to further optimize the deployment of large-scale CNNs on FPGAs.
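As a concrete illustration of the offline side of this flow, the sketch below compresses each convolutional layer's quantized int8 weights with Gzip (Deflate) and computes the size-weighted average compression factor described in the abstract. It is a minimal model under stated assumptions, not the thesis code: the layer names and shapes are placeholders, and real quantized weights (with their skewed value distributions) compress better than the random stand-in data used here.

```python
import gzip
import numpy as np

def compress_layer(weights: np.ndarray) -> bytes:
    """Gzip (Deflate: LZ77 + dynamic Huffman) over a layer's int8 weights."""
    return gzip.compress(weights.astype(np.int8).tobytes(), compresslevel=9)

def overall_compression(layers: dict) -> float:
    """Average the per-layer compression factors weighted by each layer's
    original size; this reduces to total_out / total_in over the network."""
    total_in = total_out = 0
    for name, w in layers.items():
        raw = w.astype(np.int8).tobytes()
        blob = gzip.compress(raw, compresslevel=9)
        print(f"{name}: {len(raw)} B -> {len(blob)} B "
              f"(factor {len(blob) / len(raw):.2f})")
        total_in += len(raw)
        total_out += len(blob)
    return 1.0 - total_out / total_in

# Placeholder conv-layer tensors standing in for quantized weights.
rng = np.random.default_rng(0)
layers = {
    "conv1": rng.integers(-8, 8, size=(64, 3, 7, 7), dtype=np.int8),
    "conv2": rng.integers(-8, 8, size=(128, 64, 3, 3), dtype=np.int8),
}
print(f"overall size reduction: {overall_compression(layers):.1%}")
```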

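Because the chosen encoding is lossless, the decompressor can be validated functionally by a bit-exact round trip: whatever the hardware decompressor streams into the convolutional layers must match the original weights exactly. A hypothetical check along these lines (again in Python, standing in for the HLS verification flow, with an illustrative weight tensor):

```python
import gzip
import numpy as np

def roundtrip_ok(weights: np.ndarray) -> bool:
    """Compress, decompress, and compare bit-exactly: lossless coding
    guarantees the restored weights (and hence accuracy) are unchanged."""
    blob = gzip.compress(weights.tobytes(), compresslevel=9)
    restored = np.frombuffer(gzip.decompress(blob), dtype=weights.dtype)
    return np.array_equal(weights.ravel(), restored)

# Illustrative int8 tensor shaped like a 3x3 convolutional layer.
w = np.random.default_rng(1).integers(-128, 128, size=(64, 64, 3, 3),
                                      dtype=np.int8)
assert roundtrip_ok(w)  # no accuracy loss after decompression
```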
Supervisors: Luciano Lavagno, Giovanni Brignone, Roberto Bosio, Teodoro Urso
Academic year: 2024/25
Publication type: Electronic
Number of pages: 66
Subjects:
Degree programme: Master's degree programme in Mechatronic Engineering (Ingegneria Meccatronica)
Degree class: New regulations > Master's degree > LM-25 - INGEGNERIA DELL'AUTOMAZIONE
Collaborating companies: Politecnico di Torino
URI: http://webthesis.biblio.polito.it/id/eprint/35294