Javier Jesus Poveda Rodrigo
Inference optimization of Large Language Models on RISC-V HPC platforms.
Supervisors: Daniele Jahier Pagliari, Mohamed Amine Hamdi, Alessio Burrello. Politecnico di Torino, Corso di laurea magistrale in Ingegneria Informatica (Computer Engineering), 2024
License: Creative Commons Attribution Non-commercial No Derivatives.
Abstract:
Over the past decade, Artificial Intelligence (AI) has improved significantly, particularly in natural language processing (NLP), thanks to the emergence of Transformers and, more generally, Large Language Models (LLMs). These models have enabled numerous deep-learning applications such as translation, text generation, image generation, and many others. However, transformer-based models present new challenges because of their computationally intensive attention mechanisms and extremely high memory footprint. Even though these workloads are typically offloaded to GPUs, some applications and use cases require the CPU as the workhorse because of its reduced cost and greater flexibility. For instance, while training is too burdensome for CPU environments, CPUs are suitable for single-example or even batched inference. Although the first CPU-based solutions have been developed, extensive research is ongoing to further enhance AI applications on these platforms, aiming to fully unlock the computational power already available. In particular, recent RISC-V many-core SoCs capable of High-Performance Computing (HPC) workloads are emerging, paving the way to a fully open-source ecosystem based on the RISC-V paradigm. However, these platforms still lack reliable support in their toolchains and libraries, which either do not properly target the different hardware platforms or do not reach the optimization level of their x86/ARM counterparts. For instance, Basic Linear Algebra Subprograms (BLAS) libraries, central to AI-centric workloads, are suboptimal on RISC-V and lack support for the different versions of the Vector extension. Furthermore, most available alternatives, such as auto-vectorization, are unreliable or do not support the vector units of RISC-V cores.
Therefore, this thesis analyzes the current status of the toolchains and the critical bottlenecks in LLM inference, aiming to improve the inference performance of state-of-the-art LLMs and frameworks on RISC-V multi-core CPUs. The work builds upon the llama.cpp open-source inference framework and its ggml tensor library backend. On top of llama.cpp, we apply several many-core-aware modifications and optimizations, such as NUMA-aware thread dispatch and tuning of thread spawning based on the computation. Additionally, we propose new GEMM and GEMV implementations, containing kernels able to exploit vector extensions and other basic optimization techniques (e.g., loop unrolling and weight sharing), through which we avoid incurring substantial overheads and achieve improved performance. We carried out experiments on the Milk-V Pioneer, a system built around the Sophon SG2042 RISC-V CPU. The chip features a Network-on-Chip (NoC) architecture with 64 T-Head C920 cores distributed in 16 clusters, with 4 NUMA memory regions and three levels of cache memory. Each core is clocked at up to 2 GHz, implements the base instruction set IMAFDC with a 9- to 12-stage pipeline, and offers several advanced features, including support for the RISC-V Vector extension in its draft version 0.7.1. This thesis demonstrates improved LLM inference on the Sophon SG2042 compared to the state of the art by exploring the optimal multi-core distribution of the platform and the management of the NUMA regions, and by adding custom kernels, showing the potential of integrating RISC-V systems into modern HPC environments.
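To illustrate one of the basic optimization techniques the abstract names, the sketch below shows a scalar GEMV (matrix-vector product, y = A·x) inner loop with 4-way loop unrolling and independent accumulators. This is only a minimal, hypothetical example of the technique, not the thesis's actual kernel: the real implementations target ggml's quantized tensor formats and the RVV 0.7.1 vector unit of the C920 cores.

```c
#include <stddef.h>

/* Minimal GEMV sketch: y = A * x for a rows x cols row-major matrix A.
 * The inner loop is unrolled 4-way with separate accumulators so the
 * independent multiply-adds can overlap in the pipeline; a scalar
 * remainder loop handles cols not divisible by 4.
 * Illustrative only; the thesis kernels additionally use RVV 0.7.1
 * vector instructions and quantized weights. */
void gemv_unrolled(const float *A, const float *x, float *y,
                   size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; i++) {
        const float *row = A + i * cols;
        float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
        size_t j = 0;
        for (; j + 4 <= cols; j += 4) {   /* 4-way unrolled body */
            acc0 += row[j + 0] * x[j + 0];
            acc1 += row[j + 1] * x[j + 1];
            acc2 += row[j + 2] * x[j + 2];
            acc3 += row[j + 3] * x[j + 3];
        }
        float acc = acc0 + acc1 + acc2 + acc3;
        for (; j < cols; j++)             /* remainder columns */
            acc += row[j] * x[j];
        y[i] = acc;
    }
}
```

A vectorized variant would replace the unrolled body with strip-mined vector loads and fused multiply-accumulates, which is where support for the platform's draft Vector extension becomes essential.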
Supervisors: Daniele Jahier Pagliari, Mohamed Amine Hamdi, Alessio Burrello
Academic year: 2024/25
Publication type: Electronic
Number of pages: 98
Subjects: (none)
Degree course: Corso di laurea magistrale in Ingegneria Informatica (Computer Engineering)
Degree class: New system > Master's degree > LM-32 - COMPUTER ENGINEERING
Collaborating companies: NOT SPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/33331