Politecnico di Torino

Deep Recommender Models Data Flow Optimization for AI Accelerators

Giuseppe Ruggeri


Rel. Daniele Jahier Pagliari. Politecnico di Torino, Corso di laurea magistrale in Ingegneria Informatica (Computer Engineering), 2023

License: Creative Commons Attribution Non-commercial No Derivatives.


Deep Learning-based Recommender Models (DLRMs) have become indispensable tools for businesses to provide effective personalized recommendations to end users. The workload introduced by these models is therefore extremely relevant, representing, for instance, more than 79% of the AI workload in Meta's data centers. Optimizing such models is thus crucial and can lead to significant energy savings, as well as increased throughput and better real-time responsiveness.

State-of-the-art DLRMs suffer from severe performance limitations due to embedding layers, which project sparse categorical features onto dense, continuous embedding vectors. In particular, the bottleneck lies in the large number of random memory accesses performed to retrieve a multitude of small embedding vectors from look-up tables stored in off-chip memory. To mitigate this issue, some existing approaches exploit the large bandwidth offered by High Bandwidth Memory (HBM), while others propose building clusters of heterogeneous nodes that exploit the advantages of each platform. Furthermore, some methods model embedding access patterns to place "hot" rows in a cache, and/or build an entire hierarchical memory system tailored to the embedding lookups. However, existing approaches are limited by the variable size of the models (from a few MBs to hundreds of GBs), as well as by their dependency on input query distributions.

The goal of this thesis is the study and design of embedding lookup dataflows for the Huawei Ascend AI processors (Ascend 310/910), focusing in particular on exploiting the available software-controlled on-chip buffers (scratchpad memories) as effectively as possible. More specifically, the work focuses on four different strategies that determine which buffer is used to store embedding tables and how lookups are performed.
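As an illustration of the bottleneck described above (a sketch, not code from the thesis), an embedding lookup is essentially a row gather from a table indexed by sparse categorical IDs; the table shape, batch size, and variable names below are hypothetical:

```python
import numpy as np

# Hypothetical embedding table: many small vectors of dimension 16.
# In real DLRMs, tables range from a few MBs to hundreds of GBs.
num_rows, dim = 1_000_000, 16
table = np.random.rand(num_rows, dim).astype(np.float32)

# A batch of sparse categorical IDs; random IDs mean each lookup is
# an effectively random access into off-chip memory.
batch = np.random.randint(0, num_rows, size=256)

# The lookup itself: one small, scattered read per ID. This gather
# pattern, not arithmetic, is what limits DLRM throughput.
vectors = table[batch]            # shape: (256, 16)

# Many DLRMs then pool the gathered vectors (e.g., a sum reduction).
pooled = vectors.sum(axis=0)      # shape: (16,)
```

Because each gathered row is tiny relative to a DRAM burst, the access pattern wastes bandwidth, which is what motivates keeping hot tables in fast on-chip buffers.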
Two of the strategies build on the idea of persistently preloading the tables into one relatively large on-chip L1 buffer, so that fast lookups can then be performed from there. Moreover, since the AI accelerators are multi-core, the embedding layer workload is split following two different parallelization approaches. One exploits the classical single-instruction-multiple-data (SIMD) paradigm, which splits the input batch evenly across the cores. The other leverages a multiple-instruction-multiple-data (MIMD) paradigm, resulting in asymmetric core execution and unlocking the possibility of splitting the tables into chunks preloaded into the L1 buffer of specific cores. Since different strategies are more effective for differently shaped tables, two policy optimization problems are solved through heuristics and greedy solutions.

Through extensive experiments on both real embedding tables from a production model and synthetic ones, the proposed strategies and policies are compared against a black-box baseline obtained from a sophisticated compiler (ATC), which applies various optimizations and exploits built-in operators written by experts. Results show that the baseline is extremely dependent on the input query distribution, suffering a performance drop of more than one order of magnitude when the input query is fixed. In contrast, the proposed strategies are not only independent of the input query distribution, but also provide better throughput vs. worst-case latency trade-offs than performing lookups directly from global memory for the majority of the considered combinations of table dimensions and input distributions.
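The two parallelization schemes above can be sketched in plain Python; the core count, batch, and routing rule are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

num_cores = 4
num_rows = 1_000            # rows in one hypothetical embedding table
batch = np.arange(256)      # query IDs for that table

# SIMD-style split: every core runs the same lookup kernel on an
# even slice of the input batch.
simd_shards = np.array_split(batch, num_cores)

# MIMD-style split: the *table* is partitioned into contiguous row
# chunks, each preloaded into the L1 buffer of one core; a query ID
# is routed to the core owning its chunk, so per-core work is uneven
# (asymmetric execution).
chunk = num_rows // num_cores
owner = np.minimum(batch // chunk, num_cores - 1)
mimd_shards = [batch[owner == c] for c in range(num_cores)]
```

Note how the SIMD shards are balanced by construction, while the MIMD shards depend on which rows the queries hit, which is why the strategy choice per table becomes an optimization problem.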

Relators: Daniele Jahier Pagliari
Academic year: 2022/23
Publication type: Electronic
Number of Pages: 107
Degree course: Corso di laurea magistrale in Ingegneria Informatica (Computer Engineering)
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: Huawei Technologies Switzerland AG
URI: http://webthesis.biblio.polito.it/id/eprint/27690