Ilaria Pilo
Search With Your Head! A Study of Learned Indices for Hashing and Sequence Alignment.
Rel. Paolo Garza. Politecnico di Torino, Corso di laurea magistrale in Ingegneria Informatica (Computer Engineering), 2024
PDF (Tesi_di_laurea)
- Tesi
Restricted to: Repository staff only until 11 April 2025 (embargo date). Licenza: Creative Commons Attribution Non-commercial No Derivatives. Download (3MB) |
Abstract: |
In the era of Machine Learning technology, its impact has extended across various domains, reaching countless applications. One result of this trend is the rise of learned indices, which employ Machine Learning models to enhance lookup. Among these structures, the Recursive Model Index (RMI) has proven to be particularly successful. In this work, we investigate the utilization of learned indices in two specific use cases. In Part I, we discuss the application of learned indices as hash functions, comparing them with many state-of-the-art options. The efficiency of these learned functions strongly correlates with the data distribution, and they can yield up to a 1.4x increase in probe throughput for simple datasets. However, the necessity of both a sort and a training phase makes them unsuitable to support non-partitioned join operations. To hide cache stalls, we propose an implementation of the RMI-based bucket chaining hash table using Instruction Stream Interleaving. This structure increases the probe throughput up to 5x for high load factors, and eliminates the throughput variations between RMIs of different complexity. In Part II, we apply the RMI to the genomic aligner Accel-Align. On 16 cores, while the index introduces an 8.4x slowdown in search time, it provides a precise set of candidate locations, accelerating the embedding phase by 1.5x and the overall alignment by 1.25x. It also increases accuracy to 97.25%, an all-time maximum for Accel-Align. However, the RMI is too precise to the detriment of time, and alternative options with more balanced tradeoffs are available. Our proof-of-concept XXH32 hash table aligner, for instance, surpasses it in terms of time (but with slightly lower accuracy). Both the learned function benchmarking repository [github.com/IlariaPilo/new-hashing-benchmark] and the Accel-Align-RMI implementation [github.com/IlariaPilo/accel-align-rmi] are available on GitHub. |
---|---|
Relators: | Paolo Garza |
Academic year: | 2023/24 |
Publication type: | Electronic |
Number of Pages: | 88 |
Subjects: | |
Corso di laurea: | Corso di laurea magistrale in Ingegneria Informatica (Computer Engineering) |
Classe di laurea: | New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING |
Ente in cotutela: | INSTITUT EURECOM (FRANCIA) |
Aziende collaboratrici: | Eurecom |
URI: | http://webthesis.biblio.polito.it/id/eprint/31114 |
Modify record (reserved for operators) |