polito.it
Politecnico di Torino (logo)

Search With Your Head! A Study of Learned Indices for Hashing and Sequence Alignment

Ilaria Pilo

Search With Your Head! A Study of Learned Indices for Hashing and Sequence Alignment.

Rel. Paolo Garza. Politecnico di Torino, Corso di laurea magistrale in Ingegneria Informatica (Computer Engineering), 2024

[img] PDF (Tesi_di_laurea) - Tesi
Accesso riservato a: Solo utenti staff fino al 11 Aprile 2025 (data di embargo).
Licenza: Creative Commons Attribution Non-commercial No Derivatives.

Download (3MB)
Abstract:

In the era of Machine Learning technology, its impact has extended across various domains, reaching countless applications. One result of this trend is the rise of learned indices, which employ Machine Learning models to enhance lookup. Among these structures, the Recursive Model Index (RMI) has proven to be particularly successful. In this work, we investigate the utilization of learned indices in two specific use cases. In Part I, we discuss the application of learned indices as hash functions, comparing them with many state-of-the-art options. The efficiency of these learned functions strongly correlates with the data distribution, and they can yield up to a 1.4x increase in probe throughput for simple datasets. However, the necessity of both a sort and a training phase makes them unsuitable to support non-partitioned join operations. To hide cache stalls, we propose an implementation of the RMI-based bucket chaining hash table using Instruction Stream Interleaving. This structure increases the probe throughput up to 5x for high load factors, and eliminates the throughput variations between RMIs of different complexity. In Part II, we apply the RMI to the genomic aligner Accel-Align. On 16 cores, while the index introduces an 8.4x slowdown in search time, it provides a precise set of candidate locations, accelerating the embedding phase by 1.5x and the overall alignment by 1.25x. It also increases accuracy to 97.25%, an all-time maximum for Accel-Align. However, the RMI is too precise to the detriment of time, and alternative options with more balanced tradeoffs are available. Our proof-of-concept XXH32 hash table aligner, for instance, surpasses it in terms of time (but with slightly lower accuracy). Both the learned function benchmarking repository [github.com/IlariaPilo/new-hashing-benchmark] and the Accel-Align-RMI implementation [github.com/IlariaPilo/accel-align-rmi] are available on GitHub.

Relatori: Paolo Garza
Anno accademico: 2023/24
Tipo di pubblicazione: Elettronica
Numero di pagine: 88
Soggetti:
Corso di laurea: Corso di laurea magistrale in Ingegneria Informatica (Computer Engineering)
Classe di laurea: Nuovo ordinamento > Laurea magistrale > LM-32 - INGEGNERIA INFORMATICA
Ente in cotutela: INSTITUT EURECOM (FRANCIA)
Aziende collaboratrici: Eurecom
URI: http://webthesis.biblio.polito.it/id/eprint/31114
Modifica (riservato agli operatori) Modifica (riservato agli operatori)