Towards User-Friendly NoSQL: A Synthetic Dataset Approach and Large Language Models for Natural Language Query Translation

Alessandro Tola

Towards User-Friendly NoSQL: A Synthetic Dataset Approach and Large Language Models for Natural Language Query Translation.

Supervisors: Lorenzo Bottaccioli, Alessandro Aliberti, Edoardo Patti. Politecnico di Torino, Master's degree programme in Computer Engineering (Corso di laurea magistrale in Ingegneria Informatica), 2024

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

This thesis addresses contemporary challenges in managing extensive datasets, with a specific focus on the transition from traditional relational databases to non-relational (NoSQL) databases. It aims to make NoSQL databases more accessible to non-expert users through natural language queries. Given the prevalence of non-relational databases across industries and the need for effective natural language interfaces, the primary contributions of this research are a synthetic dataset creation method and the use of Large Language Models (LLMs) for natural language to NoSQL translation. The choice to build a synthetic dataset stems from the absence of an existing dataset tailored to the specific requirements of the research. The dataset, created for NL-to-SQL translation, incorporates the WikiSQL dataset and leverages query templates, NL templates, and data augmentation strategies. The method draws on established methodologies to guide the creation of the synthetic dataset, effectively addressing the time and resource constraints inherent in manual pairing. The evaluation indicates that the synthetic dataset is well-structured, diverse, and efficiently optimized for training natural language to NoSQL translation models. The model section outlines the fine-tuning process used to refine LLMs and enhance their performance in the specific task of translating natural language queries into NoSQL queries. Supervised fine-tuning follows a Parameter-Efficient Fine-Tuning (PEFT) methodology through QLoRA. Optimal prompt design takes into account user language and database context. Fine-tuning the LLaMa2 LLM yields good improvements in translating natural language queries into NoSQL queries.
Comparing the fine-tuned model with the base model reveals significant advancements, at the cost of a slight decrease in the fine-tuned model's generalization capacity, especially for requests that deviate significantly from those in the training dataset. Fine-tuning larger models is a promising avenue for overall performance improvement, although challenges related to memory constraints and GPU limitations would need to be addressed. This research aims to contribute valuable insights to the field of natural language interfaces for NoSQL databases, addressing the critical need for a tailored dataset in the early stages of the research and highlighting potential improvements and the continuing evolution of such interfaces.
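
As an illustration of the template-based pairing the abstract describes, the sketch below fills NL templates and a query template with the same slot values to produce aligned question/query pairs. It is not taken from the thesis: the template wording, the function names, the example slot values, and the choice of a MongoDB-style `find()` call as the target NoSQL syntax are all assumptions made for this example, and the data augmentation step is not shown.

```python
import json
from itertools import product

# Hypothetical NL templates; the thesis's actual templates are not given in the abstract.
NL_TEMPLATES = [
    "Show all {collection} where {field} is {value}",
    "Which {collection} have {field} equal to {value}?",
]


def build_query(collection, field, value):
    """Fill the query template. A MongoDB-style find() string is an assumed target."""
    return f"db.{collection}.find({json.dumps({field: value})})"


def generate_pairs(collection, slots):
    """Return one (question, query) pair per NL template and slot filling."""
    pairs = []
    for template, (field, value) in product(NL_TEMPLATES, slots):
        question = template.format(collection=collection, field=field, value=value)
        pairs.append({"question": question, "query": build_query(collection, field, value)})
    return pairs


# Illustrative slot values; in practice these might be drawn from WikiSQL tables.
pairs = generate_pairs("players", [("position", "guard"), ("team", "Toronto")])
for pair in pairs:
    print(pair["question"], "->", pair["query"])
```

Because every NL template is combined with every slot filling, the number of pairs grows multiplicatively, which is what makes this kind of generation cheaper than manual pairing.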

Supervisors: Lorenzo Bottaccioli, Alessandro Aliberti, Edoardo Patti
Academic year: 2023/24
Publication type: Electronic
Number of pages: 36
Subjects:
Degree programme: Master's degree programme in Computer Engineering (Corso di laurea magistrale in Ingegneria Informatica)
Degree class: New system > Master's degree > LM-32 - INGEGNERIA INFORMATICA (Computer Engineering)
Collaborating companies: Not specified
URI: http://webthesis.biblio.polito.it/id/eprint/31060