Exploiting Large Language Models for Relational Database Design

Alessia Tierno

Exploiting Large Language Models for Relational Database Design.

Supervisors: Silvia Anna Chiusano, Alessandro Fiori, Andrea Avignone. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2024

Abstract:	Large language models (LLMs) have spread rapidly and widely, profoundly affecting both the workplace and education. The need to make the most of these tools has given rise to a new discipline: prompt engineering. It concerns the development and optimization of prompts, the initial inputs provided to a language model to generate responses, with the goal of maximizing LLM capabilities while also revealing their limitations and areas for potential improvement. One area where development remains limited is database design, even though structuring data for efficient management is a crucial step that demands adherence to rules and constraints. One of the most important tools for database design is the Entity-Relationship (ER) model. The aim of this thesis is therefore to automate the generation of textual descriptions from ER models, and vice versa, in order to facilitate the modelling process and reduce both the risk of human error and the workload on designers. The work is also intended to serve as an educational support tool for teachers and as an interactive learning tool for students, allowing them to see in real time how their descriptions are transformed into ER models. To make the work more useful and to visualize the graphical representation of ER models, it has been integrated with designER, a web application for database design. Furthermore, the texts used are taken from exam topics and exercises proposed in the "Introduction to Databases" courses, so that the tool can serve as a teaching aid for those courses. For this research, a pipeline has been developed to manage all aspects of the process, including pre-processing, prompt generation, post-processing, and evaluation across different constructs and different LLMs.
The pipeline was built around multiple large language models (GPT, Llama and Mistral), with the aim of comparing their limitations and identifying which one produces the best results with minimal pre-processing. The final statistical results suggest that GPT and Llama require more pre-processing to comprehend the elements of ER models, even for simple examples, while Mistral requires less pre-processing and handles complexity better. A crucial observation is that all models demonstrate considerable creativity and an ability to understand the context starting from an ER model. Despite these abilities, limitations and some recurring errors persist, highlighting areas of potential improvement for LLMs.
Supervisors:	Silvia Anna Chiusano, Alessandro Fiori, Andrea Avignone
Academic year:	2023/24
Publication type:	Electronic
Number of pages:	102
Additional information:	Confidential thesis. Full text not available
Subjects:
Degree programme:	Master's degree programme in Computer Engineering (Ingegneria Informatica)
Degree class:	New degree system > Master's degree > LM-32 - COMPUTER ENGINEERING
Collaborating companies:	Not specified
URI:	http://webthesis.biblio.polito.it/id/eprint/31832
