MID: A New Strategy for Learning Optimal Decision Trees on Continuous Data

Antonio Dal Maso

MID: A New Strategy for Learning Optimal Decision Trees on Continuous Data.

Rel. Elena Maria Baralis, Siegfried Nijssen. Politecnico di Torino, Corso di laurea magistrale in Data Science And Engineering, 2024

Preview

PDF (Tesi_di_laurea) - Tesi
Licenza: Creative Commons Attribution Non-commercial No Derivatives.
Download (7MB) | Preview

Abstract:	Optimal Decision Tree (ODT) algorithms, unlike greedy methods, are designed to find the best decision tree on training data while ensuring constraints are satisfied, such as on the depth of the tree. As these techniques are typically designed for binary datasets, they often require continuous features to be discretized before the learning process — a step that can significantly impact the quality and efficiency of the resulting decision tree. In fact, discretizers give no guarantees that the resulting tree is optimal on the training data, and they often make it unfeasible in practice to find an ODT. The focus of this thesis is a new approach for learning decision trees on continuous data that combines a new discretization algorithm, MID (acronym for "Minimum Impurity Discretizer"), with ODT learners. The core idea behind MID is to create a ranked list of binary features, obtained considering all potential thresholds across all features of a continuous dataset, in such a manner that an ODT algorithm can repeatedly be run for a growing number of binary features. The advantage of such an approach is that, given infinite time, it allows an optimal decision tree to be found. However, the search process can be interrupted at any time for a smaller number of features while still producing a tree of good quality. While this approach can be applied to any optimal decision tree learner, this work specifically examines DL8.5, which incorporates concepts from the itemset mining domain and a recursive depth-first branch-and-bound strategy to efficiently explore the search space of the solutions. Experiments on 14 datasets were conducted both to assess the quality of the features produced by MID and to explore the possibility of further optimizing the execution of DL8.5. Although the latter goal was not achieved, the experiments revealed that MID effectively generates high-quality binary features: DL8.5 trained on such features outperformed both traditional greedy methods and other discretization techniques.
Relatori:	Elena Maria Baralis, Siegfried Nijssen
Anno accademico:	2024/25
Tipo di pubblicazione:	Elettronica
Numero di pagine:	53
Soggetti:
Corso di laurea:	Corso di laurea magistrale in Data Science And Engineering
Classe di laurea:	Nuovo ordinamento > Laurea magistrale > LM-32 - INGEGNERIA INFORMATICA
Ente in cotutela:	UNIVERSITE CATHOLIQUE DE LOUVAIN - ECOLE POLYTECHNIQUE (BELGIO)
Aziende collaboratrici:	Université catholique de Louvain
URI:	http://webthesis.biblio.polito.it/id/eprint/33149

Modifica (riservato agli operatori)