MID: A New Strategy for Learning Optimal Decision Trees on Continuous Data

Antonio Dal Maso

MID: A New Strategy for Learning Optimal Decision Trees on Continuous Data.

Supervisors: Elena Maria Baralis, Siegfried Nijssen. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2024

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract

Unlike greedy methods, Optimal Decision Tree (ODT) algorithms are designed to find the best decision tree on the training data subject to constraints, such as a limit on the depth of the tree. Because these techniques are typically designed for binary datasets, continuous features often have to be discretized before learning, a step that can significantly affect the quality and efficiency of the resulting decision tree. Discretizers give no guarantee that the resulting tree is optimal on the training data, and they often make it infeasible in practice to find an ODT. This thesis presents a new approach for learning decision trees on continuous data that combines a new discretization algorithm, MID (an acronym for "Minimum Impurity Discretizer"), with ODT learners.

The core idea behind MID is to create a ranked list of binary features, obtained by considering all potential thresholds across all features of a continuous dataset, so that an ODT algorithm can repeatedly be run for a growing number of binary features
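
To make the idea of a ranked list of binary features concrete, the following is a minimal Python sketch and not the thesis's implementation. The abstract does not state how candidates are ranked, so the weighted Gini impurity of each split is used here purely as an assumption suggested by the name "Minimum Impurity Discretizer". The function names (`gini`, `candidate_thresholds`, `ranked_binary_features`, `binarize`) and the toy data are hypothetical, and the ODT learner itself is left as a comment.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def candidate_thresholds(column):
    """Midpoints between consecutive distinct values of one continuous feature."""
    values = sorted(set(column))
    return [(a + b) / 2 for a, b in zip(values, values[1:])]


def ranked_binary_features(X, y):
    """Score every (feature, threshold) pair and sort by increasing impurity.

    Assumption: the score is the size-weighted Gini impurity of the split
    `x[j] <= t`; the thesis may use a different criterion.
    """
    n = len(y)
    ranked = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        for t in candidate_thresholds(column):
            left = [label for v, label in zip(column, y) if v <= t]
            right = [label for v, label in zip(column, y) if v > t]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            ranked.append((impurity, j, t))
    ranked.sort()
    return ranked


def binarize(X, ranked, k):
    """Encode each row with the k best-ranked binary features."""
    top = ranked[:k]
    return [[int(row[j] <= t) for _, j, t in top] for row in X]


# Toy dataset (hypothetical): two continuous features, binary labels.
X = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [4.0, 40.0]]
y = [0, 0, 1, 1]

ranked = ranked_binary_features(X, y)
for k in range(1, len(ranked) + 1):
    Xb = binarize(X, ranked, k)
    # An ODT learner would be run on (Xb, y) here, once per value of k.
```

Because the binary features are ranked once, each run for a larger `k` reuses the same prefix of the list and only adds columns to the binary dataset.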