Company entities matching framework powered by machine learning

Alessandro Nori

Company entities matching framework powered by machine learning.

Supervisor: Paolo Garza. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2020

Abstract

Data matching is an essential process for any enterprise that constantly acquires new data from different systems, both structured and unstructured. It is typically used to remove duplicates from a database, or to avoid creating accounts that already exist when the two databases share no common key. Because the data comes from different sources, an extensive data cleaning and standardization step is needed so that the similarity measures computed between records better reflect reality. It is also important to apply input reduction techniques, such as blocking predicates, to limit the number of record comparisons, which would otherwise be extremely large.
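
As an illustration of how a blocking predicate reduces the number of comparisons, the following is a minimal Python sketch and not the framework described in the thesis. The cleaning rule in `normalize`, the three-character prefix used in `blocking_key`, and the sample company names are all assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import combinations


def normalize(name: str) -> str:
    """Illustrative cleaning step: lowercase, replace punctuation with spaces,
    and collapse repeated whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def blocking_key(name: str) -> str:
    """Hypothetical blocking predicate: the first three characters of the
    normalized name. Real predicates are usually chosen or learned per dataset."""
    return normalize(name)[:3]


def candidate_pairs(records: list[str]) -> set[tuple[int, int]]:
    """Return index pairs of records that fall into the same block."""
    blocks = defaultdict(list)
    for idx, name in enumerate(records):
        blocks[blocking_key(name)].append(idx)
    pairs = set()
    for members in blocks.values():
        # Only records sharing a blocking key are ever compared.
        pairs.update(combinations(members, 2))
    return pairs


# Made-up sample data, for illustration only.
records = [
    "Acme Corp.",
    "ACME Corporation",
    "Globex S.p.A.",
    "Globex Spa",
    "Initech",
]

all_pairs = len(records) * (len(records) - 1) // 2  # every unordered pair
blocked = candidate_pairs(records)
print(all_pairs, len(blocked))
```

Only the pairs that survive blocking would then be passed to the more expensive similarity computation. A prefix-based key like this one is cheap but can miss matches whose names differ in the first characters, so in practice several predicates are often combined.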

The total number of record pairs in a database grows with the square of its size: a source of 100 thousand records generates on the order of 10 billion possible pairs