Alessandro Nori
Company entities matching framework powered by machine learning.
Supervisor: Paolo Garza. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2020
Abstract: |
Data matching is an essential process for any enterprise that constantly acquires new data from different systems, both structured and unstructured. It is typically used to remove duplicates from a database, or to avoid creating accounts that already exist when the two databases share no common key. Because the data comes from different sources, an extensive data cleaning and standardization step is needed so that similarity measures between records are more accurate and more representative of reality. It is also important to apply input reduction techniques, such as blocking predicates, to reduce the otherwise extremely large number of records compared. The total number of record pairs in a database is proportional to the square of its size: a source of 100 thousand records generates 10 billion possible pairs. If feature generation and the subsequent classification took only 1 ms per pair, the entire process would need more than 115 days to emit all the results. Our framework filters out a large part of those pairs through selective blocking predicates. These rely on the assumption that real matches are a small fraction of all possible pairs, which makes it possible to immediately discard obvious non-matches that do not satisfy some hard-coded rules. Behind the core step of our data matching tool, the classification of pairs, a machine learning model is in charge of accumulating new data and continuously improving the decisions taken. The choice of machine learning rests on the idea of continuous improvement and on its ability to extract patterns that humans do not easily recognize and that are sometimes difficult to express as rules. The framework collects labeled data every time a user processes a new task, since in most cases the user is asked to review pairs whose classification is uncertain. 
What distinguishes our tool from existing ones is its specialization in business partner entity matching, where company name, location and industry type are taken into account, together with other domain-specific attributes. Being aware of such fields helps the creation of ad-hoc features and similarities, based on the domain of each of them (e.g. industry type similarity can only work if computed on a common value space). Together, these aspects allow a fast data matching process and achieve very high precision on the records classified as matches. The evaluation was performed on real-world datasets and on real user scenarios. |
---|---|
Supervisors: | Paolo Garza |
Academic year: | 2019/20 |
Publication type: | Electronic |
Number of pages: | 74 |
Additional information: | Confidential thesis. Full text not available |
Subjects: | |
Degree programme: | Master's degree programme in Computer Engineering (Ingegneria Informatica) |
Degree class: | New system > Master's degree > LM-32 - COMPUTER ENGINEERING |
Joint supervision institution: | TELECOM ParisTech - EURECOM (FRANCE) |
Collaborating companies: | SAP France SA |
URI: | http://webthesis.biblio.polito.it/id/eprint/14375 |
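
The pair-count estimate and the blocking idea described in the abstract can be illustrated with a minimal Python sketch. The thesis itself is confidential, so nothing here reflects its actual implementation: the record fields (`name`, `country`), the blocking rule (same country and same first three letters of the name), and every function name are hypothetical choices made only for illustration.

```python
from collections import defaultdict
from itertools import combinations

MS_PER_DAY = 24 * 60 * 60 * 1000


def full_comparison_days(n_records, ms_per_pair=1.0):
    """Days needed to score all n^2 pairs, following the abstract's estimate."""
    return n_records ** 2 * ms_per_pair / MS_PER_DAY


def blocking_key(record):
    """Hypothetical predicate: same country and same 3-letter name prefix."""
    return (record["country"], record["name"].strip().lower()[:3])


def candidate_pairs(records):
    """Return index pairs of records that share a blocking key.

    Records in different blocks are never compared, which is how
    obvious non-matches are discarded before classification.
    """
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[blocking_key(record)].append(idx)

    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


# Illustrative records only, not data from the thesis.
records = [
    {"name": "SAP France SA", "country": "FR"},
    {"name": "sap france", "country": "FR"},
    {"name": "Siemens AG", "country": "DE"},
    {"name": "Société Générale", "country": "FR"},
]

pairs = candidate_pairs(records)
days = full_comparison_days(100_000)
```

With these four records only the two SAP entries share a block, so a single pair reaches the classifier instead of all six unordered pairs. A stricter predicate discards more pairs but risks dropping true matches whose keys differ, for example because of a typo in the first letters of a name.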