
Itemset-based document summarization of multilingual collections driven by pre-trained word vectors

Zhao, Yifu

Itemset-based document summarization of multilingual collections driven by pre-trained word vectors.

Supervisor: Luca Cagliero. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2018

PDF (Tesi_di_laurea) - Thesis
Document access: Open access
License: Creative Commons Attribution Non-commercial No Derivatives.


With the steady progress of information technology, it has become much easier for people to create, store, and disseminate information in electronic form. Billions of Internet users create quintillions of bytes every day. Although this wealth of information benefits people on several levels, the amount of information and knowledge is growing exponentially, which makes it difficult to find useful information. To use information efficiently, a viable way to extract the critical content from a large collection of documents is to automatically generate readable, concise summaries containing the most relevant information. An automatic summary can gather the most relevant facts and common views in a few sentences, so that readers do not get lost in the large set of original texts. Text mining, a branch of data mining, refers to the acquisition of valuable information and knowledge from text data. Its most important and basic applications are text classification and text clustering: the former is a supervised mining task and the latter an unsupervised one. Text mining draws on a broad range of disciplines: it combines probability theory and statistical analysis with data mining and machine learning techniques, and it is applied in natural language processing and information extraction. Many studies are working to introduce new technologies that improve text mining. Although text mining performs well in many fields, in the summarization task it has limited information to work with because the datasets are usually quite small (only a few kilobytes). Meanwhile, machine learning algorithms for text have sprung up. Word embedding technology makes it possible for computers to understand words.
By converting words into vectors that express word features, computers can better capture the relationships between words and between words and paragraphs. The goal of this thesis work is to exploit the power of word embeddings to improve multilingual summarization performance. Specifically, it aims at integrating pre-trained word vector information into a state-of-the-art multilingual summarization approach. The summarizer generates a summary of a collection of textual documents consisting of a selection of the most significant sentences. It analyzes the correlation among multiple words in the text to drive sentence selection. Based on word embeddings, we are able to discriminate between significant and non-significant sentences according to the relevance of the words they contain. The proposed methods were applied to the DUC’04, TAC’11, and MultiLing’13 corpora, and the generated summaries were evaluated according to the corresponding standards and then compared with the scores of other state-of-the-art summarizers. The results show that using word vectors in the pre-processing stage effectively improves the results, whereas using them in the sentence selection stage does not achieve the expected results. With word vectors used in the pre-processing stage, the experimental results reach the top level. This method is stable in a multilingual environment and is not affected by the language type.
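As a rough illustration of how word vectors can express relatedness and help separate significant sentences from less significant ones, the sketch below ranks sentences by the cosine similarity between each sentence's mean word vector and the mean vector of the whole collection. This is a simple centroid heuristic chosen for illustration only; it is not the itemset-based method of the thesis, which the abstract does not describe in detail. The vectors in `TOY_VECTORS` are made-up three-dimensional values standing in for real pre-trained embeddings, and the function names are hypothetical.

```python
import math

# Made-up 3-dimensional vectors standing in for pre-trained embeddings
# (assumption: real embeddings have hundreds of dimensions and are loaded from a file).
TOY_VECTORS = {
    "flood": [0.9, 0.1, 0.0],
    "rain": [0.8, 0.2, 0.1],
    "river": [0.7, 0.3, 0.0],
    "damage": [0.6, 0.1, 0.3],
    "football": [0.0, 0.9, 0.2],
    "match": [0.1, 0.8, 0.3],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(words):
    """Average the vectors of the known words; return None if no word is known."""
    known = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def rank_sentences(sentences):
    """Return (score, sentence) pairs sorted from most to least similar to the centroid."""
    all_words = [w for s in sentences for w in s.lower().split()]
    centroid = mean_vector(all_words)
    scored = []
    for s in sentences:
        vec = mean_vector(s.lower().split())
        if vec is None or centroid is None:
            score = 0.0
        else:
            score = cosine(vec, centroid)
        scored.append((score, s))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


sentences = [
    "rain flood river",
    "flood damage river",
    "football match",
]
ranking = rank_sentences(sentences)
for score, sentence in ranking:
    print(f"{score:.3f}  {sentence}")
```

In this toy collection the two flood-related sentences dominate the centroid, so the off-topic sentence is ranked last. A real pipeline would replace `TOY_VECTORS` with vectors loaded from a pre-trained model and would need proper tokenization for each language.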

Supervisors: Luca Cagliero
Academic year: 2018/19
Publication type: Electronic
Degree programme: Master's degree programme in Computer Engineering (Ingegneria Informatica)
Degree class: New system > Master's degree > LM-32 - Computer Engineering (INGEGNERIA INFORMATICA)
Collaborating companies: Politecnico di Torino
URI: http://webthesis.biblio.polito.it/id/eprint/9503