
Transformers-based Abstractive Summarization for the Generation of Patent Claims

Sara Moreno

Transformers-based Abstractive Summarization for the Generation of Patent Claims. Supervisors: Luca Cagliero, Moreno La Quatra. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2023.

PDF (Tesi_di_laurea), 7 MB. License: Creative Commons Attribution Non-commercial No Derivatives.
Abstract:

A patent is a form of intellectual property that grants its owner the right to exclude others from making, selling, or using an invention for a fixed period of time in exchange for sufficient disclosure of the invention. Each patent consists of five main parts: abstract, background, summary, description, and claims. If the patent is accurately written, only the formulation of the claims is left to the Intellectual Property (IP) attorney. The most important claim is the first one: it consists of a single sentence that has to set out the distinctive features of the invention and highlight how it differs from inventions already present in the same or a similar field. Since the patent filing rate is constantly increasing, it is necessary to find ways to speed up the analysis of patents in order to keep up with innovation.

The goal of this thesis is to generate the first claim of a patent document using abstractive summarization techniques that can understand the context and meaning of the input text and produce fluent, coherent first claims; the IP attorney's task can in fact be framed as a summarization task. The study focuses on two main research questions: which patent sections are the most effective for generating the first claim, and how the length of the input text impacts the performance of the summarization models. This research could have significant implications for the legal and innovation communities by improving the accuracy and efficiency of automated patent claim generation. Seven different input texts are analyzed, including single sections as well as combinations of two sections. The boundaries of each section are marked with special tokens to help the models recognize the semantic content. To investigate the impact of context width, two models are compared: PEGASUS and BigBird-PEGASUS, both based on the transformer architecture but with different attention mechanisms, which gives them the ability to process documents of different lengths. The results are evaluated with two metrics: ROUGE, which favours syntactic similarity between the generated text and the ground truth, and BERTScore, which privileges semantic similarity.

Although ROUGE generally yields higher scores for texts generated with extractive summarization techniques, the obtained results are remarkably high for an abstractive summarization task and, in particular, higher than those reported in previous works. The results show that the choice of input section deeply affects first-claim generation performance: the best input text turns out to be the combination of summary and abstract, whereas the least informative section is the description. In all cases BigBird-PEGASUS, the model that processes longer documents, achieves higher performance, at the expense of a training time almost three times that of PEGASUS.
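The pipeline described in the abstract can be illustrated compactly. The following is a minimal sketch, not the thesis implementation, of how section-delimited input could be fed to a PEGASUS-family model with Hugging Face Transformers and how the generated claim could then be scored with ROUGE and BERTScore; the checkpoint names, the <summary>/<abstract> markers, the placeholder texts, and the length settings are assumptions made for the example.

```python
# Hypothetical sketch of the claim-generation and evaluation pipeline described above.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import evaluate

# "google/pegasus-large" handles roughly 1k input tokens; a BigBird-PEGASUS checkpoint
# such as "google/bigbird-pegasus-large-arxiv" uses sparse attention for longer inputs.
checkpoint = "google/pegasus-large"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Illustrative special tokens marking section boundaries in the input text.
tokenizer.add_special_tokens({"additional_special_tokens": ["<summary>", "<abstract>"]})
model.resize_token_embeddings(len(tokenizer))

source = "<summary> ...patent summary text... <abstract> ...patent abstract text..."
inputs = tokenizer(source, truncation=True, max_length=1024, return_tensors="pt")
claim_ids = model.generate(**inputs, num_beams=4, max_new_tokens=256)
generated_claim = tokenizer.decode(claim_ids[0], skip_special_tokens=True)

# Syntactic (ROUGE) and semantic (BERTScore) comparison against the gold first claim.
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")
reference_claim = "...ground-truth first claim..."
print(rouge.compute(predictions=[generated_claim], references=[reference_claim]))
print(bertscore.compute(predictions=[generated_claim], references=[reference_claim], lang="en"))
```

In the thesis the models are fine-tuned on patent data before generation; the sketch only shows the inference and evaluation steps with off-the-shelf checkpoints.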

Supervisors: Luca Cagliero, Moreno La Quatra
Academic year: 2022/23
Publication type: Electronic
Number of pages: 132
Subjects:
Degree programme: Master's degree programme in Data Science and Engineering
Degree class: New regulations > Master's degree > LM-32 - Computer Engineering
Collaborating companies: NOT SPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/26720