polito.it
Politecnico di Torino (logo)

Relevant Phrase Generation for Language Learners

Davide Pinti

Relevant Phrase Generation for Language Learners.

Rel. Luca Cagliero. Politecnico di Torino, Corso di laurea magistrale in Ingegneria Informatica (Computer Engineering), 2023

[img]
Preview
PDF (Tesi_di_laurea) - Tesi
Licenza: Creative Commons Attribution Non-commercial No Derivatives.

Download (1MB) | Preview
Abstract:

In recent times Artificial Intelligence, Natural Language Processing (NLP) especially, has spread widely. Nowadays, most people use it, either directly or indirectly, and you can find it almost everywhere: from social networks, to generating images based on text prompts, to the automatic grammar checker of written text. In the field of NLP, the generation of text through large language models (LLMs) is becoming more and more dominant, especially since the release of the Transformer in 2017. The objective of this study was to leverage the powerful tools of NLP to generate contextually appropriate English sentences for language learners, when given a specific English keyword as input. Other papers in the field of Intelligent Computer-Assisted Language Learning (ICALL) have generated examples for language learners by either retrieval- or ranking-based models, but this is the first time generative language models have been used in this context. We claim that our model developed in this project, SpeakEasy, is useful for language learners that are trying to learn a new language. The generated sentences have three important characteristics: (1) They are relevant in context to the keyword; (2) They are in ``simple'' English suitable for language learners; (3) They always include the \underline{exact} form of the keyword given in the prompt. We achieve this by first fine-tuning a GPT--2 model implemented by HuggingFace to a dataset of human-written sentences specifically tailored to language learners from \href{https://tatoeba.org/}{Tatoeba.org}. A decoding algorithm consisting of two steps was implemented. Initially, a shift to the probability distribution the context around the keyword was applied using Keyword2Text for controlled generation. Subsequently, the vocabulary is truncated to only include words that have an information content close to the language model itself, picked up during domain adaptation, using locally typical sampling. The generated sentences are similar to the human-written examples with a MAUVE score of 0.755 and an average cosine similarity of 0.577. Moreover, only 1.83\% of sentences generated were identical to entries in the dataset and a sentence took on average 0.45 seconds to be generated. Qualitative human evaluation showed that examples generated by SpeakEasy did not only beat a fine-tuned version of GPT--2 with hybrid top-$k$ and nucleus sampling scheme, but also out competing the human-written sentences with an average rank of 2.13 on a scale from 1 (best) to 4 (worst).

Relatori: Luca Cagliero
Anno accademico: 2022/23
Tipo di pubblicazione: Elettronica
Numero di pagine: 83
Soggetti:
Corso di laurea: Corso di laurea magistrale in Ingegneria Informatica (Computer Engineering)
Classe di laurea: Nuovo ordinamento > Laurea magistrale > LM-32 - INGEGNERIA INFORMATICA
Ente in cotutela: CTH - Chalmers Tekniska Hoegskola AB (SVEZIA)
Aziende collaboratrici: Chalmers University of Technology
URI: http://webthesis.biblio.polito.it/id/eprint/27675
Modifica (riservato agli operatori) Modifica (riservato agli operatori)