Measuring Topic-Specific Semantic Information in Product Labels: An Embedding-Based Approach

Salvatore Latino


Supervisors: Luca Cagliero, Vito De Feo. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2025

PDF (Tesi_di_laurea) - Thesis
Access restricted to: staff users only until 24 April 2027 (embargo date).
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:	Measuring how much semantic information a text conveys remains an open challenge: classical information theory quantifies uncertainty reduction but is agnostic to meaning. This thesis proposes a practical approach to quantifying topic-specific semantic information in short texts. We target the domain of product labels and focus on environmental information, a context of high societal and regulatory relevance (e.g., Agenda 2030 and the forthcoming Digital Product Passport). We chose this domain because product labels usually contain concise, well-defined statements, often limited to a single claim, which minimizes ambiguity and makes them well suited to the quantitative analysis of semantic information. Our key idea is to estimate information coverage with respect to reference sentences that are crafted in accordance with the Green Claims Directive issued by the European Union and are assumed to be maximally informative. Candidate sentences are embedded with sentence-transformer models into a shared embedding space, where clustering is applied and distance metrics are used to assign an informativeness score to each sentence; the sentence scores are then aggregated into a label-level score. To operationalize this, we construct two resources: (i) a nucleus dataset of high-information reference sentences, designed to reflect best practices in environmental disclosure; and (ii) a product-label test dataset, transcribed from real packaging. Because no public gold standard exists for this task, we designed an online questionnaire and collected human judgments of environmental informativeness on the test set; model scores are then assessed by their correlation with these ratings. The framework is modular (choice of encoder, clustering algorithm, distance metric and corresponding thresholds) and requires only topic-defining references to adapt to new subtopics.
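The scoring idea described above can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not specify the encoder, the clustering step, the distance metric, the thresholds, or the aggregation rule. The sketch assumes cosine similarity to the nearest reference sentence, a hypothetical threshold of 0.5, and a plain mean as the label-level aggregate, and it omits clustering entirely. Toy 3-dimensional vectors stand in for sentence-transformer embeddings, and all function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_score(embedding, reference_embeddings, threshold=0.5):
    """Score a sentence by its most similar reference; scores below the threshold become 0.

    The threshold value is an assumption made for this sketch.
    """
    best = max(cosine_similarity(embedding, ref) for ref in reference_embeddings)
    return best if best >= threshold else 0.0


def label_score(sentence_embeddings, reference_embeddings, threshold=0.5):
    """Aggregate sentence scores into one label-level score (assumed here: the mean)."""
    if not sentence_embeddings:
        return 0.0
    scores = [
        sentence_score(e, reference_embeddings, threshold)
        for e in sentence_embeddings
    ]
    return sum(scores) / len(scores)


# Toy vectors standing in for real sentence embeddings (hypothetical data).
references = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # "nucleus" reference sentences
label = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]       # two sentences from one label
score = label_score(label, references)
print(score)
```

In this toy case the first sentence coincides with a reference and the second is orthogonal to both, so the label receives half of the maximum score. Swapping the encoder, metric, or aggregation rule only changes the corresponding function, which mirrors the modularity the abstract describes.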
Empirically, the approach yields a mean Pearson correlation of 0.676 between model-generated and human assessments across categories, indicating that the proposed methodology captures a substantial portion of what people regard as environmentally informative content. Qualitative analyses show that the score rewards precise, reference-aligned claims (e.g., quantified impacts, certified materials, lifecycle coverage) and penalizes vague or unsupported statements.
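As an illustration of the evaluation step, the sketch below computes a Pearson correlation per category between model scores and human ratings and then averages the coefficients. The category names and all numbers are hypothetical placeholders, not data from the thesis, and the thesis may compute or average its correlations differently.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (model scores, human ratings) pairs per product category.
by_category = {
    "category_a": ([0.1, 0.4, 0.8], [1.0, 2.5, 4.5]),
    "category_b": ([0.2, 0.5, 0.9], [2.0, 2.0, 5.0]),
}

per_category = {
    name: pearson(model, human) for name, (model, human) in by_category.items()
}
mean_r = sum(per_category.values()) / len(per_category)
print(per_category, mean_r)
```

Averaging per-category coefficients weights every category equally regardless of how many labels it contains; pooling all labels into one correlation would be a different design choice.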
Supervisors:	Luca Cagliero, Vito De Feo
Academic year:	2025/26
Publication type:	Electronic
Number of pages:	99
Subjects:
Degree programme:	Master's degree programme in Computer Engineering (Ingegneria Informatica)
Degree class:	New system > Master's degree > LM-32 - INGEGNERIA INFORMATICA (Computer Engineering)
Collaborating companies:	NOT SPECIFIED
URI:	http://webthesis.biblio.polito.it/id/eprint/37718
