Luca Bergamini
GenAI-NewsScraper: Automated News Scraping, Summarization, Enrichment, and Multimodal Content Generation.
Supervisor: Riccardo Coppola. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2025
PDF (Tesi_di_laurea)
- Thesis
License: Creative Commons Attribution Non-commercial No Derivatives. Download (7MB)
| Abstract: |
The rapid growth of digital media has created a need for automated systems that can efficiently retrieve, process, and deliver news content. This thesis presents the design and implementation of a generative AI (GenAI) system for automated news scraping, content enrichment and summarization, and multimodal output generation, aimed at supporting scalable media workflows and interactive user experiences. The main objective is to develop an agent-based architecture that autonomously collects news from different sources, enriches it with related material, summarizes key information, and delivers results in text and audio formats. The system relies on a Model Context Protocol (MCP) server for orchestration, with modular tools for vector-based data storage, LLM-driven web search, and Text-to-Speech (TTS) synthesis. Structured web scraping, multi-document summarization, and vector embeddings (using PostgreSQL with pgvector) enable efficient data processing, while TTS supports automated podcast generation and interactive newsletters. The prototype demonstrates the system's ability to handle end-to-end news workflows with minimal human intervention. Cost and usage analysis indicates that, for a single user, daily operation (scraping, summarization, embeddings, and podcast generation) costs approximately $0.95, or $28.5 per month. For 100 users, the daily per-user cost drops to about $0.653 ($19.6 per month), and for 1000 users it falls further to $0.6503 ($19.5 per month), because the baseline scraping cost of $0.30 per day is shared among all users. Typical usage with one access per day costs $0.65–$0.75 per user per day, while high-activity scenarios with multiple accesses and queries raise costs to $1.55–$1.85 per day ($46.5–$55.5 per month). The analysis shows that per-user operational costs scale linearly with the number of users, whereas the scraping baseline is a fixed cost shared by all of them, which gives a basis for budgeting and resource planning.
In conclusion, this thesis contributes a practical framework for integrating LLMs, modular tools, and MCP-based orchestration into automated news pipelines. The system demonstrates scalability, multimodal capabilities, and cost efficiency, illustrating the potential of agent-based architectures for smart, interactive media platforms. |
|---|---|
| Supervisors: | Riccardo Coppola |
| Academic year: | 2025/26 |
| Publication type: | Electronic |
| Number of pages: | 72 |
| Subjects: | |
| Degree programme: | Master's degree programme in Computer Engineering (Ingegneria Informatica) |
| Degree class: | New system > Master's degree > LM-32 - INGEGNERIA INFORMATICA (Computer Engineering) |
| Collaborating companies: | DATA Reply S.r.l. con Unico Socio |
| URI: | http://webthesis.biblio.polito.it/id/eprint/37616 |
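
The per-user figures quoted in the abstract are consistent with a simple amortized cost model: a fixed scraping baseline of $0.30 per day split across all users, plus a per-user cost of roughly $0.65 per day. The sketch below reproduces that arithmetic. The $0.65 per-user component and the 30-day month are assumptions inferred from the quoted numbers ($0.95 − $0.30, and $0.95 × 30 = $28.5), and the function names are hypothetical; none of this is taken from the thesis text itself.

```python
# Amortized cost sketch inferred from the abstract's figures (not the thesis's own code).
SHARED_SCRAPING_PER_DAY = 0.30  # USD/day, stated in the abstract as the shared baseline
PER_USER_PER_DAY = 0.65         # USD/day, ASSUMPTION: 0.95 (single user) minus 0.30
DAYS_PER_MONTH = 30             # ASSUMPTION: implied by 0.95 * 30 = 28.5


def daily_cost_per_user(users: int) -> float:
    """Per-user daily cost when the scraping baseline is split across all users."""
    if users < 1:
        raise ValueError("users must be at least 1")
    return PER_USER_PER_DAY + SHARED_SCRAPING_PER_DAY / users


def monthly_cost_per_user(users: int) -> float:
    """Per-user monthly cost under the assumed 30-day month."""
    return daily_cost_per_user(users) * DAYS_PER_MONTH


if __name__ == "__main__":
    for n in (1, 100, 1000):
        print(f"{n:>5} users: ${daily_cost_per_user(n):.4f}/day, "
              f"${monthly_cost_per_user(n):.2f}/month per user")
```

Under these assumptions the per-user cost approaches the $0.65 floor as the user count grows, which matches the small difference the abstract reports between 100 and 1000 users.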



Creative Commons License - Attribution 3.0 Italy