Generative AI for Real-Time Image Captioning on Embedded Neural Processing Unit

Marco Donnarumma

Supervisors: Carlo Masone, Ilario Gerlero, Marcello Babbi. Politecnico di Torino, Master's degree course in Computer Engineering (Ingegneria Informatica), 2025

Abstract:

The emergence of Edge AI has opened new possibilities for deploying machine learning models on power-constrained devices, enabling real-time, private, and efficient inference directly on embedded platforms. Among edge workloads, image captioning remains particularly challenging due to the computational demands of vision-language (VL) models, which typically rely on large-scale transformer architectures. This thesis addresses the gap between high-performing generative captioning and efficient edge deployment by adapting Microsoft's GIT-Base model for execution on the Hailo-8 Neural Processing Unit (NPU). We present a full pipeline for real-time image-to-text generation on an embedded system based on the i.MX8M Plus SoC and the Hailo-8 NPU. To make the GIT-Base model compatible with the stringent constraints of edge hardware, we optimized both the encoder and the decoder and thoroughly reengineered the decoder architecture. This included fixed-point quantization (INT8 and INT16), approximation of unsupported operations, and the introduction of input padding to decouple inference from data-dependent shape variability. The existing attention mask logic was adapted to avoid interfering with quantization, and parts of the attention computation were modified to ensure numerical stability under low-precision arithmetic. Additionally, we redesigned the final projection layer to efficiently support a large vocabulary within Hailo-8's resource limits. The final system achieves near real-time inference within a 1–2 W power budget and delivers competitive captioning performance against existing edge-deployable models on standard captioning metrics, including CIDEr, BLEU@4, METEOR, and SPICE. Through extensive benchmarking in CPU-only, hybrid, and full-NPU configurations, we demonstrate that medium-sized transformer-based VL models can be deployed on embedded hardware without markedly compromising the expressiveness or fluency of generated captions.
Future work will focus on extending this pipeline toward video captioning and deploying large language models (LLMs) on more powerful NPUs, to broaden the scope and capabilities of embedded generative AI.
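
The input padding and attention mask adaptation mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation, whose full text is not available here. It only shows the general idea of padding a variable-length token sequence to a fixed length so the decoder always sees the same input shape, together with an additive mask that hides padded and future positions. The names `MAX_LEN`, `PAD_ID`, `MASK_VALUE`, `pad_tokens`, and `causal_padding_mask` are hypothetical. Using a finite mask value instead of negative infinity is an assumption about what keeps a mask friendly to fixed-point quantization.

```python
from typing import List, Tuple

MAX_LEN = 8         # hypothetical fixed decoder sequence length
PAD_ID = 0          # hypothetical padding token id
MASK_VALUE = -1e4   # assumption: a finite stand-in for -inf in the additive mask


def pad_tokens(token_ids: List[int],
               max_len: int = MAX_LEN,
               pad_id: int = PAD_ID) -> Tuple[List[int], int]:
    """Pad token_ids to max_len; return the padded list and the valid length."""
    valid_len = len(token_ids)
    if valid_len > max_len:
        raise ValueError("sequence is longer than the fixed input length")
    return token_ids + [pad_id] * (max_len - valid_len), valid_len


def causal_padding_mask(valid_len: int,
                        max_len: int = MAX_LEN,
                        mask_value: float = MASK_VALUE) -> List[List[float]]:
    """Build a max_len x max_len additive attention mask.

    Entry [i][j] is 0.0 when query position i may attend to key position j
    (j is not in the future and not padding), and mask_value otherwise.
    """
    return [
        [0.0 if (j <= i and j < valid_len) else mask_value
         for j in range(max_len)]
        for i in range(max_len)
    ]


# Usage: the shapes are identical regardless of how many tokens are valid.
padded, n_valid = pad_tokens([101, 7, 42])
mask = causal_padding_mask(n_valid)
```

Because the padded sequence and the mask always have the same dimensions, a model compiled for a static input shape can run at every decoding step, and the mask is added to the attention scores before the softmax so that padded positions receive negligible weight.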

Supervisors: Carlo Masone, Ilario Gerlero, Marcello Babbi
Academic year: 2024/25
Publication type: Electronic
Number of pages: 109
Additional information: Thesis under embargo. Full text not available
Subjects:
Degree course: Master's degree course in Computer Engineering (Ingegneria Informatica)
Degree class: New regulations > Master's degree > LM-32 - INGEGNERIA INFORMATICA (Computer Engineering)
Collaborating companies: SENSOR REPLY S.R.L. CON UNICO SOCIO
URI: http://webthesis.biblio.polito.it/id/eprint/36387