Spiking Neural Networks for Speech Recognition: integration of spiking neurons in a sequence‑to‑sequence architecture

Vittorio Frangipani

Supervisors: Stefano Di Carlo, Alessandro Savino, Filippo Marostica, Alessio Caviglia. Politecnico di Torino, Master's degree programme in Electronic Engineering (Ingegneria Elettronica), 2025

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

Speech recognition is progressively shifting toward edge devices, where low energy consumption and lightweight models are essential. Spiking Neural Networks (SNNs) are promising candidates because of their potential for energy efficiency and their inherent ability to capture temporal dynamics. However, their sequential nature often leads to long training times, and the spiking format of their inputs and outputs requires dedicated strategies for information encoding and decoding. Moreover, SNNs are still not widely adopted, and relatively few studies investigate their application to speech recognition. The goal of this thesis is to evaluate how spiking networks affect speech recognition tasks. Starting from a known sequence-to-sequence architecture, modifications are introduced, including the integration of Spiking Long Short-Term Memory (SLSTM) layers and spiking convolutional layers. These components are analyzed in terms of both training time and recognition accuracy. For decoding, the count rate technique, which is based on the neurons' firing rates, is employed. To interface Artificial Neural Network (ANN) and SNN components, two strategies are explored: a duplication-based CNN/SLSTM interface, found to be unsustainable when introducing Spiking Convolutional Neural Networks (SCNNs), and a convolution-plus-Leaky Integrate-and-Fire (LIF) interface for the data-to-SCNN connection, which proves more efficient in terms of training time. The results, along with the intermediate stages of the proposed hybrid SNN–ANN architecture evaluated on the LibriSpeech dataset, are presented and discussed. The experiments show that, although introducing SLSTM layers initially degrades performance (Word Error Rate rises from 10.61% to 22.16%), the use of spiking convolutional layers recovers part of this loss, improving the overall recognition accuracy.
Finally, potential future directions are outlined, focusing on the integration of binary convolutional layers and the development of fully spiking encoders, along with preliminary implementation insights.
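
As an illustration of two ideas named in the abstract, the following is a minimal NumPy sketch of a Leaky Integrate-and-Fire (LIF) layer followed by count rate decoding. It is not the thesis implementation: the function names `lif_layer` and `count_rate_decode`, the decay factor `beta`, the threshold, and the subtractive reset are all assumptions chosen for illustration. In a convolution-plus-LIF interface, the input currents would come from a convolutional layer; here they are fixed constants.

```python
import numpy as np

def lif_layer(currents, beta=0.9, threshold=1.0):
    """Simulate a layer of LIF neurons (illustrative parameters, not from the thesis).

    currents: array of shape (num_steps, num_neurons), input current per time step.
    Returns a binary spike array of the same shape.
    """
    num_steps, num_neurons = currents.shape
    membrane = np.zeros(num_neurons)
    spikes = np.zeros((num_steps, num_neurons))
    for t in range(num_steps):
        # Leaky integration: the potential decays by beta, then adds the input.
        membrane = beta * membrane + currents[t]
        fired = membrane >= threshold
        spikes[t] = fired
        # Subtractive ("soft") reset for neurons that fired (an assumed choice).
        membrane = np.where(fired, membrane - threshold, membrane)
    return spikes

def count_rate_decode(spikes):
    """Pick the output neuron that emitted the most spikes over the time window."""
    counts = spikes.sum(axis=0)
    return int(np.argmax(counts)), counts

# Three output neurons driven by constant currents for 20 time steps.
currents = np.tile(np.array([0.1, 0.6, 0.3]), (20, 1))
spikes = lif_layer(currents)
predicted, counts = count_rate_decode(spikes)
print(predicted, counts)
```

In this toy setting the neuron receiving the largest current fires most often, so the spike count acts as a score per output class; a trained network would learn weights that produce such currents from the speech features.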

Supervisors: Stefano Di Carlo, Alessandro Savino, Filippo Marostica, Alessio Caviglia
Academic year: 2025/26
Publication type: Electronic
Number of pages: 74
Subjects:
Degree programme: Master's degree programme in Electronic Engineering (Ingegneria Elettronica)
Degree class: New regulations > Master's degree > LM-29 - INGEGNERIA ELETTRONICA (Electronic Engineering)
Partner companies: NOT SPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/38726