Exploring Domain-Adapted LLMs for Crash Narrative Information Extraction

Mattia Carlino

Exploring Domain-Adapted LLMs for Crash Narrative Information Extraction.

Supervisor: Flavio Giobergia. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2025

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract

Free-text crash narratives recorded in real-world crash databases have been shown to play a significant role in improving traffic safety, but they remain challenging to analyze at scale because of unstructured writing, heterogeneous terminology, and uneven levels of detail. Large Language Models (LLMs) offer a promising way to extract information from narratives automatically by posing questions about them. However, crash narratives remain hard for LLMs to analyze because the models lack traffic safety domain knowledge. Moreover, relying on closed-source LLMs through external APIs poses privacy risks for crash data, and such models often underperform because of their limited traffic knowledge. Motivated by these concerns, we study whether smaller open-source LLMs can support reasoning-intensive extraction from crash narratives, targeting three challenging objectives: extracting the travel direction of each vehicle involved in the crash, identifying the manner of collision, and classifying the crash type in multi-vehicle scenarios that require accurate per-vehicle prediction.

In the first phase of the experiments, we focused on extracting vehicle travel directions by comparing small LLMs with 8 billion parameters (Mistral, DeepSeek, and Qwen) under different prompting strategies against fine-tuned transformers (BERT, RoBERTa, and SciBERT) on a manually labeled subset of the Crash Investigation Sampling System (CISS) dataset