Benyamin Zarei
AI-Driven Web Scraping: Designing and Optimizing a Data Extraction Tool for the U.S. Legal Sector.
Supervisor: Paolo Garza. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2025
Abstract
This study presents a data-driven approach to automating the extraction of structured legal profiles from the websites of the top 500 U.S. law firms. The research addresses the inefficiencies of manual data collection in legal talent acquisition by developing a scalable, AI-assisted web scraping system capable of retrieving and structuring large volumes of unstructured data. Given the variability in website architectures and the presence of anti-scraping mechanisms, a multi-layered web scraping pipeline was designed, incorporating requests, Cloudscraper, Playwright, and Selenium to ensure robust and adaptable content retrieval. Extracted HTML data was processed using AI models, leveraging structured prompt engineering techniques to optimize model performance.
Extensive experimentation identified OpenAI's GPT-4o-mini as the model that best balanced accuracy, processing speed, and cost.
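The multi-layered retrieval idea described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract names the four tools but not their order or the fallback logic, so the cheapest-to-heaviest ordering, the `fetch_html` and `looks_usable` helpers, the 30-second timeouts, and the 500-character usability threshold are all assumptions made for illustration. Third-party libraries are imported lazily inside each layer, so only the layers actually used need to be installed.

```python
from typing import Callable, Iterable, Tuple

Fetcher = Callable[[str], str]


def fetch_with_requests(url: str) -> str:
    import requests  # plain HTTP client, cheapest layer
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def fetch_with_cloudscraper(url: str) -> str:
    import cloudscraper  # requests-compatible session for anti-bot challenge pages
    scraper = cloudscraper.create_scraper()
    response = scraper.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def fetch_with_playwright(url: str) -> str:
    from playwright.sync_api import sync_playwright  # headless browser, renders JavaScript
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, timeout=30000)  # timeout in milliseconds
            return page.content()
        finally:
            browser.close()


def fetch_with_selenium(url: str) -> str:
    from selenium import webdriver  # full browser automation, last resort
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


# Assumed ordering: lightest tool first, heaviest last.
DEFAULT_LAYERS: Tuple[Tuple[str, Fetcher], ...] = (
    ("requests", fetch_with_requests),
    ("cloudscraper", fetch_with_cloudscraper),
    ("playwright", fetch_with_playwright),
    ("selenium", fetch_with_selenium),
)


def looks_usable(html: str, min_length: int = 500) -> bool:
    """Hypothetical heuristic: treat very short responses as blocked or empty pages."""
    return len(html.strip()) >= min_length


def fetch_html(
    url: str,
    layers: Iterable[Tuple[str, Fetcher]] = DEFAULT_LAYERS,
    is_usable: Callable[[str], bool] = looks_usable,
) -> Tuple[str, str]:
    """Try each layer in order; return (layer_name, html) from the first usable result."""
    errors = []
    for name, fetcher in layers:
        try:
            html = fetcher(url)
        except Exception as exc:  # any failure moves on to the next layer
            errors.append(f"{name}: {exc!r}")
            continue
        if is_usable(html):
            return name, html
        errors.append(f"{name}: response rejected by usability check")
    raise RuntimeError(f"all layers failed for {url}: " + "; ".join(errors))
```

Calling `fetch_html(url)` returns the name of the layer that succeeded together with the HTML, which makes it possible to log how often each fallback is needed. Passing a custom `layers` sequence allows individual sites to skip straight to a browser-based layer.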
