AI-Driven Web Scraping: Designing and Optimizing a Data Extraction Tool for the U.S. Legal Sector

Benyamin Zarei

AI-Driven Web Scraping: Designing and Optimizing a Data Extraction Tool for the U.S. Legal Sector.

Supervisor: Paolo Garza. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2025

Abstract

This study presents a data-driven approach to automating the extraction of structured legal profiles from the websites of the top 500 U.S. law firms. The research addresses the inefficiencies of manual data collection in legal talent acquisition by developing a scalable, AI-assisted web scraping system capable of retrieving and structuring large volumes of unstructured data. Given the variability in website architectures and the presence of anti-scraping mechanisms, a multi-layered web scraping pipeline was designed, incorporating requests, Cloudscraper, Playwright, and Selenium to ensure robust and adaptable content retrieval. Extracted HTML data was processed using AI models, leveraging structured prompt engineering techniques to optimize model performance.
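The multi-layered retrieval idea can be illustrated with a minimal sketch of a fallback chain that tries each tool in turn until one returns usable HTML. The abstract names the four libraries but does not describe how they are ordered or when the pipeline moves from one to the next, so the cheapest-first ordering, the `looks_blocked` heuristic, the timeouts, and all function names below are assumptions made for illustration, not the thesis implementation.

```python
from typing import Callable, Optional, Sequence, Tuple

Fetcher = Callable[[str], Optional[str]]


def looks_blocked(html: Optional[str], min_length: int = 200) -> bool:
    """Hypothetical heuristic: empty, very short, or challenge pages count as failures."""
    if not html or len(html) < min_length:
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in ("captcha", "access denied"))


def fetch_with_fallback(
    url: str, fetchers: Sequence[Tuple[str, Fetcher]]
) -> Tuple[str, str]:
    """Try each (name, fetcher) in order; return (name, html) for the first usable page."""
    errors = []
    for name, fetch in fetchers:
        try:
            html = fetch(url)
        except Exception as exc:  # any layer may fail; record it and move on
            errors.append(f"{name}: {exc!r}")
            continue
        if looks_blocked(html):
            errors.append(f"{name}: blocked or empty response")
            continue
        return name, html
    raise RuntimeError(f"all fetchers failed for {url}: " + "; ".join(errors))


# Third-party libraries are imported lazily so the sketch loads without them.
def fetch_with_requests(url: str) -> str:
    import requests
    response = requests.get(url, timeout=15)
    response.raise_for_status()
    return response.text


def fetch_with_cloudscraper(url: str) -> str:
    import cloudscraper
    response = cloudscraper.create_scraper().get(url, timeout=15)
    response.raise_for_status()
    return response.text


def fetch_with_playwright(url: str) -> str:
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, timeout=30000)
            return page.content()  # HTML after JavaScript rendering
        finally:
            browser.close()


def fetch_with_selenium(url: str) -> str:
    from selenium import webdriver
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


# Assumed order: plain HTTP first, full browsers last because they cost the most.
DEFAULT_CHAIN = [
    ("requests", fetch_with_requests),
    ("cloudscraper", fetch_with_cloudscraper),
    ("playwright", fetch_with_playwright),
    ("selenium", fetch_with_selenium),
]
```

Calling `fetch_with_fallback(url, DEFAULT_CHAIN)` would return the name of the layer that succeeded along with the page HTML, which could then be passed to the AI processing stage. Keeping the fetchers as plain callables means a layer can be reordered, removed, or replaced by a stub without touching the fallback logic.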

Through extensive experimentation, OpenAI’s GPT-4o-mini was identified as the most effective model in balancing accuracy, processing speed, and cost

Publication type

Electronic

URI

https://webthesis.biblio.polito.it/id/eprint/35390
