
Benyamin Zarei
AI-Driven Web Scraping: Designing and Optimizing a Data Extraction Tool for the U.S. Legal Sector.
Supervisor: Paolo Garza. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2025
Abstract: |
This study presents a data-driven approach to automating the extraction of structured legal profiles from the websites of the top 500 U.S. law firms. The research addresses the inefficiencies of manual data collection in legal talent acquisition by developing a scalable, AI-assisted web scraping system capable of retrieving and structuring large volumes of unstructured data. Because website architectures vary and many sites deploy anti-scraping mechanisms, a multi-layered web scraping pipeline was designed, incorporating requests, Cloudscraper, Playwright, and Selenium to ensure robust and adaptable content retrieval. The extracted HTML was processed by AI models, with structured prompt engineering techniques used to optimize model performance. Through extensive experimentation, OpenAI’s GPT-4o-mini was identified as the model offering the best balance of accuracy, processing speed, and cost. Beyond web scraping, this study developed preprocessing and validation techniques to ensure data consistency and completeness. vCard and PDF parsing algorithms were implemented to extract critical contact details such as email addresses and phone numbers, supplementing the AI-generated results. A post-processing validation framework was introduced to cross-check extracted job titles, practice areas, specialties, and industries against predefined lists, with secondary AI models further improving accuracy. The structured output was stored in CSV and Excel formats, and automated data-cleaning functions were used to correct missing nation values and normalize phone numbers. The system reduced the processing time per lawyer profile to a few seconds while maintaining data integrity. This research contributes to the field of automated data extraction and information retrieval, providing a solid methodology for leveraging AI and web scraping in structured data collection at scale. |
---|---|
Supervisors: | Paolo Garza |
Academic year: | 2024/25 |
Publication type: | Electronic |
Number of pages: | 60 |
Additional information: | Confidential thesis. Full text not available |
Subjects: | |
Degree programme: | Master's degree programme in Data Science And Engineering |
Degree class: | New system > Master's degree > LM-32 - COMPUTER ENGINEERING |
Collaborating companies: | TALENT ACQUISITION PARTNER SRL |
URI: | http://webthesis.biblio.polito.it/id/eprint/35390 |
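
The multi-layered retrieval described in the abstract can be pictured as a chain of fetchers tried from cheapest to heaviest. The sketch below is only an illustration under stated assumptions, since the thesis text is not available: the function names, the tier order, and the `looks_blocked` heuristic are all hypothetical. Selenium could be appended as a further tier in the same way.

```python
from typing import Callable, Iterable, Optional, Tuple

# A fetcher takes a URL and returns HTML, or raises on failure.
Fetcher = Callable[[str], str]


def looks_blocked(html: str) -> bool:
    """Assumed heuristic: empty pages and challenge pages count as failures."""
    text = html.strip().lower()
    return not text or "captcha" in text or "access denied" in text


def fetch_with_fallback(
    url: str, tiers: Iterable[Tuple[str, Fetcher]]
) -> Tuple[Optional[str], Optional[str]]:
    """Try each (name, fetcher) tier in order and return (tier_name, html).

    Returns (None, None) when every tier raises or looks blocked.
    """
    for name, fetch in tiers:
        try:
            html = fetch(url)
        except Exception:
            continue  # this tier failed; escalate to the next, heavier one
        if not looks_blocked(html):
            return name, html
    return None, None


def fetch_plain(url: str) -> str:
    """Cheapest tier: a plain HTTP GET."""
    import requests  # imported lazily so the sketch loads without it

    response = requests.get(url, timeout=15)
    response.raise_for_status()
    return response.text


def fetch_cloudscraper(url: str) -> str:
    """Middle tier: a session that attempts to pass Cloudflare challenges."""
    import cloudscraper

    scraper = cloudscraper.create_scraper()
    response = scraper.get(url, timeout=15)
    response.raise_for_status()
    return response.text


def fetch_playwright(url: str) -> str:
    """Heaviest tier here: render the page in a headless browser."""
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()


# Hypothetical ordering: cheapest first, browser automation last.
DEFAULT_TIERS = [
    ("requests", fetch_plain),
    ("cloudscraper", fetch_cloudscraper),
    ("playwright", fetch_playwright),
]
```

Calling `fetch_with_fallback(url, DEFAULT_TIERS)` returns the name of the first tier that produced usable HTML, which makes it easy to log how often each site needs the heavier tools.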
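
The abstract also mentions vCard parsing, phone-number normalization, and cross-checking extracted fields against predefined lists. The following is a minimal sketch of what such steps might look like, using only the Python standard library. Every function name, the sample vCard, the `+1XXXXXXXXXX` output format, and the allow-list are assumptions made for illustration and are not taken from the thesis.

```python
import re
from typing import Dict, List, Optional


def parse_vcard_contacts(vcard_text: str) -> Dict[str, List[str]]:
    """Collect EMAIL and TEL values from vCard text.

    Simplified on purpose: folded (continued) lines are not unfolded.
    """
    contacts: Dict[str, List[str]] = {"email": [], "tel": []}
    for line in vcard_text.splitlines():
        name_part, sep, value = line.partition(":")
        if not sep:
            continue
        # Drop ";PARAM=..." parameters and any "group." prefix from the name.
        prop = name_part.split(";")[0].split(".")[-1].upper()
        if prop == "EMAIL":
            contacts["email"].append(value.strip())
        elif prop == "TEL":
            contacts["tel"].append(value.strip())
    return contacts


def normalize_us_phone(raw: str) -> Optional[str]:
    """Return a U.S. number as +1 followed by 10 digits, or None if it does not fit.

    Numbers with extensions are rejected by this sketch rather than guessed at.
    """
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # strip the country code
    if len(digits) != 10:
        return None
    return f"+1{digits}"


def match_allowed(value: str, allowed: List[str]) -> Optional[str]:
    """Return the canonical allow-list entry matching value, ignoring case and
    extra whitespace; None flags the value for follow-up review."""
    key = " ".join(value.split()).casefold()
    for entry in allowed:
        if " ".join(entry.split()).casefold() == key:
            return entry
    return None


SAMPLE_VCARD = """BEGIN:VCARD
VERSION:3.0
FN:Jane Doe
EMAIL;TYPE=INTERNET:jane.doe@example.com
TEL;TYPE=WORK,VOICE:(212) 555-0147
END:VCARD"""

contacts = parse_vcard_contacts(SAMPLE_VCARD)
phones = [normalize_us_phone(tel) for tel in contacts["tel"]]
title = match_allowed("  partner ", ["Partner", "Associate", "Counsel"])
```

In a pipeline like the one described, values for which `match_allowed` returns `None` would be the natural candidates to hand to a secondary model or a manual check.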