Web Scraping System

A web scraping system that automates data collection for a certification company

Web Scraping · Data Analysis · DevOps · Automation · Proxy Rotation · Distributed Computing
Technologies
Python · Requests · BeautifulSoup · Pandas · PyPDF · Selenium

WebScraping SOA – Automated Attestation Harvester

Author: Gabriele Firriolo


Overview & Motivation

WebScraping SOA is a Python-driven platform that assembles public attestations at scale. A lightweight orchestration layer keeps operators focused on region-wide or company-specific audits, while the backend automates navigation, document download, PDF interpretation, and the creation of curated CSV batches for downstream analytics. The emphasis is on high throughput, full traceability, and compliance with the automation allowances exposed by the service.

Architecture Highlights

Acquisition Layer

  • Selenium workers browse the dynamic single-page application exactly as a human would, ensuring every lazy-loaded panel is rendered before extraction.
  • Rotating user agents and optional proxy pools distribute traffic across endpoints so no origin is saturated.
  • Small randomized delays between actions keep the cadence respectful and align with the site's published automation policy; a minimal worker sketch follows this list.
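
The driver setup below is a minimal sketch of these ideas, assuming Selenium with headless Chrome; the user-agent strings, proxy endpoints, and delay bounds are illustrative placeholders, not the values used in production.

```python
# Minimal acquisition-worker sketch, assuming Selenium with headless Chrome;
# user agents, proxy endpoints, and delay bounds are illustrative placeholders.
import random
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
]
PROXY_POOL = ["proxy-a.example.com:8080", "proxy-b.example.com:8080"]  # hypothetical pool


def build_driver(use_proxy: bool = False) -> webdriver.Chrome:
    """Create a headless Chrome session with a rotated user agent and optional proxy."""
    options = Options()
    options.add_argument("--headless=new")
    options.add_argument(f"--user-agent={random.choice(USER_AGENTS)}")
    if use_proxy:
        options.add_argument(f"--proxy-server=http://{random.choice(PROXY_POOL)}")
    return webdriver.Chrome(options=options)


def polite_pause(low: float = 1.5, high: float = 4.0) -> None:
    """Sleep for a small randomized interval between actions."""
    time.sleep(random.uniform(low, high))
```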

Document Intelligence

  • The PDF pipeline merges structured XFA parsing with table-reconstruction fallbacks to capture company metadata, certifications, and classification codes even when layouts drift.
  • A normalization step turns the extracted payloads into tidy records that can be appended to regional CSV batches without manual cleaning (see the sketch after this list).
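
As an illustration of the two-tier extraction, here is a minimal sketch assuming pypdf for text extraction; the field labels and regular expressions are placeholders, and the real XFA parsing and table reconstruction are not reproduced here.

```python
# Minimal extraction/normalization sketch, assuming pypdf; the field labels and
# patterns are illustrative stand-ins for the real XFA/table pipeline.
import re

from pypdf import PdfReader


def extract_attestation(path: str) -> dict:
    """Turn one attestation PDF into a tidy record ready for the regional CSV."""
    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # Primary path: capture labelled fields directly from the extracted text.
    patterns = {
        "company": r"Denominazione[:\s]+(.+)",  # assumed label
        "vat": r"Partita IVA[:\s]+(\d{11})",    # assumed label
    }
    record = {name: None for name in patterns}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            record[name] = match.group(1).strip()

    # Fallback path: keep the raw lines so a table-reconstruction pass can retry.
    if record["company"] is None:
        record["raw_lines"] = [line for line in text.splitlines() if line.strip()]
    return record
```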

Data Products & Observability

  • Regional CSV exports are produced incrementally, supporting resumable campaigns without duplicate entries (a minimal sketch follows this list).
  • Audit logs consolidate counts of visited companies, successful extractions, fallback paths, and elapsed time, offering a complete operational trail.
  • Aggregated metrics feed dashboards so stakeholders can gauge progress and quality at a glance.
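
A minimal sketch of the incremental export, assuming pandas and one CSV per region; the deduplication key ("vat") is an assumption made only for this example.

```python
# Minimal resumable-export sketch, assuming pandas; "vat" as the dedup key is an
# illustrative assumption, not necessarily the production key.
import os

import pandas as pd


def append_records(region_csv: str, records: list[dict], key: str = "vat") -> int:
    """Append only unseen records to a regional CSV and report how many were written."""
    if not records:
        return 0
    new = pd.DataFrame.from_records(records)
    if os.path.exists(region_csv):
        seen = set(pd.read_csv(region_csv, usecols=[key])[key].astype(str))
        new = new[~new[key].astype(str).isin(seen)]
        new.to_csv(region_csv, mode="a", header=False, index=False)
    else:
        new.to_csv(region_csv, index=False)
    return len(new)
```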

Browser Automation Strategy

  • Why Selenium: the target portal renders critical information via dynamic JavaScript widgets; simple HTTP scraping would miss large portions of the dataset. Selenium guarantees full DOM execution and honours the site's automation interface.
  • Compliance: the service explicitly authorises Selenium-based access for data portability. Credentials, consent, and rate limits are enforced inside the platform.
  • Rotating proxies: long-running campaigns can route requests through vetted proxy pools to distribute load and respect geographic throttling rules.
  • Politeness by design: deliberate pauses and capped concurrency keep request rates within the boundaries established by the site (see the sketch after this list).
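
A minimal pacing sketch of that last point: a fixed worker cap plus randomized pauses. The limit of three concurrent sessions and the delay bounds are illustrative, and visit_company stands in for the real per-company routine.

```python
# Minimal concurrency-cap sketch; the worker limit and delay bounds are
# illustrative, and visit_company is a placeholder for the real routine.
import random
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 3  # hard cap on simultaneous browser sessions


def visit_company(company_id: str) -> str:
    """Placeholder for the per-company Selenium routine."""
    time.sleep(random.uniform(1.5, 4.0))  # deliberate pause between actions
    return company_id


def run_campaign(company_ids: list[str]) -> None:
    """Process companies with capped concurrency so request rates stay bounded."""
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        list(pool.map(visit_company, company_ids))
```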

Cloud Footprint

  • Containerised workers run on AWS Fargate behind an Application Load Balancer, enabling horizontal scaling without server maintenance.
  • Autoscaling reacts to queue depth and CPU metrics, adding workers only when regional backlogs grow.
  • CSV outputs land in versioned S3 buckets, while logs and telemetry stream into CloudWatch for monitoring and alerting.
  • Secrets (proxy credentials, storage keys) stay in AWS Secrets Manager and are injected at runtime; a minimal boto3 sketch follows this list.
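
The storage and secrets path could look like the sketch below, assuming boto3; the bucket name, object layout, and secret name are hypothetical placeholders.

```python
# Minimal storage/secrets sketch, assuming boto3; bucket, key layout, and secret
# name are hypothetical placeholders.
import json

import boto3


def upload_region_csv(local_path: str, region: str) -> None:
    """Push a finished regional CSV into the versioned results bucket."""
    s3 = boto3.client("s3")
    s3.upload_file(local_path, "soa-attestations-results", f"exports/{region}.csv")


def load_proxy_credentials(secret_name: str = "soa/proxy-credentials") -> dict:
    """Fetch proxy credentials injected through AWS Secrets Manager at runtime."""
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])
```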

Orchestration Diagram

Resilience & Ethical Scraping

  • Selenium exceptions, malformed PDFs, and recurring pop-ups trigger contained restarts so campaigns continue without supervision.
  • Each processed company is flagged in region registers, enabling safe resumptions after pauses or maintenance windows (a minimal sketch follows this list).
  • Respectful pacing and proxy diversity maintain long-term access by adhering to the norms communicated by the service.
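
A minimal sketch of the restart-and-resume logic, assuming a JSON register per region and Selenium's WebDriverException as the failure trigger; the register layout and retry count are illustrative assumptions.

```python
# Minimal restart/resume sketch; the JSON register layout and retry count are
# illustrative assumptions.
import json
import os

from selenium.common.exceptions import WebDriverException


def already_processed(register_path: str, company_id: str) -> bool:
    """Check the region register so resumed campaigns skip finished companies."""
    if not os.path.exists(register_path):
        return False
    with open(register_path) as fh:
        return company_id in json.load(fh)


def mark_processed(register_path: str, company_id: str) -> None:
    """Record a finished company in the region register."""
    done = []
    if os.path.exists(register_path):
        with open(register_path) as fh:
            done = json.load(fh)
    done.append(company_id)
    with open(register_path, "w") as fh:
        json.dump(done, fh)


def process_with_restart(company_id: str, worker, retries: int = 2):
    """Retry a company after a contained browser restart instead of aborting the run."""
    for attempt in range(retries + 1):
        try:
            return worker(company_id)
        except WebDriverException:
            if attempt == retries:
                raise  # give up on this company; the campaign moves on
```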

Future Directions

  • Broaden proxy health checks with automatic failover lists.
  • Add anomaly-detection dashboards to highlight sudden shifts in attestation volumes.
  • Explore lightweight headless browsers for low-volume spot checks while keeping Selenium for full campaigns.