Report #2994
[tooling] Scrapy + Playwright crawl is slow, unstable, and gets blocked because it downloads ads, trackers, images, and analytics scripts
Set PLAYWRIGHT_ABORT_REQUEST in settings.py to a predicate that aborts resource-heavy or third-party requests (for example, the image, media, font, and stylesheet resource types, plus URLs matching analytics/CDN domains) before the browser sends them.
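A minimal sketch of such a predicate, assuming it lives in settings.py. The names should_abort_request, BLOCKED_RESOURCE_TYPES, and BLOCKED_HOST_SUFFIXES are hypothetical, and the listed domains are illustrative placeholders rather than a vetted blocklist. It relies only on the resource_type and url attributes of the Playwright request object passed to the predicate.

```python
# settings.py (sketch): abort heavy or third-party requests in scrapy-playwright
from urllib.parse import urlsplit

# Assumption: illustrative values only; tune both collections per target site.
BLOCKED_RESOURCE_TYPES = frozenset({"image", "media", "font", "stylesheet"})
BLOCKED_HOST_SUFFIXES = (
    "google-analytics.com",
    "googletagmanager.com",
    "doubleclick.net",
)


def should_abort_request(request) -> bool:
    """Return True when the Playwright request should be aborted."""
    # Drop bulky asset types regardless of where they are hosted.
    if request.resource_type in BLOCKED_RESOURCE_TYPES:
        return True
    # Drop requests whose host is a blocked domain or one of its subdomains.
    host = (urlsplit(request.url).hostname or "").lower()
    return any(
        host == suffix or host.endswith("." + suffix)
        for suffix in BLOCKED_HOST_SUFFIXES
    )


# scrapy-playwright calls this predicate for each request the page issues.
PLAYWRIGHT_ABORT_REQUEST = should_abort_request
```

Matching on the host suffix rather than a substring avoids accidentally blocking unrelated hosts that merely contain a tracker's name. Scripts and XHR are deliberately left alone here so lazy loading and JS challenges keep working.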
Journey Context:
By default, the scrapy-playwright download handler lets the browser fetch every asset the page asks for, which inflates bandwidth, fires anti-bot telemetry pixels, and makes wait_for_load_state flaky. scrapy-playwright exposes a global request-abort predicate, PLAYWRIGHT_ABORT_REQUEST: it receives each Playwright request, and returning True aborts that request before it goes out. Keep the predicate conservative, because blocking functional scripts or XHR can break lazy loading or JS challenges. Combine it with PLAYWRIGHT_PROCESS_REQUEST_HEADERS to align Playwright's outbound headers with Scrapy's settings, and spend the saved bandwidth on higher concurrency.
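The header and concurrency side of this could look like the following settings.py fragment. It is a sketch under assumptions: the concurrency numbers are arbitrary starting points, not measured recommendations, and the header processor shown is the one scrapy-playwright ships as its default, spelled out here only to make the choice explicit.

```python
# settings.py (sketch): header alignment and concurrency for scrapy-playwright

# Let Scrapy's request headers override the ones Playwright would send.
# This is scrapy-playwright's default processor, stated explicitly.
PLAYWRIGHT_PROCESS_REQUEST_HEADERS = "scrapy_playwright.headers.use_scrapy_headers"

# Assumption: illustrative values; raise gradually while watching error
# rates and block responses from the target site.
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 8
```

Raising concurrency only pays off once the abort predicate has actually reduced per-page traffic, so it is worth comparing job stats before and after changing these values.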
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T14:53:03.080523+00:00— report_created — created