Report #2384
[tooling] Scrapy + Playwright is slow and burns bandwidth on images/CSS/fonts
In scrapy-playwright, set PLAYWRIGHT_ABORT_REQUEST = lambda req: req.resource_type in {'image', 'stylesheet', 'font', 'media'} in settings.py. The predicate is called for each request the browser issues, and any request it returns True for is aborted before it downloads, while JavaScript still runs and data extraction still works. Pair it with PLAYWRIGHT_MAX_PAGES_PER_CONTEXT, which caps the number of concurrent pages per browser context, to bound memory.
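The same setting can be written as a named function, which is easier to extend and to test than a lambda. This is a minimal sketch of a settings.py fragment, assuming the scrapy-playwright download handler is already configured; the function name should_abort_request and the page limit of 4 are illustrative choices, not values from the report.

```python
# settings.py (fragment): a sketch, assuming scrapy-playwright is already set up.

# Resource types to drop, taken from the workaround above.
BLOCKED_RESOURCE_TYPES = {"image", "stylesheet", "font", "media"}


def should_abort_request(request):
    """Return True to abort the request; `request` is a Playwright Request."""
    return request.resource_type in BLOCKED_RESOURCE_TYPES


# scrapy-playwright calls this predicate for each request the browser makes.
PLAYWRIGHT_ABORT_REQUEST = should_abort_request

# Illustrative value (an assumption): tune it to the memory available.
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4
```

Requests of type script, xhr, fetch, and document are not in the set, so they pass through and the page can still render its data.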
Journey Context:
The naive Scrapy+Playwright integration loads full pages, including megabytes of images, ads, and CSS. That is pointless for data extraction and makes crawling slower and costlier. Playwright's route.abort, exposed via PLAYWRIGHT_ABORT_REQUEST, drops requests by resource type, often cutting page weight by 70-90% and skipping requests to third-party asset hosts that might flag you. A common mistake is to filter URLs with a regex after the request has already started; aborting by resource_type in the request handler stops the request before it is sent, so the network round-trip never happens. This is the right call for any JS-rendered crawl where you only need text/DOM, not pixels. Note that a page whose scripts depend on a loaded stylesheet may behave differently, so check the extracted data after enabling the filter.
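If some requests need to be dropped by host rather than by type (for example, scripts from an ad server), the URL check can sit in the same predicate, so it also runs before the request is sent. The sketch below assumes a hypothetical host, ads.example.com, and a hypothetical function name; neither comes from the report.

```python
import re
from urllib.parse import urlparse

BLOCKED_TYPES = {"image", "stylesheet", "font", "media"}

# Hypothetical blocklist: matches ads.example.com and any of its subdomains.
BLOCKED_HOST = re.compile(r"(^|\.)ads\.example\.com$")


def abort_by_type_or_host(request):
    """Abort on resource type first, then fall back to a host check."""
    if request.resource_type in BLOCKED_TYPES:
        return True
    host = urlparse(request.url).hostname or ""
    return bool(BLOCKED_HOST.search(host))
```

Setting PLAYWRIGHT_ABORT_REQUEST = abort_by_type_or_host in settings.py would apply both checks; the type check comes first because it is cheaper and covers most of the page weight.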
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T11:50:42.668253+00:00— report_created — created