feat: AsyncPlasmateCrawlerStrategy - lightweight alternative to Playwright (no Chrome) #1906

Open
dbhurley wants to merge 1 commit into unclecode:develop from dbhurley:feat/plasmate-crawler-strategy

@dbhurley dbhurley commented Apr 8, 2026

Summary

Adds AsyncPlasmateCrawlerStrategy, a drop-in alternative to AsyncPlaywrightCrawlerStrategy using Plasmate instead of Chrome.

Directly addresses #1256 (memory leak in Docker from Chrome); related to #1874 (token usage tracking).

What Plasmate is

Open-source Rust browser engine (Apache 2.0). Fetches pages and returns them as Structured Object Model (SOM): a compact, semantically clean representation with nav, ads, cookie banners, and boilerplate stripped. Install: pip install plasmate.

Compression measured across 45 real sites: 17.7× average, 77× peak. Every token saved before the LLM is a direct cost reduction.

Drop-in usage

import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_plasmate_strategy import AsyncPlasmateCrawlerStrategy

strategy = AsyncPlasmateCrawlerStrategy(
    output_format="markdown",     # text | markdown | som | links
    timeout=30,
    fallback_to_playwright=True,  # retry with Playwright for JS-heavy SPAs
)

async def main():
    # "async with" and "await" are only valid inside a coroutine
    async with AsyncWebCrawler(crawler_strategy=strategy) as crawler:
        result = await crawler.arun("https://docs.python.org/3/")
        print(result.markdown[:500])

asyncio.run(main())
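The fallback_to_playwright option can be pictured as a try-then-retry wrapper around two fetchers. The following is a minimal sketch of that pattern only, not the code in this PR: fetch_with_fallback, static_fetch, browser_fetch, and the "empty output counts as failure" heuristic are all hypothetical names and assumptions introduced for illustration.

```python
import asyncio
from typing import Awaitable, Callable

# A fetcher takes a URL and returns page text (hypothetical interface).
Fetcher = Callable[[str], Awaitable[str]]


async def fetch_with_fallback(url: str, primary: Fetcher, fallback: Fetcher,
                              min_chars: int = 1) -> str:
    """Try the lightweight fetcher first; retry with the heavier one on failure.

    "Failure" here means an exception or output shorter than min_chars.
    That heuristic is an assumption of this sketch, not taken from the PR.
    """
    try:
        text = await primary(url)
        if len(text.strip()) >= min_chars:
            return text
    except Exception:
        pass  # fall through to the fallback fetcher
    return await fallback(url)


# Stand-in fetchers (hypothetical). The first simulates a page whose content
# only appears after JavaScript runs, so a non-JS fetch comes back empty.
async def static_fetch(url: str) -> str:
    return "" if "spa" in url else f"static content of {url}"


async def browser_fetch(url: str) -> str:
    return f"rendered content of {url}"


if __name__ == "__main__":
    for page in ("https://example.com/docs", "https://example.com/spa"):
        print(asyncio.run(fetch_with_fallback(page, static_fetch, browser_fetch)))
```

Keeping the fallback decision in one wrapper means callers see a single fetch call regardless of which backend produced the result.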

What changed

| File | Change |
| --- | --- |
| crawl4ai/async_plasmate_strategy.py | New AsyncPlasmateCrawlerStrategy implementing AsyncCrawlerStrategy ABC |
| crawl4ai/__init__.py | Export AsyncPlasmateCrawlerStrategy |
| tests/general/test_plasmate_strategy.py | 20 unit tests (init, cmd building, crawl, fallback, concurrency) |
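To make the "cmd building" part of the tests concrete, here is a hedged sketch of how constructor options might be turned into an argument list, and how a unit test could check it. Only the --selector, --header, and --timeout flag names and the four output formats come from this PR. The function name build_plasmate_cmd, the "fetch" subcommand, the "--format" flag, and the "Name: value" header encoding are assumptions made for illustration and may not match the real Plasmate CLI or the code under review.

```python
from typing import Dict, List, Optional

# Output formats listed in the PR description.
ALLOWED_FORMATS = {"text", "markdown", "som", "links"}


def build_plasmate_cmd(url: str,
                       output_format: str = "text",
                       selector: Optional[str] = None,
                       headers: Optional[Dict[str, str]] = None,
                       timeout: int = 30) -> List[str]:
    """Build an argv list for a subprocess call (hypothetical helper)."""
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")

    # ASSUMPTION: the "fetch" subcommand and "--format" flag are illustrative.
    cmd = ["plasmate", "fetch", url, "--format", output_format,
           "--timeout", str(timeout)]
    if selector:
        cmd += ["--selector", selector]
    # ASSUMPTION: one --header flag per header, encoded as "Name: value".
    for name, value in (headers or {}).items():
        cmd += ["--header", f"{name}: {value}"]
    return cmd


def test_build_cmd_includes_optional_flags() -> None:
    cmd = build_plasmate_cmd("https://example.com", output_format="markdown",
                             selector="main", headers={"Accept": "text/html"})
    assert cmd[:3] == ["plasmate", "fetch", "https://example.com"]
    assert cmd[cmd.index("--selector") + 1] == "main"
    assert cmd[cmd.index("--header") + 1] == "Accept: text/html"
```

Passing an argument list rather than a shell string avoids quoting problems when URLs or selectors contain spaces or shell metacharacters.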

Comparison

| | AsyncPlaywrightCrawlerStrategy | AsyncPlasmateCrawlerStrategy |
| --- | --- | --- |
| RAM per session | ~300MB | ~64MB |
| Chrome required | Yes | No |
| Tokens per page (avg) | ~75,000 (raw HTML) | ~4,200 (SOM/text) |
| JS rendering | Yes | No (use fallback_to_playwright=True) |
| Install | playwright install (~300MB browser) | pip install plasmate |
| Persistent process | Yes (browser stays alive) | No (subprocess per fetch) |
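The figures in the comparison can be turned into rough per-page and per-host estimates. The sketch below is back-of-the-envelope arithmetic only: the token and RAM numbers are the approximate values from the table, while the price per million tokens and the 2048 MB memory budget are hypothetical inputs chosen for illustration.

```python
def compression_ratio(raw_tokens: int, compact_tokens: int) -> float:
    """How many times smaller the compact representation is."""
    return raw_tokens / compact_tokens


def input_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """LLM input cost for a given token count."""
    return tokens / 1_000_000 * usd_per_million_tokens


def sessions_in_budget(budget_mb: int, mb_per_session: int) -> int:
    """Whole sessions that fit in a memory budget."""
    return budget_mb // mb_per_session


# Approximate figures from the comparison table.
RAW_TOKENS, SOM_TOKENS = 75_000, 4_200
PLAYWRIGHT_MB, PLASMATE_MB = 300, 64

# Hypothetical inputs, not from the PR.
PRICE_PER_MILLION = 3.0
BUDGET_MB = 2048

if __name__ == "__main__":
    ratio = compression_ratio(RAW_TOKENS, SOM_TOKENS)
    saved = (input_cost_usd(RAW_TOKENS, PRICE_PER_MILLION)
             - input_cost_usd(SOM_TOKENS, PRICE_PER_MILLION))
    print(f"compression: {ratio:.1f}x, saved per page: ${saved:.4f}")
    print("sessions in budget:",
          sessions_in_budget(BUDGET_MB, PLAYWRIGHT_MB), "vs",
          sessions_in_budget(BUDGET_MB, PLASMATE_MB))
```

With the table's averages the ratio comes out near 17.9, in line with the 17.7× average quoted in the description.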

Notes

  • No breaking changes: existing AsyncPlaywrightCrawlerStrategy usage is untouched
  • fallback_to_playwright=True makes it safe for mixed static/SPA crawls
  • Subprocess runs in asyncio executor: fully non-blocking, safe for concurrent gather() calls
  • Tested with Python 3.9+
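The executor note above describes a common asyncio pattern: hand a blocking subprocess call to a thread pool so the event loop stays free, then fan out with gather(). The sketch below shows the general pattern under stated assumptions; it is not the PR's implementation. The helper names are hypothetical, and the child command is the current Python interpreter standing in for the real fetch binary.

```python
import asyncio
import subprocess
import sys
from typing import List


def run_blocking(cmd: List[str], timeout: float = 30) -> str:
    """Run a command synchronously and return its stdout."""
    completed = subprocess.run(cmd, capture_output=True, text=True,
                               timeout=timeout, check=True)
    return completed.stdout


async def run_in_executor(cmd: List[str], timeout: float = 30) -> str:
    """Run the blocking call in the default thread pool executor."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, run_blocking, cmd, timeout)


async def run_many(cmds: List[List[str]]) -> List[str]:
    """Launch all commands concurrently; results keep the input order."""
    return await asyncio.gather(*(run_in_executor(c) for c in cmds))


def demo_cmd(label: str) -> List[str]:
    # Stand-in child process: prints its label and exits.
    return [sys.executable, "-c", f"print({label!r})"]


if __name__ == "__main__":
    outputs = asyncio.run(run_many([demo_cmd("a"), demo_cmd("b"), demo_cmd("c")]))
    print([o.strip() for o in outputs])
```

One design consequence worth noting: concurrency is bounded by the executor's worker count, so very large gather() batches queue behind the thread pool rather than spawning unlimited subprocesses.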

…laywright

Closes unclecode#1256 (memory leak in Docker from Chrome)
Related to unclecode#1874 (token usage tracking)

Plasmate (https://github.com/plasmate-labs/plasmate) is an open-source
Rust browser engine that replaces Chrome/Playwright for static pages.
No browser process, ~64MB RAM vs ~300MB, 10-100x fewer tokens per page.

Changes:
- crawl4ai/async_plasmate_strategy.py: AsyncPlasmateCrawlerStrategy
  - Implements AsyncCrawlerStrategy ABC (drop-in replacement)
  - Supports output_format: text (default), markdown, som, links
  - Supports --selector, --header, --timeout flags
  - Optional fallback_to_playwright=True for JS-heavy SPAs
  - Subprocess runs in asyncio executor, safe for concurrent use
- crawl4ai/__init__.py: export AsyncPlasmateCrawlerStrategy
- tests/general/test_plasmate_strategy.py: 20 unit tests

Install: pip install plasmate

Usage:
  from crawl4ai import AsyncWebCrawler
  from crawl4ai.async_plasmate_strategy import AsyncPlasmateCrawlerStrategy

  strategy = AsyncPlasmateCrawlerStrategy(
      output_format="markdown",
      fallback_to_playwright=True,   # SPA safety net
  )
  async with AsyncWebCrawler(crawler_strategy=strategy) as crawler:
      result = await crawler.arun("https://docs.python.org/3/")