Crawl4AI
Verified against crawl4ai VERSION. PEP 723 pins in scripts/*.py and tests/*.py floor at that
version.
Overview
Crawl4AI wraps a headless browser (Playwright) plus a markdown-aware content pipeline. Use it when defuddle/curl can't reach the content — JavaScript-rendered pages, login-gated content, infinite scroll, multi-URL concurrency, repeatable schema-based extraction.
This skill exposes both interfaces of the underlying library:
- CLI (crwl) for quick, scriptable commands: CLI Guide
- Python SDK for full programmatic control: SDK Guide
Invoked with a URL argument
When the user runs /crawl4ai <url> with a single URL and no further qualifier, treat it as the JS-heavy fetch case and
default to:
crwl <url> -c "wait_until=networkidle,page_timeout=60000" -o markdown
wait_until=networkidle waits for the network to be quiet for ~500ms post-load — the right default when the user hasn't
named a specific element on a JS-rendered page. (Avoid wait_for=css:body: <body> exists at t=0 on every HTML
response, so it's satisfied before JS renders content.) Then return the markdown to the agent context. Adjust to
wait_for=css:<selector> if the user named a specific element. Skip the default and route to the relevant section below
for any task that names extraction, batch / multi-URL, login / session, screenshot / PDF, or URL discovery — those each
have their own pipeline. If the URL is clearly static (a docs page, a blog post), route the user to /fetch-web instead
per the "When NOT to use" section below.
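The default routing above can be sketched as a small argv builder. This is a minimal illustration, not part of the skill's scripts: the function name build_default_command is hypothetical, and only the crwl flags already shown above are used.

```python
import shlex
from typing import List, Optional


def build_default_command(url: str, selector: Optional[str] = None) -> List[str]:
    """Return the crwl argv for a bare `/crawl4ai <url>` invocation."""
    if selector:
        # The user named an element: wait for it instead of network idle.
        wait = f"wait_for=css:{selector}"
    else:
        # No element named: wait until the network goes quiet.
        wait = "wait_until=networkidle"
    return ["crwl", url, "-c", f"{wait},page_timeout=60000", "-o", "markdown"]


if __name__ == "__main__":
    # shlex.join renders the argv as a copy-pasteable shell line.
    print(shlex.join(build_default_command("https://example.com")))
    print(shlex.join(build_default_command("https://example.com", ".article-body")))
```

Building an argv list rather than a single string keeps the URL and the config value as separate arguments, so neither needs manual shell quoting when passed to subprocess.run.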
When NOT to use this skill
- Static HTML pages (most documentation sites, blog posts, news articles, tweets): use /fetch-web or defuddle directly. Static extraction is ~0ms cold start; crawl4ai pays a ~2s browser startup tax.
- Local file conversion (.pdf, .docx, .pptx, .epub): use /markdown-convert.
- One-URL agent-context reads (the agent just needs to read this page): use /fetch-web and let it route to defuddle.
- Mutating UI flows (form fills, multi-step clicks, login + navigation): /browse (gstack's persistent headless Chromium) is built for that.
When stuck
For unknown crwl/SDK flags, scrape failures, or extraction edge cases the references don't cover, see references/escalation.md for the lookup order (qmd solutions → upstream docs → GitHub issues → ask the user) and worked examples.
Quick Start
Installation
pip install crawl4ai
crawl4ai-setup
# Verify installation
crawl4ai-doctor
CLI (Recommended)
# Basic crawling - returns markdown
crwl https://example.com
# Get markdown output
crwl https://example.com -o markdown
# JSON output with cache bypass
crwl https://example.com -o json -v --bypass-cache
# See more examples
crwl --example
Python SDK
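import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # The context manager starts the browser and shuts it down on exit
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:500])  # first 500 characters of the markdown

asyncio.run(main())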
For SDK configuration details: SDK Guide - Configuration.
Core Concepts
Configuration Layers
Both CLI and SDK use the same underlying configuration:
| Concept | CLI | SDK |
|---|---|---|
| Browser settings | -B browser.yml or -b "param=value" | BrowserConfig(...) |
| Crawl settings | -C crawler.yml or -c "param=value" | CrawlerRunConfig(...) |
| Extraction | -e extract.yml -s schema.json | extraction_strategy=... |
| Content filter | -f filter.yml | markdown_generator=... |
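To make the correspondence between the inline -c "param=value" form and SDK keyword arguments concrete, here is a minimal sketch of turning such a string into a dict. It is not crwl's actual parser (whose coercion rules may differ); parse_inline_config is a hypothetical helper, and it assumes values contain no commas.

```python
def parse_inline_config(spec: str) -> dict:
    """Split 'k=v,k=v' into a dict, coercing obvious booleans and integers."""
    out = {}
    for pair in spec.split(","):
        if not pair.strip():
            continue  # tolerate trailing commas
        # Split on the first '=' only, so values like 'css:.a=b' stay whole.
        key, _, raw = pair.partition("=")
        raw = raw.strip()
        if raw.lower() in ("true", "false"):
            value = raw.lower() == "true"
        elif raw.lstrip("-").isdigit():
            value = int(raw)
        else:
            value = raw
        out[key.strip()] = value
    return out


if __name__ == "__main__":
    kwargs = parse_inline_config(
        "wait_for=css:.ajax-content,scan_full_page=true,page_timeout=60000"
    )
    print(kwargs)
```

Assuming the keys match CrawlerRunConfig parameter names, the resulting dict is the shape one would pass as CrawlerRunConfig(**kwargs) on the SDK side.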
Key Parameters
Browser Configuration:
- headless: Run with/without GUI
- viewport_width/height: Browser dimensions
- user_agent: Custom user agent
- proxy_config: Proxy settings
Crawler Configuration:
- page_timeout: Max page load time (ms)
- wait_for: CSS selector or JS condition to wait for
- cache_mode: bypass, enabled, disabled
- js_code: JavaScript to execute
- css_selector: Focus on specific element
For complete parameters: CLI Config | SDK Config
Output Content
Every crawl returns:
- markdown - Clean, formatted markdown
- html - Raw HTML
- links - Internal and external links discovered
- media - Images, videos, audio found
- extracted_content - Structured data (if extraction configured)
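A quick way to inspect these fields is a small summary helper. The sketch below runs against a stand-in object rather than a real crawl; the dict keys "internal", "external", and "images" are assumptions about the shape of links and media, so check them against the SDK reference before relying on them.

```python
from types import SimpleNamespace


def summarize(result) -> dict:
    """Count what a crawl result carries, tolerating missing fields."""
    links = result.links or {}
    media = result.media or {}
    return {
        "markdown_chars": len(str(result.markdown or "")),
        "html_chars": len(result.html or ""),
        "internal_links": len(links.get("internal", [])),  # assumed key
        "external_links": len(links.get("external", [])),  # assumed key
        "images": len(media.get("images", [])),  # assumed key
        "has_extraction": bool(result.extracted_content),
    }


# Stand-in for a crawl result, so the sketch runs without a browser.
fake = SimpleNamespace(
    markdown="# Title\n\nBody",
    html="<h1>Title</h1><p>Body</p>",
    links={"internal": [{"href": "/a"}], "external": []},
    media={"images": []},
    extracted_content=None,
)

if __name__ == "__main__":
    print(summarize(fake))
```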
Markdown Generation (Primary Use Case)
Crawl4AI excels at generating clean, well-formatted markdown.
CLI
crwl https://docs.example.com -o markdown # raw markdown
crwl https://docs.example.com -o markdown-fit # filtered (noise removed)
crwl https://docs.example.com -f templates/filter_bm25.yml -o markdown-fit # BM25-relevance filter
crwl https://docs.example.com -f templates/filter_pruning.yml -o markdown-fit # quality-based filter
Filter templates: templates/filter_bm25.yml (relevance-scored against a query),
templates/filter_pruning.yml (no query, prunes low-quality blocks).
Python SDK
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Score content blocks against the query; blocks under the threshold are dropped
bm25_filter = BM25ContentFilter(user_query="machine learning", bm25_threshold=1.0)
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)

# Run inside an async function with an open AsyncWebCrawler (see Quick Start)
result = await crawler.arun(url, config=config)
print(result.markdown.fit_markdown)  # Filtered
print(result.markdown.raw_markdown)  # Original
For filter selection and config field reference, see Content Filters.
Data Extraction
1. Schema-Based CSS Extraction (Most Efficient)
No LLM required at extract time — fast, deterministic, cost-free. One-time LLM cost to derive the schema, then reuse indefinitely. The bundled scripts split the pipeline by responsibility:
./scripts/generate_schema.py https://shop.example.com "products with name, price, image" shop_schema.json
./scripts/extract_with_schema.py https://shop.example.com shop_schema.json products.json
Or via the CLI with the YAML strategy template + the saved schema:
crwl https://shop.example.com -e templates/extract_css.yml -s shop_schema.json -o json
Schema skeleton: templates/css_schema.json. Strategy YAML:
templates/extract_css.yml.
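For orientation, a saved schema for the shop example might look like the sketch below. The key names (name, baseSelector, fields, type, attribute) follow the upstream JsonCssExtractionStrategy documentation as I understand it; the selectors are hypothetical, and templates/css_schema.json remains the authoritative skeleton. The validate_schema helper is illustrative only and is not part of the bundled scripts.

```python
import json

# Hypothetical selectors for an imagined product listing page.
schema = {
    "name": "Products",
    "baseSelector": "div.product-card",  # one match per repeated item
    "fields": [
        {"name": "name", "selector": "h2.title", "type": "text"},
        {"name": "price", "selector": "span.price", "type": "text"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
    ],
}


def validate_schema(s: dict) -> list:
    """Return a list of human-readable problems; empty means it looks sane."""
    problems = [f"missing key: {k}" for k in ("name", "baseSelector", "fields") if k not in s]
    for field in s.get("fields", []):
        label = field.get("name", "<unnamed>")
        for k in ("name", "selector", "type"):
            if k not in field:
                problems.append(f"field {label}: missing {k}")
        if field.get("type") == "attribute" and "attribute" not in field:
            problems.append(f"field {label}: type 'attribute' needs an 'attribute' key")
    return problems


if __name__ == "__main__":
    print(validate_schema(schema))
    print(json.dumps(schema, indent=2))
```

Checking a generated schema before reuse is cheap insurance, since a schema derived once by an LLM is then applied many times without further review.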
2. LLM-Based Extraction
For one-off / irregular content where a CSS schema is too brittle:
./scripts/extract_with_llm.py https://news.example.com "Extract headlines, dates, summaries" news.json
Or via the CLI with the strategy template:
crwl https://news.example.com -e templates/extract_llm.yml -o json
Strategy YAML: templates/extract_llm.yml. Pays an LLM call per URL — for repeat
extraction, prefer the schema pipeline above.
For extraction strategy reference: Extraction Strategies.
Advanced Patterns
Dynamic Content (JavaScript-Heavy Sites)
crwl https://example.com -c "wait_for=css:.ajax-content,scan_full_page=true,page_timeout=60000"
crwl https://example.com -C templates/crawler.yml # all options in a YAML file
Crawler config template: templates/crawler.yml.
Multi-URL Processing
./scripts/batch_crawl.py urls.txt --max-concurrent 5 --out batch_markdown/
./scripts/batch_extract.py urls.txt shop_schema.json --max-concurrent 5 --out products.json
The two scripts split on responsibility: batch_crawl.py returns markdown per URL; batch_extract.py returns
schema-extracted JSON per URL. Python equivalent uses arun_many():
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
results = await crawler.arun_many(urls, config=config)
For batch processing reference: arun_many() Reference.
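The effect of a --max-concurrent limit can be illustrated with a plain asyncio semaphore. This is a sketch of the general technique, not how batch_crawl.py or arun_many() is implemented; crawl_all and fake_fetch are hypothetical names, and fake_fetch stands in for a real crawl call so the example runs without a browser.

```python
import asyncio


async def crawl_all(urls, fetch, max_concurrent=5):
    """Run fetch(url) for every URL with at most max_concurrent in flight."""
    sem = asyncio.Semaphore(max_concurrent)

    async def one(url):
        async with sem:  # blocks while max_concurrent tasks hold the semaphore
            try:
                return url, await fetch(url), None
            except Exception as exc:
                # One failing URL should not abort the rest of the batch.
                return url, None, exc

    # gather preserves input order in its result list.
    return await asyncio.gather(*(one(u) for u in urls))


async def fake_fetch(url):
    await asyncio.sleep(0.01)  # pretend network latency
    return f"# markdown for {url}"


urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
results = asyncio.run(crawl_all(urls, fake_fetch, max_concurrent=2))

if __name__ == "__main__":
    for url, markdown, error in results:
        print(url, "ok" if error is None else f"failed: {error}")
```

Returning the exception alongside the URL keeps per-URL failures visible to the caller, which matters when a batch mixes healthy and blocked sites.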
URL Discovery Before Crawl
When the URL list comes from a sitemap / domain rather than a known list, do discovery first, then feed the result into
batch_crawl.py / batch_extract.py. See URL Discovery for the full surface; quick
shape:
from crawl4ai import AsyncUrlSeeder, SeedingConfig

seeds = await AsyncUrlSeeder().urls("example.com", SeedingConfig(
    source="sitemap+cc", pattern="*/blog/*", query="machine learning", score_threshold=0.3, live_check=True,
))
urls = [s["url"] for s in seeds]
AsyncUrlSeeder is best when you want BM25-scored filtering against a query; DomainMapper is best when you want
maximum coverage of one domain.
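Before handing discovered URLs to the batch scripts, it can help to re-check the pattern and de-duplicate. The sketch below approximates a pattern such as "*/blog/*" with shell-style globbing from the standard library; the seeder's own matching may differ. select_urls is a hypothetical helper, and the one-URL-per-line layout of urls.txt is an assumption about what the batch scripts expect.

```python
from fnmatch import fnmatchcase


def select_urls(seeds, pattern="*/blog/*"):
    """Keep seed URLs matching a glob pattern, de-duplicated in first-seen order."""
    seen = set()
    kept = []
    for seed in seeds:
        url = seed["url"]
        # fnmatchcase: '*' matches any run of characters, including '/'.
        if fnmatchcase(url, pattern) and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


# Stand-in seed records; real ones come from the seeder.
seeds = [
    {"url": "https://example.com/blog/intro"},
    {"url": "https://example.com/about"},
    {"url": "https://example.com/blog/intro"},
    {"url": "https://example.com/blog/part-2"},
]
urls = select_urls(seeds)
urls_txt = "\n".join(urls) + "\n"  # assumed urls.txt layout: one URL per line

if __name__ == "__main__":
    print(urls_txt, end="")
```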
Session & Authentication
Fill the login template, then reuse the session id on subsequent crawls:
crwl https://site.com/login -C templates/login_crawler.yml
crwl https://site.com/protected -c "session_id=user_session"
Login template: templates/login_crawler.yml (fill in the field-id selectors and the
post-login wait condition before use).
For session management reference: Advanced Features.
Anti-Detection & Proxies
crwl https://example.com -B templates/browser.yml
Browser config template: templates/browser.yml (uncomment proxy_config and init_scripts
as needed). For pre-page-load script injection (fingerprint patches that must fire before any site script), populate
init_scripts: rather than js_code: (which fires after the page loads). proxy_config works with both the browser
strategy and the non-browser HTTPCrawlerStrategy — the latter is the cheap path for static fetches behind a corporate
proxy.
Full surface (CDP attachment, undetected mode, init script patterns): Anti-Detection.
Rendering Cached HTML (raw: / file://)
If the agent already has HTML in hand (e.g., from defuddle or a previous crawl) and only needs a screenshot, PDF, or
MHTML render, skip the network fetch and pass the HTML directly. base_url controls relative-link resolution:
# Render HTML already in memory; base_url resolves relative links
result = await crawler.arun(
    url="raw:" + html_string,
    config=CrawlerRunConfig(base_url="https://example.com", screenshot=True, pdf=True),
)

# Render an HTML file from local disk
result = await crawler.arun(
    url="file:///path/to/page.html",
    config=CrawlerRunConfig(screenshot=True),
)
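Building the two URL forms by hand is error-prone for paths with spaces, so a small helper can be useful. This is a minimal sketch using only the standard library; raw_url and file_url are hypothetical names, and the file_url example assumes a POSIX-style absolute path.

```python
from pathlib import Path


def raw_url(html: str) -> str:
    """Prefix in-memory HTML so it is rendered without a network fetch."""
    return "raw:" + html


def file_url(path: str) -> str:
    """Turn an absolute local path into a file:// URL, percent-encoding as needed."""
    # as_uri() raises ValueError for relative paths, which surfaces mistakes early.
    return Path(path).as_uri()


if __name__ == "__main__":
    print(raw_url("<p>hi</p>"))
    print(file_url("/path/to/my page.html"))
```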
Common Use Cases
Eight worked end-to-end flows (docs page, JS-heavy SPA, e-commerce product extraction, news aggregation, topic-bound domain crawl, login-required content, render existing HTML, Q&A) live in Recipes. Pick the recipe closest to the task at hand and adapt.
Resources
Provided Scripts
| Script | Responsibility |
|---|---|
| scripts/basic_crawler.py <url> | One URL → markdown + screenshot |
| scripts/batch_crawl.py <urls.txt> | Many URLs → markdown files |
| scripts/batch_extract.py <urls.txt> <schema.json> | Many URLs + schema → JSON |
| scripts/generate_schema.py <url> "<instruction>" | Derive a reusable CSS schema (one-time LLM call) |
| scripts/extract_with_schema.py <url> <schema.json> | Apply a saved schema (no LLM) |
| scripts/extract_with_llm.py <url> "<instruction>" | Per-request LLM extraction (expensive; one-off only) |
Templates
YAML and JSON skeletons users copy and fill. All sit at the skill root under templates/:
| Template | Used for |
|---|---|
| templates/browser.yml | BrowserConfig (headless, proxy, user agent, init scripts) |
| templates/crawler.yml | CrawlerRunConfig (cache, wait, timeout, JS) |
| templates/extract_css.yml | JsonCssExtractionStrategy declaration |
| templates/extract_llm.yml | LLMExtractionStrategy declaration |
| templates/filter_bm25.yml | BM25 content filter (relevance-scored) |
| templates/filter_pruning.yml | Pruning content filter (quality-based, no query) |
| templates/login_crawler.yml | Session-establishing login flow |
| templates/css_schema.json | CSS schema skeleton |
Reference Documentation
| Document | Purpose |
|---|---|
| CLI Guide | Command-line interface reference |
| SDK Guide | Python SDK quick reference |
| Recipes | Eight worked end-to-end flows |
| URL Discovery | AsyncUrlSeeder, SeedingConfig, DomainMapper |
| Content Filters | BM25 vs Pruning vs LLMContentFilter — when to use which |
| Anti-Detection | init_scripts, proxy_config, undetected mode, CDP attachment |
| Troubleshooting | Symptoms, causes, fixes; what to try before escalating |
| Complete SDK Reference | Full API documentation (5900+ lines) |
| Escalation | Lookup order, iron rule, halt-vs-continue, worked examples |
Best Practices
- Start with CLI for quick tasks, SDK for automation
- Use schema-based extraction - 10-100x more efficient than LLM
- Enable caching during development - use --bypass-cache only when needed
- Set appropriate timeouts - 30s normal, 60s+ for JS-heavy sites
- Use content filters for cleaner, focused markdown
- Respect rate limits - Add delays between requests
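The last practice, spacing out requests, can be sketched as a sequential loop with a fixed pause. crawl_politely is a hypothetical helper and the one-second default is an arbitrary illustration, not a recommendation from the library; the sleep function is injectable so the loop can be tested without real waiting.

```python
import time


def crawl_politely(urls, fetch, delay_s=1.0, sleep=time.sleep):
    """Fetch URLs one at a time, pausing delay_s seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_s)  # no pause before the first request
        results.append((url, fetch(url)))
    return results


if __name__ == "__main__":
    # Stand-in fetch so the sketch runs without a browser.
    pages = crawl_politely(
        ["https://site1.com", "https://site2.com"],
        fetch=lambda u: f"# markdown for {u}",
        delay_s=0.1,
    )
    print(pages)
```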
Troubleshooting
For symptom → cause → fix tables (JS not loading, bot detection, empty extracted content, session not persisting, slow crawl, schema generation nonsense, post-upgrade regressions), see Troubleshooting. For unknown surface the references don't cover, follow Escalation.
For comprehensive API documentation, see Complete SDK Reference.
License
Dual-licensed under MIT OR Apache-2.0 at your option (SPDX: MIT OR Apache-2.0). See
LICENSE for the explainer + the carve-out for the upstream-mirrored references/complete-sdk-reference.md.
