Firecrawl

These example posts were automatically generated by PersonaBox from GitHub pull requests.

PDF parser modes: fast, auto, and OCR in JS and Python SDKs

The JS and Python SDKs now accept a mode parameter for PDF parsing: fast for text-layer extraction, ocr for scanned documents, and auto to pick the right method automatically. Pair it with maxPages to control both extraction quality and how many pages are parsed per request.
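As a rough illustration of combining the two settings, the sketch below builds a PDF parser option as a plain dictionary. The mode names and maxPages come from the entry above; the helper build_pdf_parser, the "type" key, and the overall shape of the dictionary are assumptions made for illustration, not the SDK's confirmed interface.

```python
from typing import Optional

# Mode names are taken from the changelog entry; the rest is assumed.
PDF_MODES = ("fast", "auto", "ocr")


def build_pdf_parser(mode: str, max_pages: Optional[int] = None) -> dict:
    """Build a hypothetical PDF parser option with a mode and an optional page cap."""
    if mode not in PDF_MODES:
        raise ValueError(f"mode must be one of {PDF_MODES}, got {mode!r}")
    parser = {"type": "pdf", "mode": mode}  # "type" key is an assumption
    if max_pages is not None:
        if max_pages < 1:
            raise ValueError("max_pages must be a positive integer")
        parser["maxPages"] = max_pages
    return parser


# Scanned contract: force OCR, but only read the first 10 pages.
scanned = build_pdf_parser("ocr", max_pages=10)
# Born-digital report: the text layer is enough, no page cap.
digital = build_pdf_parser("fast")
print(scanned)
print(digital)
```

Validating the mode on the client side is a design choice of this sketch; it surfaces typos before a request is sent.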

LLM-powered content cleaning with onlyCleanContent

The new onlyCleanContent boolean uses LLM cleaning to strip non-semantic page elements (nav menus, ads, cookie banners, sidebars, footers) from scraped markdown. Cleaning runs before JSON extraction and summary, so downstream steps get cleaner input. Available in the V1 and V2 APIs; defaults to false.
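A minimal sketch of opting in, assuming the flag is sent as a top-level field of the scrape request body. Only the onlyCleanContent name and its false default come from the entry above; build_scrape_payload and the "url" and "formats" fields are hypothetical.

```python
import json


def build_scrape_payload(url: str, only_clean_content: bool = False) -> dict:
    """Assemble a hypothetical scrape request body.

    only_clean_content defaults to False, matching the default stated above.
    """
    return {
        "url": url,                      # assumed field name
        "formats": ["markdown"],         # assumed field name and value
        "onlyCleanContent": only_clean_content,
    }


payload = build_scrape_payload("https://example.com/blog/post", only_clean_content=True)
print(json.dumps(payload, indent=2))
```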

PDF parser modes with Rust-powered text extraction

The PDF parser now accepts a mode parameter: "fast" for Rust-powered text extraction, "auto" for smart OCR fallback, or "ocr" to force OCR processing. Responses include pdf_type, confidence, and page_count so you can route documents based on their actual characteristics.
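One way to act on those response fields is a small routing function. The sketch below is illustrative only: the field names pdf_type, confidence, and page_count come from the entry above, but the "scanned" value, the 0 to 1 confidence scale, the thresholds, and the queue names are all assumptions.

```python
def route_pdf(meta: dict, min_confidence: float = 0.8, max_pages: int = 200) -> str:
    """Pick a hypothetical processing queue from parser response metadata."""
    if meta["page_count"] > max_pages:
        return "batch"            # too long for the interactive path
    if meta["confidence"] < min_confidence:
        return "manual_review"    # detection was unsure about the document type
    if meta["pdf_type"] == "scanned":   # assumed value for image-only PDFs
        return "ocr_postprocess"
    return "index"


# Sample metadata, invented for the example.
print(route_pdf({"pdf_type": "text", "confidence": 0.97, "page_count": 12}))
print(route_pdf({"pdf_type": "scanned", "confidence": 0.91, "page_count": 40}))
```

Checking page count first keeps very large documents off the interactive path regardless of how they were classified.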

Choose spark-1-pro or spark-1-mini for your agent jobs

You can now specify which model powers your agent jobs by passing model="spark-1-pro" or model="spark-1-mini" in your request. Use spark-1-mini for bulk, cost-efficient runs and spark-1-pro when output quality matters most. Available in Python and TypeScript SDKs.
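The guidance above can be sketched as a small selection rule. The two model names come from the entry; build_agent_request, the quality_critical flag, and the "prompt" field are hypothetical, and the dictionary is only a stand-in for whatever the SDK actually sends.

```python
AGENT_MODELS = ("spark-1-pro", "spark-1-mini")


def build_agent_request(prompt: str, quality_critical: bool) -> dict:
    """Choose a model per the guidance above and return a hypothetical request body."""
    # Pro when output quality matters most, mini for bulk, cost-efficient runs.
    model = "spark-1-pro" if quality_critical else "spark-1-mini"
    return {"prompt": prompt, "model": model}


bulk_job = build_agent_request("List product names on each page", quality_critical=False)
audit_job = build_agent_request("Summarize pricing terms", quality_critical=True)
print(bulk_job["model"])
print(audit_job["model"])
```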

Webhook support for agent jobs in Node.js and Python SDKs

The Node.js and Python SDKs now support webhooks for agent jobs. Receive HTTP callbacks for started, action, completed, failed, and cancelled events instead of polling. Configure webhook URLs, custom headers, metadata, and event filtering directly in your start_agent or agent calls.
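To make the moving parts concrete, here is a minimal sketch of a webhook configuration and a receiver-side dispatcher. The five event names come from the entry above; the configuration keys, the "event" field in the callback payload, and both helper functions are assumptions for illustration rather than the SDK's documented shapes.

```python
from typing import Callable, Dict, Iterable, Optional

# Event names are taken from the changelog entry.
AGENT_EVENTS = ("started", "action", "completed", "failed", "cancelled")


def build_webhook_config(
    url: str,
    events: Iterable[str],
    headers: Optional[dict] = None,
    metadata: Optional[dict] = None,
) -> dict:
    """Build a hypothetical webhook config with event filtering."""
    events = list(events)
    unknown = sorted(set(events) - set(AGENT_EVENTS))
    if unknown:
        raise ValueError(f"unknown agent events: {unknown}")
    return {
        "url": url,
        "events": events,
        "headers": headers or {},
        "metadata": metadata or {},
    }


def handle_event(payload: dict, handlers: Dict[str, Callable[[dict], str]]) -> str:
    """Dispatch a callback payload to a handler keyed by its (assumed) 'event' field."""
    handler = handlers.get(payload.get("event"))
    if handler is None:
        return "ignored"  # event not subscribed to on the receiver side
    return handler(payload)


config = build_webhook_config(
    "https://example.com/hooks/agent",
    ["completed", "failed"],
    headers={"X-Webhook-Token": "replace-me"},
    metadata={"pipeline": "nightly"},
)
handlers = {
    "completed": lambda p: "store results",
    "failed": lambda p: "alert on-call",
}
print(config["events"])
print(handle_event({"event": "completed"}, handlers))
```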

Sitemap-only mode for precise, targeted crawling

You can now set sitemap: "only" to crawl exclusively the URLs listed in a site's sitemap.xml. The crawler skips HTML link discovery, giving you precise control over which pages are ingested. Ideal for docs-only RAG pipelines, targeted monitoring, or avoiding infinite crawls from faceted navigation.
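To show what "only the URLs listed in the sitemap" amounts to, the sketch below extracts the loc entries from a sample sitemap with Python's standard library and builds a crawl request body alongside it. The sitemap: "only" setting comes from the entry above; the "url" field, the sample XML, and the sitemap_urls helper are assumptions, and the parsing here only approximates what the crawler does internally.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Invented sample sitemap for the example.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/docs/api</loc></url>
</urlset>"""


def sitemap_urls(xml_text: str) -> list:
    """Return the <loc> URLs listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical crawl request body; only the sitemap value is from the entry above.
crawl_payload = {"url": "https://example.com", "sitemap": "only"}

print(crawl_payload)
for url in sitemap_urls(SAMPLE_SITEMAP):
    print(url)
```

Pages that are linked from the site's HTML but absent from the sitemap would not appear in this list, which is the point of the mode.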

Rust SDK v2 namespace with agent support

The Rust SDK now includes a v2 namespace with access to all v2 API endpoints: scrape, search, map, crawl, batch_scrape, and the new agent endpoint for prompt-driven web workflows. Use scrape_with_schema() or agent_with_schema() for typed extraction, and manage async jobs with start, status, errors, and cancel methods. Fully backwards-compatible with v1.