PersonaBox - We Run Your Product Updates

PDF parser modes: fast, auto, and OCR in JS and Python SDKs

The JS and Python SDKs now accept a mode parameter for PDF parsing: fast for text-layer extraction, ocr for scanned documents, and auto to detect the right method automatically. Pair mode with maxPages to control both extraction quality and scope per request.
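As a rough illustration, here is how mode and maxPages might be combined into one options object. The helper name build_pdf_parser_options, the "type" key, and the overall dictionary layout are assumptions made for this sketch, not the SDK's confirmed API; only the three mode values and the maxPages name come from the note above.

```python
from typing import Optional

# Mode values named in the release note.
VALID_MODES = {"fast", "auto", "ocr"}


def build_pdf_parser_options(mode: str = "auto", max_pages: Optional[int] = None) -> dict:
    """Build a hypothetical PDF parser options dict (layout is an assumption)."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    options = {"type": "pdf", "mode": mode}
    # Only include maxPages when the caller wants to cap the scope.
    if max_pages is not None:
        if max_pages < 1:
            raise ValueError("max_pages must be a positive integer")
        options["maxPages"] = max_pages
    return options


if __name__ == "__main__":
    # Text-layer extraction limited to the first 10 pages.
    print(build_pdf_parser_options("fast", max_pages=10))
```

Validating the mode on the client side is a design choice of this sketch: it surfaces typos before a request is sent.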

LLM-powered content cleaning with onlyCleanContent

The new onlyCleanContent boolean strips non-semantic page elements (nav menus, ads, cookie banners, sidebars, footers) from scraped markdown via LLM cleaning. Cleaning runs before JSON extraction and summary, so downstream steps get cleaner input. It is available in the V1 and V2 APIs and defaults to false.
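A minimal sketch of a request body that opts in to cleaning. The helper build_scrape_payload and the "url" field are hypothetical; only the onlyCleanContent name and its false default are taken from the note above.

```python
def build_scrape_payload(url: str, only_clean_content: bool = False) -> dict:
    """Build a hypothetical scrape request body (field layout is an assumption)."""
    return {
        "url": url,
        # Mirrors the documented default: cleaning is off unless requested.
        "onlyCleanContent": only_clean_content,
    }


if __name__ == "__main__":
    print(build_scrape_payload("https://example.com/post", only_clean_content=True))
```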

PDF parser modes with Rust-powered text extraction

The PDF parser now accepts a mode parameter: "fast" for Rust-powered text extraction, "auto" for smart OCR fallback, or "ocr" to force OCR processing. Responses include pdf_type, confidence, and page_count so you can route documents based on their actual characteristics.
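Routing on the response fields could look roughly like the sketch below. The pdf_type value "scanned", the thresholds, and the queue names are assumptions chosen for illustration; the note above only confirms that pdf_type, confidence, and page_count are present in responses.

```python
def route_pdf(meta: dict, min_confidence: float = 0.8, large_doc_pages: int = 100) -> str:
    """Pick a processing queue from PDF metadata (values and queues are hypothetical)."""
    pdf_type = meta.get("pdf_type")
    confidence = meta.get("confidence", 0.0)
    page_count = meta.get("page_count", 0)

    # Low-confidence classifications go to a human before any automated step.
    if confidence < min_confidence:
        return "manual_review"
    # "scanned" is an assumed pdf_type value; check the actual API response.
    if pdf_type == "scanned":
        return "ocr_queue"
    # Large documents are handled separately to keep the standard queue fast.
    if page_count > large_doc_pages:
        return "batch_queue"
    return "standard_queue"


if __name__ == "__main__":
    print(route_pdf({"pdf_type": "text", "confidence": 0.95, "page_count": 12}))
```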

Choose spark-1-pro or spark-1-mini for your agent jobs

You can now specify which model powers your agent jobs by passing model="spark-1-pro" or model="spark-1-mini" in your request. Use spark-1-mini for bulk, cost-efficient runs and spark-1-pro when output quality matters most. Available in the Python and TypeScript SDKs.
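One way to encode that guidance is a small selection helper. The function names and the "prompt" field are hypothetical; only the model parameter and the two model names come from the note above.

```python
def pick_agent_model(bulk: bool) -> str:
    """Return the cheaper model for bulk runs, the higher-quality one otherwise."""
    return "spark-1-mini" if bulk else "spark-1-pro"


def build_agent_request(prompt: str, bulk: bool = False) -> dict:
    """Build a hypothetical agent request body (the "prompt" key is an assumption)."""
    return {"prompt": prompt, "model": pick_agent_model(bulk)}


if __name__ == "__main__":
    print(build_agent_request("Collect pricing tiers from the vendor site", bulk=True))
```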

Webhook support for agent jobs in Node.js and Python SDKs

The Node.js and Python SDKs now support webhooks for agent jobs. Receive HTTP callbacks for started, action, completed, failed, and cancelled events instead of polling. Configure webhook URLs, custom headers, metadata, and event filtering directly in your start_agent or agent calls.
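On the receiving side, a handler might classify incoming callbacks as sketched below. The payload shape is an assumption: this sketch reads the event name from a "type" field, which may differ from the real callback body. The five event names are the ones listed above.

```python
import json

# Event names listed in the release note.
AGENT_EVENTS = {"started", "action", "completed", "failed", "cancelled"}
# Events after which no further callbacks are expected for the job (an assumption).
TERMINAL_EVENTS = {"completed", "failed", "cancelled"}


def handle_agent_webhook(raw_body: bytes) -> dict:
    """Parse a webhook body and report whether the job has finished.

    Assumes a JSON body with the event name under a hypothetical "type" key.
    """
    payload = json.loads(raw_body)
    event = payload.get("type")
    if event not in AGENT_EVENTS:
        # Unknown events are ignored rather than treated as errors.
        return {"event": event, "handled": False, "terminal": False}
    return {"event": event, "handled": True, "terminal": event in TERMINAL_EVENTS}


if __name__ == "__main__":
    print(handle_agent_webhook(b'{"type": "completed"}'))
```

Ignoring unrecognized events instead of raising keeps the receiver tolerant of event types added later.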

Sitemap-only mode for precise, targeted crawling

You can now set sitemap: "only" to crawl exclusively the URLs listed in a site's sitemap.xml. The crawler skips HTML link discovery, giving you precise control over which pages are ingested. Ideal for docs-only RAG pipelines, targeted monitoring, or avoiding infinite crawls from faceted navigation.
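A minimal sketch of a crawl request body with the new setting. The helper build_crawl_payload and the "url" field are hypothetical; only the sitemap key and its "only" value come from the note above.

```python
def build_crawl_payload(url: str, sitemap_only: bool = False) -> dict:
    """Build a hypothetical crawl request body (field layout is an assumption)."""
    payload = {"url": url}
    if sitemap_only:
        # Restrict the crawl to URLs listed in the site's sitemap.xml.
        payload["sitemap"] = "only"
    return payload


if __name__ == "__main__":
    print(build_crawl_payload("https://docs.example.com", sitemap_only=True))
```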

Rust SDK v2 namespace with agent support

The Rust SDK now includes a v2 namespace with access to all v2 API endpoints: scrape, search, map, crawl, batch_scrape, and the new agent endpoint for prompt-driven web workflows. Use scrape_with_schema() or agent_with_schema() for typed extraction, and manage async jobs with the start, status, errors, and cancel methods. The SDK remains fully backwards-compatible with v1.