Configurable CSS selectors for precise extraction of main content, titles, and code blocks
Automatic content categorization using keyword mapping and directory structures
Pattern-based URL filtering to include or exclude specific documentation paths
67 GitHub stars
Resumable progress tracking via a dedicated checkpoint system for large-scale scrapes
Agentic safety protocols including rate limiting, robots.txt compliance, and grounding checks
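
To make the URL filtering and categorization features concrete, here is a minimal Python sketch of how include/exclude patterns and a keyword map might work together. The `CONFIG` layout, its key names, and the functions `should_scrape` and `categorize` are assumptions made for illustration and do not reflect the tool's actual configuration schema.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse

# Hypothetical configuration: key names are illustrative, not the tool's schema.
CONFIG = {
    "include": ["/docs/*"],
    "exclude": ["/docs/changelog/*"],
    "categories": {
        "api": ["api", "reference"],
        "guides": ["tutorial", "guide", "getting-started"],
    },
}


def should_scrape(url, config=CONFIG):
    """Return True if the URL path matches an include pattern and no exclude pattern."""
    path = urlparse(url).path
    # Exclusions win over inclusions.
    if any(fnmatchcase(path, pattern) for pattern in config["exclude"]):
        return False
    return any(fnmatchcase(path, pattern) for pattern in config["include"])


def categorize(url, title, config=CONFIG):
    """Pick the first category whose keyword appears in the URL path or the title."""
    # The path carries the directory structure; the title carries page keywords.
    haystack = (urlparse(url).path + " " + title).lower()
    for category, keywords in config["categories"].items():
        if any(keyword in haystack for keyword in keywords):
            return category
    return "other"
```

With this configuration, a page under `/docs/api/` would be scraped and filed under `api`, while anything under `/docs/changelog/` would be skipped. Checking exclusions first is one reasonable design choice; a real scraper may resolve conflicts differently.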
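
Checkpointing, rate limiting, and robots.txt compliance can be sketched in the same spirit. The example below is an assumption-laden illustration, not the tool's implementation: the checkpoint file format, the `crawl` function, the `docs-scraper` user agent, and the fixed delay are all hypothetical. Only the standard library's `urllib.robotparser` is used for the robots.txt check, and grounding checks are not covered.

```python
import json
import os
import tempfile
import time
from urllib.robotparser import RobotFileParser


def load_checkpoint(path):
    """Return the set of already-scraped URLs, or an empty set if no checkpoint exists."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as fh:
        return set(json.load(fh)["done"])


def save_checkpoint(path, done):
    """Write the checkpoint to a temp file, then rename it over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"done": sorted(done)}, fh)
    os.replace(tmp_path, path)


def make_robots(robots_txt):
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawl(urls, fetch, robots, checkpoint_path, delay=0.5, agent="docs-scraper"):
    """Fetch each allowed, not-yet-done URL; return the URLs fetched in this run."""
    done = load_checkpoint(checkpoint_path)
    fetched = []
    for url in urls:
        # Skip pages finished in an earlier run or disallowed by robots.txt.
        if url in done or not robots.can_fetch(agent, url):
            continue
        fetch(url)
        fetched.append(url)
        done.add(url)
        save_checkpoint(checkpoint_path, done)  # persist progress after every page
        time.sleep(delay)  # fixed delay between requests as a simple rate limit
    return fetched
```

Saving after every page keeps an interrupted run cheap to resume, at the cost of one small file write per request. A larger scrape might batch the writes instead.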