WebHarvest is a self-hosted, open-source web scraper that converts web pages into clean, agent-friendly markdown, HTML, or structured JSON. It runs entirely on your local machine, which keeps scraped data private. Its anti-bot bypass auto-escalates through strategies to reach heavily protected sites. Beyond basic scraping, WebHarvest includes an LLM-driven autonomous browser agent that navigates pages, interacts with them, and extracts information from natural-language instructions. It is a free alternative to commercial services such as Firecrawl, with no cloud dependencies.
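To make the auto-escalation idea concrete, here is a minimal Python sketch of one way such a fallback chain could work: try the cheapest fetch strategy first and move to a heavier one only when the response looks blocked. This is not WebHarvest's actual code. The names `FetchResult`, `looks_blocked`, and `fetch_with_escalation`, the strategy labels, and the block-detection heuristic are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class FetchResult:
    """Hypothetical container for a fetch outcome."""
    status: int
    body: str


# A strategy is any callable that takes a URL and returns a FetchResult.
Strategy = Callable[[str], FetchResult]


def looks_blocked(result: FetchResult) -> bool:
    """Assumed heuristic: common block status codes or an empty body."""
    return result.status in (403, 429, 503) or not result.body.strip()


def fetch_with_escalation(
    url: str, strategies: Sequence[Tuple[str, Strategy]]
) -> Tuple[str, FetchResult]:
    """Try each (name, strategy) in order; return the first unblocked result."""
    attempts: List[str] = []
    for name, strategy in strategies:
        try:
            result = strategy(url)
        except Exception as exc:  # a failed strategy just triggers escalation
            attempts.append(f"{name}: error ({exc})")
            continue
        if not looks_blocked(result):
            return name, result
        attempts.append(f"{name}: blocked (HTTP {result.status})")
    raise RuntimeError(f"all strategies failed for {url}: {attempts}")


if __name__ == "__main__":
    # Stub strategies standing in for real fetchers (plain HTTP, headless browser).
    def plain_http(url: str) -> FetchResult:
        return FetchResult(403, "access denied")

    def headless_browser(url: str) -> FetchResult:
        return FetchResult(200, "<html>content</html>")

    used, page = fetch_with_escalation(
        "https://example.com",
        [("plain_http", plain_http), ("headless_browser", headless_browser)],
    )
    print(used, page.status)
```

Ordering the strategies from cheapest to most expensive means lightly protected pages never pay the cost of a full browser session; a real tool would likely use richer block detection than status codes alone.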
Key Features
01 Advanced anti-bot bypass with auto-escalation strategies
02 1 GitHub star
03 LLM-driven autonomous browser agent for task execution
04 BFS website crawling with depth limits and concurrency
05 Self-hosted web scraping (markdown, HTML, JSON conversion)
06 Natural language data extraction (no selectors needed)
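As an illustration of the "BFS website crawling with depth limits and concurrency" feature listed above, the following Python sketch shows the general technique: process the site one level at a time, cap how deep the crawl goes, and bound the number of simultaneous fetches with a semaphore. It is a sketch under stated assumptions, not WebHarvest's implementation. The function `bfs_crawl`, its parameters, and the `fetch_links` callback are hypothetical, and the demo uses an in-memory link graph instead of real HTTP requests.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, List

# Hypothetical callback: given a URL, return the URLs that page links to.
FetchLinks = Callable[[str], Awaitable[Iterable[str]]]


async def bfs_crawl(
    start: str,
    fetch_links: FetchLinks,
    max_depth: int = 2,
    concurrency: int = 4,
) -> Dict[str, int]:
    """Crawl breadth-first from `start`; map each discovered URL to its depth."""
    semaphore = asyncio.Semaphore(concurrency)  # caps in-flight fetches
    depths: Dict[str, int] = {start: 0}         # doubles as the visited set
    frontier: List[str] = [start]

    async def fetch(url: str) -> List[str]:
        async with semaphore:
            try:
                return list(await fetch_links(url))
            except Exception:
                return []  # a failed page contributes no links

    depth = 0
    while frontier and depth < max_depth:
        # Fetch the whole level concurrently, then build the next level.
        results = await asyncio.gather(*(fetch(url) for url in frontier))
        next_frontier: List[str] = []
        for links in results:
            for link in links:
                if link not in depths:
                    depths[link] = depth + 1
                    next_frontier.append(link)
        frontier = next_frontier
        depth += 1
    return depths


if __name__ == "__main__":
    # In-memory stand-in for a website's link structure.
    SITE = {"/": ["/a", "/b"], "/a": ["/c", "/"], "/b": ["/c"], "/c": ["/d"]}

    async def fake_fetch_links(url: str) -> List[str]:
        await asyncio.sleep(0)  # yield control, as a real request would
        return SITE.get(url, [])

    print(asyncio.run(bfs_crawl("/", fake_fetch_links, max_depth=2)))
```

With `max_depth=2`, pages at depth 2 are recorded but their links are not followed, so `/d` is never discovered in the demo graph. Level-by-level processing keeps the depth bookkeeping simple; a queue with worker tasks would be an alternative when levels are very uneven in size.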