Does this skill support OCR for scanned documents?

Yes. The skill supports OCR through LlamaParse and Unstructured.io (with Tesseract), so it can process scanned images and PDFs that have no selectable text.

Can I use this for local document processing without an API key?

Yes. Backends such as Unstructured.io, PyPDF2, and PDFPlumber run on your own machine and do not require external API calls or keys.

What file formats are supported?

The skill currently supports PDF, DOCX, HTML, Markdown, and plain text files, with automatic format detection in the universal parser template.
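
As a rough illustration of what automatic format detection can look like, here is a minimal Python sketch that maps a file extension to one of the supported formats. The names `EXTENSION_TO_FORMAT` and `detect_format` are hypothetical and are not taken from the skill's universal parser template, which may detect formats differently (for example, by inspecting file contents).

```python
from pathlib import Path

# Hypothetical mapping from file extension to a format label.
# The skill's own template may use different names or rules.
EXTENSION_TO_FORMAT = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".html": "html",
    ".htm": "html",
    ".md": "markdown",
    ".markdown": "markdown",
    ".txt": "text",
}


def detect_format(path: str) -> str:
    """Return a format label for a path, based only on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTENSION_TO_FORMAT:
        raise ValueError(f"Unsupported file type: {suffix or '(no extension)'}")
    return EXTENSION_TO_FORMAT[suffix]
```

A dispatcher could call `detect_format("report.pdf")` first and then hand the file to whichever backend handles that format.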

Which parser should I use for complex PDF tables?

LlamaParse and PDFPlumber are both recommended for high-fidelity table extraction. LlamaParse uses AI to understand page layout, while PDFPlumber gives you coordinate-level control over what is extracted.
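
For the PDFPlumber route, a minimal sketch might look like the following. It assumes the `pdfplumber` package is installed and that the first row of each table is a header row. The helper names `table_to_records` and `extract_table_records` are made up for this example and are not part of the skill.

```python
from typing import Dict, List, Optional

# pdfplumber returns each table as a list of rows; cells may be None.
Table = List[List[Optional[str]]]


def table_to_records(table: Table) -> List[Dict[str, str]]:
    """Treat the first row as the header and turn each later row into a dict."""
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


def extract_table_records(pdf_path: str) -> List[Dict[str, str]]:
    """Collect records from every table pdfplumber finds in the PDF."""
    # Third-party dependency, imported here so table_to_records works without it.
    import pdfplumber

    records: List[Dict[str, str]] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                records.extend(table_to_records(table))
    return records
```

Real reports often have merged cells or multi-row headers, so the header assumption here would likely need adjusting per document.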

How does this skill help with RAG pipelines?

It provides specialized templates for parsing documents into manageable chunks with relevant metadata, making them ready for embedding and vector database ingestion.
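
To make the idea of chunks with metadata concrete, here is a minimal Python sketch of fixed-size character chunking with overlap. The function name `chunk_text`, the default sizes, and the metadata fields are assumptions for illustration; the skill's templates may chunk by tokens, sentences, or document structure instead.

```python
from typing import Dict, List


def chunk_text(text: str, source: str, chunk_size: int = 500, overlap: int = 50) -> List[Dict]:
    """Split text into overlapping character chunks, each tagged with metadata."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and overlap must be in [0, chunk_size)")

    chunks: List[Dict] = []
    step = chunk_size - overlap  # how far the window advances each time
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append({
            "text": text[start:start + chunk_size],
            "metadata": {"source": source, "chunk_index": index, "start_char": start},
        })
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each returned dictionary can then be passed to an embedding model, with the `metadata` stored alongside the vector so search results can point back to the source document.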

Document Parsing & Extraction

Name: Document Parsing & Extraction
Author: vanman2024

Category: Data Science & ML

Extracts text, tables, and metadata from PDF, DOCX, and HTML documents to power RAG pipelines and data processing workflows.

This skill provides a set of tools for processing diverse document formats, so that Claude can parse complex layouts, extract structured data, and prepare content for AI applications. It integrates LlamaParse for AI-powered extraction, Unstructured.io for local processing, and PDFPlumber for high-fidelity table extraction. Whether you are building a Retrieval-Augmented Generation (RAG) system, processing legal contracts, or analyzing research papers, the skill turns unstructured documents into clean, usable data with little manual effort.

Key Features

01 Multi-format support for PDF, DOCX, HTML, and Markdown extraction

02 Advanced table extraction using PDFPlumber and AI-powered LlamaParse

03 Automated document chunking and metadata extraction for RAG pipelines

04 Support for multiple backends, allowing both local and cloud-based processing

05 Built-in OCR capabilities for scanned documents and complex layouts

Use Cases

01 Batch converting legacy Word and HTML documents into clean Markdown for documentation

02 Building automated RAG pipelines for AI-powered search and Q&A systems

03 Extracting structured financial or legal data from complex PDF reports and contracts

Install with 🐟 Skill.Fish

npx skillfish add vanman2024/ai-dev-marketplace document-parsers
