How does it handle large volumes of papers?

The skill includes configurable guardrails such as --max-papers, --max-pages, and --min-chars to help you manage processing time, storage, and API costs.
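
As a rough illustration of how limits like these can bound a run, here is a minimal Python sketch. It is not the skill's actual code: the `Guardrails` class, the `apply_guardrails` function, the default values, and the exact semantics of each limit (cap the papers attempted, cap the pages read per paper, drop texts shorter than the minimum) are assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import islice


@dataclass
class Guardrails:
    # Hypothetical defaults; the skill's real defaults are not stated in the source.
    max_papers: int = 20
    max_pages: int = 10
    min_chars: int = 500


def apply_guardrails(papers, limits):
    """Return {paper_id: text} for papers that pass the limits.

    `papers` maps a paper ID to a list of per-page strings.
    """
    kept = {}
    # Assumed meaning of max_papers: only the first N papers are attempted.
    for paper_id, pages in islice(papers.items(), limits.max_papers):
        # Assumed meaning of max_pages: only the first N pages are read.
        text = "\n".join(pages[: limits.max_pages])
        # Assumed meaning of min_chars: very short extractions are discarded.
        if len(text) < limits.min_chars:
            continue
        kept[paper_id] = text
    return kept
```

Tighter limits mean fewer pages parsed and less text stored, which is where the savings in time, disk space, and downstream API cost would come from.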

Where is the extracted text and metadata stored?

Extracted text is saved as .txt files in papers/fulltext/, original PDFs are cached in papers/pdfs/, and a record of all attempts is kept in papers/fulltext_index.jsonl.
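
Because the index is a JSONL file (one JSON object per line), it can be inspected with a few lines of Python. The sketch below counts records by outcome. The `status` field name is an assumption, since the source does not document the record schema, and `summarize_index` is a hypothetical helper, not part of the skill.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_index(index_path="papers/fulltext_index.jsonl"):
    """Count index records by their (assumed) "status" field."""
    counts = Counter()
    path = Path(index_path)
    if not path.exists():
        return counts  # nothing has been processed yet
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # "status" is a guessed field name; adjust to the real schema.
            counts[record.get("status", "unknown")] += 1
    return counts
```

Calling `summarize_index()` from the project root would return a `Counter` keyed by whatever status values the records contain.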

Does this skill require an internet connection?

Yes, by default: it needs an internet connection to download PDFs from the web. It also supports a 'local-only' mode, in which it processes PDFs that you have placed in the papers/pdfs/ directory yourself.

Can I prevent the skill from re-extracting text I already have?

Yes, the skill is conservative by design and will not overwrite existing extracted text unless you manually delete the corresponding .txt file.
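
The skip-if-present behaviour described above can be pictured as a simple existence check. This is a sketch of the idea only: the `<paper_id>.txt` naming scheme and the `needs_extraction` helper are assumptions, not the skill's documented interface.

```python
from pathlib import Path


def needs_extraction(paper_id, fulltext_dir="papers/fulltext"):
    """Return True only when no extracted text exists for this paper.

    Assumes one file per paper named <paper_id>.txt (hypothetical scheme).
    """
    txt_path = Path(fulltext_dir) / f"{paper_id}.txt"
    return not txt_path.exists()
```

Under this model, deleting a paper's .txt file is what makes `needs_extraction` return True again, which matches the manual re-extraction step described in the answer.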

PDF Text Extractor

Author: WILLOSCAR

Web Scraping & Data Collection

Automates the downloading and text extraction of academic PDFs to provide high-fidelity evidence for research pipelines.

The PDF Text Extractor skill is designed for research-heavy workflows where abstracts alone are not enough for deep analysis. It selectively downloads academic papers from URLs or arXiv IDs, caches them locally, and extracts clean text to support claim verification and evidence-based writing. Configurable evidence modes and local PDF support let researchers trade resource consumption against how exhaustively they gather evidence, and a structured JSONL index of every processed document lets later pipeline steps track what was extracted.
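
To make the "URLs or arXiv IDs" input concrete, the sketch below normalizes either kind of input into a PDF URL. It is an illustration under stated assumptions: `to_pdf_url` is a hypothetical helper, only new-style arXiv identifiers (such as 2101.00001 or 2101.00001v2) are recognized, and the skill itself may resolve inputs differently.

```python
import re

# New-style arXiv identifiers: YYMM.NNNNN with an optional version suffix.
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def to_pdf_url(source):
    """Map a URL or a new-style arXiv ID to a PDF URL (hypothetical helper)."""
    source = source.strip()
    if source.startswith(("http://", "https://")):
        return source  # already a URL; pass it through unchanged
    if ARXIV_ID.match(source):
        return f"https://arxiv.org/pdf/{source}"
    raise ValueError(f"Not a URL or a recognized arXiv ID: {source!r}")
```

Raising on unrecognized input, rather than guessing, keeps bad identifiers from turning into failed downloads later in the pipeline.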

Key Features

01 Structured JSONL indexing for automated status tracking and stats

02 Configurable evidence modes (Abstract vs. Full-text) to manage resources

03 Clean plain-text extraction with customizable page and character limits

04 89 GitHub stars

05 Local PDF caching and 'Local-only' processing for restricted networks

06 Automated PDF downloading from URLs and arXiv IDs

Use Cases

01 Verifying scientific claims by extracting specific evidence snippets from paper bodies

02 Building a full-text searchable database for systematic literature reviews

03 Automating the collection of local research libraries for AI-assisted data analysis

Install with 🐟 Skill.Fish

npx skillfish add willoscar/research-units-pipeline-skills pdf-text-extractor
