What is the format of the output data?

It generates a collection summary in Markdown, a full structured JSON file for all metadata, and a JSONL file for efficient training data loading.
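As a rough illustration of why JSONL suits training data loading, the sketch below writes one JSON object per line and reads the records back one at a time. The field names (`source`, `feedback`, `revision`) and the file name are assumptions made for this example, not the skill's documented schema.

```python
import json
import os
import tempfile

# Hypothetical records; the skill's real field names may differ.
examples = [
    {
        "source": "The results was significant.",
        "feedback": "Fix subject-verb agreement.",
        "revision": "The results were significant.",
    },
    {
        "source": "We done the test.",
        "feedback": "Use the past tense 'did'.",
        "revision": "We did the test.",
    },
]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Yield records one at a time so large files need not fit in memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Assumed file name, written to a temporary directory for the demo.
path = os.path.join(tempfile.mkdtemp(), "training_examples.jsonl")
write_jsonl(path, examples)
loaded = list(read_jsonl(path))
```

Because each line is an independent JSON document, a training loop can stream the file instead of parsing one large array up front.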

What file formats does the Training Set Builder support?

The skill supports PDF, Word (DOCX), Markdown, Plain Text, and various source code files, maintaining structural references throughout the extraction process.

Can it filter out low-quality feedback?

Yes, the skill is designed to identify and flag generic comments (like 'Good work') or procedural notes that lack the substance needed for a high-quality training example.
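One simple way such flagging could work is a phrase list combined with a minimum length check. The sketch below is only an assumption about the approach: the phrase list, the `MIN_WORDS` threshold, and the function name `is_low_quality` are hypothetical and do not come from the skill itself.

```python
import re

# Hypothetical generic phrases; a real filter would be tuned to the corpus.
GENERIC_PHRASES = {"good work", "well done", "nice", "ok", "looks good", "great job"}
MIN_WORDS = 4  # assumed threshold for "too short to be substantive"


def is_low_quality(comment: str) -> bool:
    """Flag comments that are generic phrases or too short to carry substance."""
    # Lowercase, drop punctuation and digits, collapse whitespace.
    normalized = re.sub(r"[^a-z\s]", "", comment.lower())
    normalized = re.sub(r"\s+", " ", normalized).strip()
    if normalized in GENERIC_PHRASES:
        return True
    return len(normalized.split()) < MIN_WORDS


comments = [
    "Good work!",
    "Define 'platform governance' before using it in the argument.",
    "OK",
]
flagged = [c for c in comments if is_low_quality(c)]
```

Flagging rather than deleting keeps the decision reviewable, which matches the description above: weak examples are identified and marked, not silently dropped.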

How does the skill handle feedback and revisions?

It maps comments or interventions to the specific source text they reference and captures the resulting revision to create a 'before and after' pattern for LLM learning.
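The 'before and after' pattern can be pictured as a small record that ties a comment to the exact span of source text it refers to, along with the revised wording. The following is a minimal sketch under stated assumptions: the `TrainingExample` class, its fields, and the exact-match lookup in `build_example` are hypothetical and stand in for whatever matching the skill actually performs.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TrainingExample:
    """Hypothetical record linking feedback to a source span and its revision."""
    source: str    # the original text the comment refers to
    feedback: str  # the reviewer's comment
    revision: str  # the text after the comment was addressed
    start: int     # character offset of the span in the document
    end: int       # offset one past the end of the span


def build_example(document: str, quoted: str, feedback: str,
                  revision: str) -> Optional[TrainingExample]:
    """Locate the quoted text in the document and pair it with its revision.

    Returns None when the quoted text cannot be found, so that unanchored
    comments can be flagged instead of guessed at.
    """
    start = document.find(quoted)
    if start == -1:
        return None
    return TrainingExample(quoted, feedback, revision, start, start + len(quoted))


document = "The data shows a trend. It are clear that more work is needed."
example = build_example(document, "It are clear", "Fix verb agreement.", "It is clear")
record = asdict(example) if example is not None else None
```

Storing the offsets alongside the text keeps the structural reference back to the source document, so an example can be traced to where it came from.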

Training Set Builder

Name: Training Set Builder
Author: nicsuzor


Data Science & ML

Extracts structured training examples from document sets to create high-quality datasets for teaching LLMs specific tasks or styles.

The Training Set Builder skill automates the process of transforming document collections into structured training data for Large Language Models. By analyzing original text, feedback annotations, and revised versions, it captures the underlying patterns of improvement and pedagogical judgment. It supports a wide range of formats, including PDFs, Word documents, and source code, and outputs machine-ready JSON and JSONL files. This skill is aimed at AI researchers and developers who need to build domain-specific datasets from existing review workflows, academic feedback, or revision histories.

Key Features

01 Supports PDF, DOCX, Markdown, and source code extraction

02 0 GitHub stars

03 Generates structured JSON and JSONL datasets for model fine-tuning

04 Identifies and flags ambiguous or low-quality training examples

05 Automatically categorizes feedback into structural, substantive, and stylistic types

06 Extracts source-feedback-revision-context patterns from multiple documents

Use Cases

01 Building academic writing datasets from peer review comments and manuscript revisions

02 Extracting institutional style guides from annotated policy and legal documents

03 Creating code review training sets from pull request histories and diffs


Install with 🐟 Skill.Fish

npx skillfish add nicsuzor/academicops training-set-builder
