Can I use this to prepare datasets for machine learning?

Yes, it includes built-in capabilities for merging disparate sources, deduplication, shuffling, and splitting data into stratified train/test/validation sets.
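As a rough illustration of what deduplication followed by a seeded, stratified split involves, here is a minimal standard-library sketch. The function names (`dedupe`, `stratified_split`), the record layout, and the 80/10/10 ratios are assumptions made for this example, not the skill's actual implementation.

```python
import random
from collections import defaultdict


def dedupe(records, key):
    """Keep the first record seen for each value of `key` (hypothetical helper)."""
    seen = set()
    unique = []
    for rec in records:
        if rec[key] not in seen:
            seen.add(rec[key])
            unique.append(rec)
    return unique


def stratified_split(records, label_key, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle each label group with a fixed seed, then cut it by `ratios`.

    Returns (train, validation, test). Splitting per label keeps the class
    proportions roughly equal across the three sets.
    """
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)

    train, val, test = [], [], []
    for label in sorted(by_label):  # sorted for a stable iteration order
        group = by_label[label]
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Two overlapping sources merged, deduplicated by id, then split.
source_a = [{"id": i, "label": "pos" if i % 2 else "neg"} for i in range(20)]
source_b = [{"id": i, "label": "pos" if i % 2 else "neg"} for i in range(15, 20)]
unique = dedupe(source_a + source_b, "id")
train, val, test = stratified_split(unique, "label")
```

Because the shuffle uses a seeded `random.Random` instance, calling `stratified_split` again on the same input yields the same three sets.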

How does the skill handle database and API credentials?

It uses environment variable placeholders in the configuration files (e.g., ${DB_PASSWORD}), so credentials are read from the environment at run time rather than stored in plaintext in the configuration.
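The placeholder mechanism can be pictured with a short sketch like the one below. It assumes the `${NAME}` syntax shown above; the function name `resolve_placeholders` and the connection string are hypothetical, and the skill's own resolution logic may differ.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_placeholders(value, env=None):
    """Replace each ${NAME} in `value` with the matching environment variable.

    Raises KeyError for an unset variable instead of silently leaving the
    placeholder in place, so a missing secret fails loudly.
    """
    env = os.environ if env is None else env

    def substitute(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(substitute, value)


# Example with an explicit mapping standing in for os.environ.
url = resolve_placeholders(
    "postgresql://user:${DB_PASSWORD}@host/db",
    {"DB_PASSWORD": "s3cret"},
)
```

Failing on an unset variable is a design choice for this sketch: `os.path.expandvars` would instead leave the placeholder text unchanged, which can hide a misconfiguration until connection time.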

What data sources can I connect to with this skill?

The skill supports a wide range of sources including SQL databases (PostgreSQL, MySQL, SQLite), REST/GraphQL APIs, web scraping (Playwright/Requests), LLM-generated synthetic data, and local file imports (CSV/JSON/JSONL).
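For the local file case, importing CSV and JSONL records needs little more than the standard library. The sketch below is only an illustration under assumed names (`read_csv_records`, `read_jsonl_records`); it is not taken from the skill's generated scripts.

```python
import csv
import io
import json


def read_csv_records(stream):
    """Read a CSV text stream into a list of dicts keyed by the header row."""
    return list(csv.DictReader(stream))


def read_jsonl_records(stream):
    """Read a JSONL text stream (one JSON object per line), skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]


# In-memory streams stand in for files opened with open(path, newline="").
csv_rows = read_csv_records(io.StringIO("id,text\n1,hello\n"))
jsonl_rows = read_jsonl_records(io.StringIO('{"id": 2, "text": "world"}\n\n'))
```

Note that `csv.DictReader` returns every value as a string, so numeric columns from CSV need an explicit conversion before being merged with JSONL records.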

How do I ensure the data remains consistent across iterations?

The skill generates a 'data_version.yaml' and a regeneration script that uses fixed random seeds and configuration hashes to ensure that the data pipeline produces identical results every time it is run.
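One common way to combine a configuration hash with a fixed seed is sketched below. The helper names and the sample configuration keys are assumptions for illustration; the actual contents of data_version.yaml and the regeneration script are not described here.

```python
import hashlib
import json
import random


def config_hash(config):
    """Hash a configuration dict so that key order does not affect the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def shuffled_ids(ids, seed):
    """Return a shuffled copy of `ids` that is identical for a given seed."""
    rng = random.Random(seed)
    out = list(ids)
    rng.shuffle(out)
    return out


# Hypothetical configuration; any change to it produces a different hash.
config = {"source": "local_csv", "seed": 42, "split": [0.8, 0.1, 0.1]}
fingerprint = config_hash(config)
order = shuffled_ids(range(10), config["seed"])
```

Storing the fingerprint next to the generated data makes it possible to detect when a dataset was produced from a different configuration than the one currently checked in.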

Data Pipeline Configurator

Name: Data Pipeline Configurator
Author: p988744

Data Science & ML

Automates the configuration and management of reproducible data collection pipelines from databases, APIs, web scraping, and LLM generation.

This skill provides a standardized framework for building robust data pipelines within Claude Code, focusing on data reproducibility and traceability for machine learning workflows. It enables users to define multiple data sources—including SQL databases, REST APIs, web scrapers, and synthetic data generators—within a unified YAML configuration. By automating data fetching, merging, validation, and splitting, the skill ensures that training datasets can be reconstructed identically during model iteration, while maintaining security through environment variable management.

Key Features

01 Standardized data_source.yaml configuration schema

02 Automated Python script generation for data regeneration

03 Reproducibility tracking with versioning and random seed control

04 Multi-source integration for SQL, APIs, web scraping, and LLMs

05 Built-in data validation, deduplication, and quality reporting

Use Cases

01 Generating and validating synthetic training data using LLMs

02 Setting up reproducible web scraping pipelines for market research

03 Merging legacy database records with external API data for ML preprocessing

Install with 🐟 Skill.Fish

npx skillfish add p988744/nlp-skills data-pipeline
