What schema formats does the skill support?

The skill validates both documents and queries against JSON Schema definitions (v2.0), covering metadata, IDs, and content structures.
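As a rough illustration of what a document check might look like, here is a minimal sketch using only the Python standard library. The field names (`id`, `title`, `source_url`, `sections`) and the function `validate_document` are assumptions made for this example, not the skill's actual schema or API.

```python
from typing import Any

# Hypothetical required fields and their expected types.
# The skill's real document schema may define different fields.
DOCUMENT_FIELDS = {"id": str, "title": str, "source_url": str, "sections": list}


def validate_document(doc: dict[str, Any]) -> list[str]:
    """Return a list of schema errors; an empty list means the document passed."""
    errors = []
    for field, expected in DOCUMENT_FIELDS.items():
        if field not in doc:
            errors.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected):
            actual = type(doc[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    return errors
```

Returning a list of errors rather than raising on the first failure lets a reviewer see every problem in a submission at once.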

How does this skill help with duplicate detection?

It applies similarity thresholds to identify and block content that is too similar to existing entries, so that redundant data does not skew evaluation results.
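The threshold idea can be sketched as follows. This example substitutes a simple lexical comparison (`difflib.SequenceMatcher` from the standard library) for whatever similarity measure the skill actually uses; the function name `find_near_duplicates` and the 0.85 default threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher


def find_near_duplicates(new_text, existing, threshold=0.85):
    """Return (entry_id, score) pairs for existing entries at or above the threshold.

    `existing` maps entry IDs to their text. The comparison here is lexical,
    a stand-in for a semantic (embedding-based) similarity score.
    """
    matches = []
    for entry_id, text in existing.items():
        score = SequenceMatcher(None, new_text.lower(), text.lower()).ratio()
        if score >= threshold:
            matches.append((entry_id, round(score, 3)))
    # Highest-scoring matches first.
    return sorted(matches, key=lambda m: m[1], reverse=True)
```

A submission would be blocked when the returned list is non-empty; lowering the threshold catches looser paraphrases at the cost of more false positives.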

What is a golden dataset in AI development?

A golden dataset is a curated, high-quality set of ground-truth data (documents and queries) used to measure the accuracy and performance of AI models and RAG systems.

Can I use this for RAG system testing?

Yes, it is specifically designed to validate the datasets used for Retrieval-Augmented Generation (RAG) by ensuring queries correctly reference valid document sections.
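A reference check of this kind might look like the sketch below. The record layout (`sections`, `expected_sections`, `doc_id`, `section_id`) and the function name `find_broken_references` are hypothetical, chosen only to show the idea.

```python
def find_broken_references(documents, queries):
    """Return (query_id, doc_id, section_id) for every reference with no matching section."""
    # Build the set of (document, section) pairs that actually exist.
    valid = {
        (doc["id"], section["id"])
        for doc in documents
        for section in doc["sections"]
    }
    broken = []
    for query in queries:
        for ref in query["expected_sections"]:
            key = (ref["doc_id"], ref["section_id"])
            if key not in valid:
                broken.append((query["id"], ref["doc_id"], ref["section_id"]))
    return broken
```

An empty result means every query points at a section that exists in the dataset.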

Does it check for query difficulty levels?

Yes, it includes validation rules to ensure a balanced distribution of query difficulties, ranging from trivial to adversarial, for more robust model testing.
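One way to picture a distribution check is sketched below. Only "trivial" and "adversarial" are named above; the intermediate level names, the 10% minimum share, and the function `underrepresented_levels` are assumptions made for this example.

```python
from collections import Counter

# Hypothetical difficulty scale; only the two endpoints come from the description.
DIFFICULTY_LEVELS = ("trivial", "easy", "medium", "hard", "adversarial")


def underrepresented_levels(queries, min_share=0.1):
    """Return the difficulty levels whose share of queries falls below min_share."""
    counts = Counter(query["difficulty"] for query in queries)
    total = sum(counts.values())
    if total == 0:
        # With no queries at all, every level is missing.
        return list(DIFFICULTY_LEVELS)
    return [level for level in DIFFICULTY_LEVELS if counts[level] / total < min_share]
```

The returned list tells a dataset maintainer which difficulty levels need more queries before the set can be considered balanced.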

Golden Dataset Validator

Name: Golden Dataset Validator
Author: yonatangross

Data Science & ML

Ensures the integrity and quality of AI evaluation datasets through automated schema validation, duplicate detection, and coverage analysis.

About

This skill is designed for data engineers and AI developers who maintain 'golden' ground-truth datasets used for RAG evaluation or model benchmarking. It provides a robust framework for validating document and query schemas, detecting near-duplicate content through similarity thresholds, and ensuring comprehensive coverage across domains and difficulty levels. By automating these integrity checks, the skill helps maintain high-fidelity datasets that produce reliable, reproducible performance metrics for AI systems.

Key Features

29 GitHub stars
Automated JSON schema validation for documents and query pairs
Referential integrity checking for document-to-section mappings
Pre-commit validation patterns to maintain dataset quality
Semantic duplicate detection with configurable similarity thresholds
Comprehensive coverage analysis for domains and content types

Use Cases

Auditing existing datasets to identify gaps in query difficulty distribution
Preventing data corruption by ensuring unique identifiers and valid canonical URLs
Validating new document submissions before inclusion in a RAG evaluation suite
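The identifier and URL checks mentioned in the use cases above could be sketched like this. The field names `id` and `source_url` and the function `find_id_and_url_problems` are hypothetical; the URL test only confirms an http(s) scheme and a host, which is a deliberately loose assumption about what "valid" means here.

```python
from collections import Counter
from urllib.parse import urlparse


def find_id_and_url_problems(documents):
    """Return a list of problems: duplicated IDs and URLs that lack a scheme or host."""
    problems = []
    id_counts = Counter(doc["id"] for doc in documents)
    for doc_id, count in id_counts.items():
        if count > 1:
            problems.append(f"duplicate id: {doc_id} ({count} occurrences)")
    for doc in documents:
        parsed = urlparse(doc["source_url"])
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"invalid url for {doc['id']}: {doc['source_url']}")
    return problems
```

Running a check like this before a commit is one way to catch a copied record or a malformed link before it reaches the evaluation suite.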
