Agent Evaluation: Claude Code Skill for AI Benchmarking