About
This skill integrates Google Gemini's multimodal capabilities into your workflow for analyzing and generating media. It provides a unified interface for transcribing audio up to 9.5 hours long, analyzing video up to 6 hours long, and processing PDFs with native vision for complex documents. It also includes the tools and scripts needed to detect objects in images, extract structured data from forms, and generate high-fidelity images from text prompts, including tasks that rely on very large context windows.