Does this skill require a Google Cloud or AI Studio API key?

Yes, you must provide a GEMINI_API_KEY from Google AI Studio or configure Vertex AI credentials to enable the multimodal capabilities.
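As a rough illustration of the two credential paths, the sketch below decides which backend is configured from environment variables. The helper name `resolve_backend` is hypothetical, and the Vertex AI variable names (`GOOGLE_GENAI_USE_VERTEXAI`, `GOOGLE_CLOUD_PROJECT`) are an assumption borrowed from the Google Gen AI SDK; the skill's own lookup logic may differ.

```python
import os
from typing import Mapping, Optional


def resolve_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "ai_studio" or "vertex_ai" depending on which credentials are set.

    Hypothetical helper: the skill may resolve credentials differently.
    """
    env = os.environ if env is None else env

    # An AI Studio key takes priority in this sketch.
    if env.get("GEMINI_API_KEY"):
        return "ai_studio"

    # Assumed Vertex AI configuration, following Google Gen AI SDK conventions.
    use_vertex = env.get("GOOGLE_GENAI_USE_VERTEXAI", "").lower() in ("1", "true")
    if use_vertex and env.get("GOOGLE_CLOUD_PROJECT"):
        return "vertex_ai"

    raise RuntimeError(
        "Set GEMINI_API_KEY or configure Vertex AI credentials before using the skill."
    )
```

Passing a plain dictionary instead of relying on `os.environ` keeps the check easy to exercise without touching the real environment.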

Can this skill generate images from text prompts?

Yes, it utilizes the gemini-2.5-flash-image model to generate, edit, and compose images with support for various aspect ratios.
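To make the image-generation path concrete, here is a minimal sketch that builds a `generateContent` request for the model named above. It only assembles the URL and JSON body and does not send anything. The function name `build_image_request` is hypothetical, and placing the aspect ratio under `generationConfig.imageConfig.aspectRatio` is an assumption about the Gemini REST API rather than something this listing confirms.

```python
API_ROOT = "https://generativelanguage.googleapis.com/v1beta"


def build_image_request(prompt: str, aspect_ratio: str = "1:1",
                        model: str = "gemini-2.5-flash-image"):
    """Return (url, body) for a text-to-image generateContent call.

    Hypothetical helper; nothing is sent over the network here.
    """
    url = f"{API_ROOT}/models/{model}:generateContent"
    body = {
        "contents": [{"parts": [{"text": prompt}]}],
        # Assumption: the aspect ratio is passed via generationConfig.imageConfig.
        "generationConfig": {"imageConfig": {"aspectRatio": aspect_ratio}},
    }
    return url, body
```

The body would then be sent as a POST with the API key in the `x-goog-api-key` header; generated images typically come back as base64-encoded inline data in the response parts.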

How does it handle multi-page PDF documents?

It uses native vision processing for PDFs up to 1,000 pages, allowing for complex table extraction, chart analysis, and format conversion.
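One way a PDF could be handed to the model is as base64-encoded inline data next to a text instruction. The sketch below builds those request parts. `build_pdf_parts` is a hypothetical helper, and whether the skill sends PDFs inline or uploads them first is not stated in this listing; large files generally need an upload step rather than inline data.

```python
import base64


def build_pdf_parts(pdf_bytes: bytes, instruction: str) -> list:
    """Return the `parts` list for a generateContent request carrying one PDF.

    Hypothetical helper; suited to small files only.
    """
    return [
        {
            "inline_data": {
                "mime_type": "application/pdf",
                # The API expects the raw bytes as a base64 string.
                "data": base64.b64encode(pdf_bytes).decode("ascii"),
            }
        },
        {"text": instruction},
    ]
```

The returned list would sit under `contents[0]["parts"]` in the request body.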

Which Gemini models are supported?

It supports the Gemini 2.5 and 2.0 series, including Pro, Flash, and Flash-Lite models, depending on the specific task requirements.

What is the maximum video length supported for analysis?

The skill can process up to 6 hours of low-resolution video or approximately 2 hours at default resolution using Gemini 2.5 models.
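These limits follow from simple token arithmetic. The sketch below assumes the 2M-token context window mentioned in the skill description and approximate rates of about 300 tokens per second of video at default resolution and about 100 at low resolution; both rates are ballpark figures, not guarantees, and the function name is hypothetical.

```python
# Approximate token cost per second of video (assumed figures).
DEFAULT_TOKENS_PER_SECOND = 300
LOW_RES_TOKENS_PER_SECOND = 100


def max_video_hours(context_tokens: int, tokens_per_second: int) -> float:
    """Hours of video that fit in a context window at a given token rate."""
    return context_tokens / tokens_per_second / 3600


default_hours = max_video_hours(2_000_000, DEFAULT_TOKENS_PER_SECOND)
low_res_hours = max_video_hours(2_000_000, LOW_RES_TOKENS_PER_SECOND)
print(f"default: {default_hours:.2f} h, low resolution: {low_res_hours:.2f} h")
```

Under these assumptions the estimate comes out a little under 2 hours at default resolution and a little under 6 hours at low resolution, in line with the figures quoted above. A smaller context window shrinks both proportionally.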

AI Multimodal Processing

Author: mrgoonie
GitHub stars: 1,395
Category: Data Science & ML

Processes, analyzes, and generates audio, video, image, and document content using Google Gemini's advanced multimodal API.

This skill lets Claude work with multimedia assets through the Google Gemini API. It provides a unified interface for transcribing long-form audio, performing scene-level video analysis, extracting structured data from multi-page PDFs, and generating images from text prompts. With context windows of up to 2M tokens (model-dependent), developers can build features that depend on understanding non-textual data, or create visual assets, directly within their coding workflow.

Key Features

01 Precise document extraction from complex PDFs, including tables, charts, and diagrams.

02 Audio transcription and speaker identification for files up to 9.5 hours long.

03 Video analysis, including scene detection and temporal Q&A, for clips up to 6 hours long at low resolution.

04 Visual understanding with object detection, pixel-level segmentation, and OCR.

05 Text-to-image generation with controllable aspect ratios and styles.


Use Cases

01 Automating content summaries and timestamped transcriptions for long-form video and audio files.

02 Extracting structured JSON data from complex multi-page business documents and financial reports.

03 Generating and refining visual assets or UI components directly from text prompts within a project.
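For the structured-extraction use case, one possible approach is to ask the model for JSON that matches a schema and then parse the text of the first candidate. The sketch below assumes the `responseMimeType` and `responseSchema` fields of the Gemini API's generation config. The invoice schema, the helper names, and the sample response are invented for illustration and are not taken from the skill.

```python
import json

# Hypothetical schema for an invoice; field names are illustrative only.
INVOICE_SCHEMA = {
    "type": "OBJECT",
    "properties": {
        "vendor": {"type": "STRING"},
        "total": {"type": "NUMBER"},
    },
    "required": ["vendor", "total"],
}


def json_generation_config(schema: dict) -> dict:
    """Generation config that asks for JSON output shaped by `schema`."""
    return {"responseMimeType": "application/json", "responseSchema": schema}


def parse_structured_response(response: dict) -> dict:
    """Decode the JSON text carried by the first candidate of a response."""
    text = response["candidates"][0]["content"]["parts"][0]["text"]
    return json.loads(text)


# Made-up response, shaped like a generateContent reply, for demonstration.
sample = {"candidates": [{"content": {"parts": [
    {"text": '{"vendor": "Acme", "total": 120.5}'}
]}}]}
print(parse_structured_response(sample)["vendor"])
```

In practice the parsed result is still worth validating, since a schema constrains the shape of the output but not its correctness.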


Install with 🐟 Skill.Fish

npx skillfish add mrgoonie/claudekit-skills ai-multimodal
