Which Gemini models are supported by this skill?

It supports the Gemini 2.0 and 2.5 series, including Pro, Flash, and Flash-Lite models, allowing you to balance speed, cost, and context window requirements.

Can I generate and edit images with this skill?

Yes, by utilizing the Gemini Flash-Image models, you can perform text-to-image generation, image editing, and multi-image composition directly via the provided scripts.

Does this skill require a paid Google Cloud account?

It is compatible with both Google AI Studio (which offers a generous free tier) and Vertex AI for enterprise-grade Google Cloud Platform deployments.

Can I process large files like long videos or audio recordings?

Yes, the skill utilizes the Gemini File API to support video processing up to 6 hours and audio transcription for files up to 9.5 hours.

How does document extraction work with this skill?

The skill uses native vision-based processing for PDFs up to 1,000 pages, enabling it to 'see' and extract data from tables, charts, and handwritten forms into structured JSON.

AI Multimodal Processing

Name: AI Multimodal Processing
Author: GGPrompts

byGGPrompts

•

Data Science & ML

Enables Claude to process, analyze, and generate audio, image, video, and document content using Google Gemini APIs.

This skill integrates advanced multimodal capabilities into the Claude Code environment by leveraging Google's Gemini 2.0 and 2.5 models. it provides a unified interface for complex media tasks, allowing developers to transcribe hours of audio, perform OCR on multi-page PDFs, analyze video scenes with temporal accuracy, and generate high-fidelity images directly from text prompts. With support for context windows up to 2 million tokens, it is ideal for building AI-powered features that require deep understanding of diverse media formats and structured data extraction.

Key Features

01Comprehensive visual understanding including object detection, OCR, and pixel-level segmentation

02Advanced audio transcription and speaker identification for files up to 9.5 hours

03Native PDF processing for structured data extraction from tables, forms, and diagrams

04High-fidelity text-to-image generation and editing with controllable styles and aspect ratios

05Long-form video analysis with scene detection and temporal Q&A for up to 6 hours of content

062 GitHub stars

Use Cases

01Generating descriptive metadata, summaries, and searchable transcripts for large video and audio libraries

02Building AI-driven visual inspection or content moderation tools using image segmentation and object detection

03Automating data extraction from complex financial PDF reports, tables, and technical diagrams

What are Skills?·How to Install

Install with 🐟 Skill.Fish

npx skillfish add ggprompts/my-gg-plugins ai-multimodal

For use in Claude.ai and ChatGPT

Download Skill