How long can the processed audio or video be?

With a 2M-token context window, the skill can process up to 9.5 hours of audio or approximately 6 hours of low-resolution video.
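As a rough sanity check on these limits, the sketch below estimates token usage from media duration. The per-second rates (32 tokens per second of audio, roughly 100 tokens per second of low-resolution video) are assumptions based on Google's published Gemini token accounting and are not confirmed by this skill. The function and constant names are hypothetical.

```python
# Assumed per-second token rates (from Gemini's published token accounting,
# not confirmed by this skill). Treat the results as estimates only.
AUDIO_TOKENS_PER_SECOND = 32
LOW_RES_VIDEO_TOKENS_PER_SECOND = 100  # approximate, includes the audio track
CONTEXT_WINDOW_TOKENS = 2_000_000      # the "2M" window, taken as two million


def estimate_tokens(hours: float, tokens_per_second: int) -> int:
    """Estimate the tokens consumed by media of the given duration."""
    return round(hours * 3600 * tokens_per_second)


def fits_in_context(tokens: int, window: int = CONTEXT_WINDOW_TOKENS) -> bool:
    """Return True if the estimated token count fits in the context window."""
    return tokens <= window


audio_tokens = estimate_tokens(9.5, AUDIO_TOKENS_PER_SECOND)
max_video_hours = CONTEXT_WINDOW_TOKENS / (LOW_RES_VIDEO_TOKENS_PER_SECOND * 3600)
print(audio_tokens, fits_in_context(audio_tokens), round(max_video_hours, 1))
```

Under these assumptions, 9.5 hours of audio comes to about 1.09M tokens, and a 2M-token window holds roughly 5.6 hours of low-resolution video, which is in line with the "approximately 6 hours" figure.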

Can I generate images with this skill?

Yes, it includes full support for text-to-image generation, editing, and multi-image composition using the Gemini 2.5 Flash Image model.

Does this skill support YouTube videos?

Yes, the ai-multimodal skill can analyze public YouTube videos via URLs for transcription, summarization, and scene detection.
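The skill's own scripts are not shown on this page, so the following is only a sketch of what a YouTube request to the Gemini API might look like. It assumes the REST `generateContent` body shape from Google's documentation, where the video URL is passed as a `file_data.file_uri` part next to a text prompt. The helper name `build_youtube_request`, the host allow-list, and the `VIDEO_ID` placeholder are hypothetical.

```python
import json
from urllib.parse import urlparse

# Hosts accepted as YouTube links in this sketch (an assumption, not the skill's rule).
YOUTUBE_HOSTS = {"www.youtube.com", "youtube.com", "youtu.be"}


def build_youtube_request(url: str, prompt: str) -> dict:
    """Build a generateContent-style request body for a public YouTube URL."""
    host = urlparse(url).netloc.lower()
    if host not in YOUTUBE_HOSTS:
        raise ValueError(f"not a YouTube URL: {url}")
    return {
        "contents": [{
            "parts": [
                {"file_data": {"file_uri": url}},  # the video to analyze
                {"text": prompt},                   # what to do with it
            ]
        }]
    }


body = build_youtube_request(
    "https://www.youtube.com/watch?v=VIDEO_ID",  # placeholder, not a real video
    "Summarize this video and list the main scene changes with timestamps.",
)
print(json.dumps(body, indent=2))
```

The sketch only builds the payload. Sending it requires an API key and a POST to the chosen model's `generateContent` endpoint, which is left out here.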

Which Gemini models are recommended for this skill?

Gemini 2.5 Flash is recommended for its balance of speed and performance, while Gemini 2.5 Pro is the choice when you need the highest-quality analysis or the 2M-token context window.

What are the limits for PDF processing?

The skill supports native PDF vision processing for up to 1,000 pages, allowing for deep analysis of tables, forms, and diagrams.
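To illustrate how a PDF might be submitted for structured extraction, here is a minimal sketch that builds a request body with the document inlined as base64 and JSON output requested. It assumes the Gemini REST conventions of an `inline_data` part with a `mime_type` of `application/pdf` and a `generationConfig.responseMimeType` of `application/json`. The helper name `build_pdf_request` is hypothetical, and the page count is supplied by the caller because counting pages would need a PDF library.

```python
import base64

MAX_PDF_PAGES = 1000  # the page limit stated above


def build_pdf_request(pdf_bytes: bytes, prompt: str, page_count: int) -> dict:
    """Build a generateContent-style body that inlines a PDF and asks for JSON."""
    if page_count > MAX_PDF_PAGES:
        raise ValueError(f"{page_count} pages exceeds the {MAX_PDF_PAGES}-page limit")
    return {
        "contents": [{
            "parts": [
                {
                    "inline_data": {
                        "mime_type": "application/pdf",
                        # Binary data must be base64-encoded to travel in JSON.
                        "data": base64.b64encode(pdf_bytes).decode("ascii"),
                    }
                },
                {"text": prompt},
            ]
        }],
        # Assumed config key asking the model to reply with JSON.
        "generationConfig": {"responseMimeType": "application/json"},
    }


sample_pdf = b"%PDF-1.4 placeholder bytes"  # stand-in for a real file's contents
request = build_pdf_request(sample_pdf, "Extract every table as JSON rows.", page_count=3)
print(request["contents"][0]["parts"][0]["inline_data"]["mime_type"])
```

In practice the bytes would come from reading a file in binary mode, for example `open(path, "rb").read()`.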

AI Multimodal Processing

Name: AI Multimodal Processing
Author: samhvw8


Data Science & ML

Leverages Google Gemini API to process and analyze audio, video, images, and documents directly within Claude.

This skill integrates Google Gemini’s powerful multimodal capabilities into the Claude environment, enabling advanced processing of diverse media types including audio transcription for up to 9.5 hours, video analysis of YouTube URLs, and high-fidelity image generation. It provides a unified interface for extracting structured data from PDFs, performing object detection, and conducting visual question-answering, making it an essential tool for developers needing to bridge the gap between complex multimedia content and text-based AI workflows.

Key Features

01 Native PDF processing for table extraction and structured data output from documents up to 1,000 pages.

02 Visual understanding including OCR, object detection, and pixel-level segmentation via Gemini 2.5.

03 Advanced image generation and iterative editing using controllable styles and aspect ratios.

04 Video analysis for scene detection and temporal Q&A with support for local files and YouTube URLs.

05 1 GitHub star

06 Comprehensive audio transcription and analysis with timestamp support for files up to 9.5 hours.

Use Cases

01 Generating and refining visual assets or UI mockups using text-to-image prompts and composition tools.

02 Automated transcription and summarization of long-form technical meetings or video tutorials.

03 Extracting structured JSON data from complex multi-page PDF forms, charts, and technical diagrams.


Install with 🐟 Skill.Fish

npx skillfish add samhvw8/dot-claude ai-multimodal

For use in Claude.ai and ChatGPT
