What are the maximum file limits for processing?

The skill supports audio files up to 9.5 hours long, video up to 6 hours long, and PDF documents up to 1,000 pages, relying on Gemini's 2M-token context window.
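As a rough illustration, the limits quoted above could be encoded as a pre-flight check before a file is sent for processing. This is only a sketch: the constant names and the `within_limits` helper are hypothetical and not part of the skill, and the numbers are simply those stated in the answer above.

```python
# Limits as quoted in the answer above.
# All names here are illustrative assumptions, not the skill's own API.
MAX_AUDIO_HOURS = 9.5
MAX_VIDEO_HOURS = 6.0
MAX_PDF_PAGES = 1000


def within_limits(kind: str, size: float) -> bool:
    """Return True if a file of the given kind fits the stated limit.

    `size` is a duration in hours for "audio" and "video",
    and a page count for "pdf".
    """
    limits = {
        "audio": MAX_AUDIO_HOURS,
        "video": MAX_VIDEO_HOURS,
        "pdf": MAX_PDF_PAGES,
    }
    if kind not in limits:
        raise ValueError(f"unknown media kind: {kind!r}")
    return 0 <= size <= limits[kind]
```

For example, `within_limits("video", 6.5)` would be rejected, while `within_limits("pdf", 1000)` would pass.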

Does this skill require a separate API key?

Yes, you need a Google AI Studio or Vertex AI API key configured as GEMINI_API_KEY in your environment to use these multimodal features.
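A minimal sketch of how a script might confirm that the key is present before making any request. Only the variable name `GEMINI_API_KEY` comes from the answer above; the `load_gemini_api_key` helper and its error message are assumptions made for illustration.

```python
import os


def load_gemini_api_key(env=None) -> str:
    """Read GEMINI_API_KEY from the environment and fail early if it is missing.

    `env` is an optional mapping used in place of os.environ,
    which makes the helper easy to exercise in tests.
    """
    source = os.environ if env is None else env
    key = source.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; export a Google AI Studio "
            "or Vertex AI API key before running the skill."
        )
    return key
```

Failing at startup with a clear message tends to be easier to diagnose than an authentication error returned partway through a long media upload.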

Can I generate images with this skill?

Yes, it supports text-to-image generation, editing, and refinement using specialized models like gemini-2.5-flash-image.

Does it support YouTube videos?

Yes, the skill can process public YouTube URLs directly for scene detection, summarization, and transcription.

Which Gemini models are recommended?

Gemini 2.5 Flash is recommended for most tasks due to its balance of speed and features, while Gemini 2.5 Pro is best for high-quality complex reasoning.
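The recommendations in this FAQ could be captured in a small lookup, sketched below. The task labels and the `pick_model` helper are hypothetical. The identifier `gemini-2.5-flash-image` is taken from the image-generation answer above, and the other two strings are assumed to be the API spellings of Gemini 2.5 Flash and Gemini 2.5 Pro.

```python
def pick_model(task: str) -> str:
    """Map a coarse task label to a model name, following the FAQ's guidance.

    Task labels are illustrative assumptions:
      "image_generation"  -> specialized image model
      "complex_reasoning" -> Gemini 2.5 Pro
      anything else       -> Gemini 2.5 Flash (the general default)
    """
    if task == "image_generation":
        return "gemini-2.5-flash-image"
    if task == "complex_reasoning":
        return "gemini-2.5-pro"
    return "gemini-2.5-flash"
```

Defaulting to Flash mirrors the answer above: it is recommended for most tasks, with Pro reserved for cases where reasoning quality matters more than speed.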

AI Multimodal Processing

Author: Microck

Data Science & ML

Processes and generates multimedia content including audio, video, images, and documents using the Google Gemini API.

About

This skill integrates Google Gemini's advanced multimodal capabilities directly into your workflow, enabling deep analysis and generation of diverse media types. It provides a unified interface for transcribing audio up to 9.5 hours, analyzing video content up to 6 hours, and performing native PDF vision processing for complex documents. Whether you need to detect objects in images, extract structured data from forms, or generate high-fidelity images from text prompts, this skill offers the tools and scripts necessary to handle massive context windows and complex media tasks efficiently.

Key Features

Native PDF vision processing for multi-page document extraction and table analysis.
Pixel-level image segmentation and object detection using Gemini 2.0/2.5 models.
Comprehensive audio transcription and analysis for files up to 9.5 hours.
Advanced video understanding with scene detection and temporal Q&A support.
High-fidelity text-to-image generation, editing, and multi-image composition.
81 GitHub stars

Use Cases

Automating the transcription and summarization of long-form meetings, lectures, or podcasts.
Building automated media pipelines for image captioning, object localization, and visual Q&A.
Extracting structured JSON data from complex multi-page PDF reports and financial statements.
