What file formats are supported for multimodal processing?

The skill supports JPEG, PNG, WEBP, HEIC for images; MP4, MOV, AVI for video; WAV, MP3, FLAC, AAC for audio; and native PDF documents.

What is the maximum duration for video and audio files?

Gemini 3 Pro supports video files up to 1 hour in duration and audio files up to 9.5 hours for transcription and analysis.

How can I optimize token usage and costs?

You can use the 'media_resolution' parameter to choose between low (280 tokens), medium (560 tokens), and high (1,120 tokens) per image or frame.

Can it identify specific speakers in audio files?

Yes, the audio processing capabilities include speaker identification, speech-to-text transcription, and the ability to identify key timestamps.

Does this skill support OCR for screenshots and documents?

Yes, by setting the resolution to 'high', the model provides advanced OCR capabilities for extracting text from images, UI screenshots, and PDF pages.

Gemini 3 Multimodal Input Processing

Name: Gemini 3 Multimodal Input Processing
Author: adaptationio

byadaptationio

Data Science & ML

Analyzes and extracts insights from images, videos, audio, and PDF documents using Gemini 3 Pro's native multimodal capabilities.

About

This skill provides a comprehensive framework for processing multimodal data within Claude. It enables deep analysis of various media types including image recognition, OCR, object detection, video summarization, audio transcription, and complex multi-page PDF document extraction. By offering granular control over media resolution and token optimization, it allows developers to balance quality and cost effectively while building AI-driven applications that need to interpret visual, auditory, and document-based information.

Key Features

Deep Video Analysis for up to 1 hour of content with timestamped insights
Native Multi-page PDF Document Extraction and structural analysis
0 GitHub stars
Granular Media Resolution Control (low, medium, high) for token optimization
Advanced Image Understanding including object detection and high-fidelity OCR
Long-form Audio Processing supporting up to 9.5 hours of speech and transcription

Use Cases

Automating complex document and form processing from multi-page PDFs
Generating automated summaries and scene detection for video content
Transcribing and analyzing high-volume audio for meetings or podcasts

About

Key Features

Deep Video Analysis for up to 1 hour of content with timestamped insights
Native Multi-page PDF Document Extraction and structural analysis
0 GitHub stars
Granular Media Resolution Control (low, medium, high) for token optimization
Advanced Image Understanding including object detection and high-fidelity OCR
Long-form Audio Processing supporting up to 9.5 hours of speech and transcription

Use Cases

Automating complex document and form processing from multi-page PDFs
Generating automated summaries and scene detection for video content
Transcribing and analyzing high-volume audio for meetings or podcasts