01 Text-to-image generation and editing with controllable styles and aspect ratios
02 Image captioning, object detection, and pixel-level segmentation
03 Native PDF vision processing for extracting tables, forms, and diagrams from multi-page documents
04 Video scene detection and temporal analysis for recordings up to 6 hours
05 Audio transcription, speaker identification, and music analysis for files up to 9.5 hours