Advanced vision analysis featuring object detection, pixel-level segmentation, and multi-image comparison.
Native PDF vision processing for extracting structured JSON data, tables, and charts from multi-page documents.
Deep video understanding with scene detection, temporal Q&A, and support for public YouTube URLs.
Controllable image generation and editing with support for various aspect ratios and iterative refinement.
Comprehensive audio processing including transcription with timestamps and speaker identification for up to 9.5 hours of content.
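
The 9.5-hour audio figure suggests that longer recordings need to be split before processing. The following is a minimal sketch, assuming the figure is a per-request ceiling (the list does not say so explicitly). The function name `plan_audio_chunks` and the constant `MAX_AUDIO_SECONDS` are hypothetical and belong to no particular library; the code only computes chunk boundaries and does not call any audio or model API.

```python
from typing import List, Tuple

# Assumption: the 9.5-hour limit from the feature list applies per request.
MAX_AUDIO_SECONDS = int(9.5 * 3600)  # 34200 seconds


def plan_audio_chunks(
    total_seconds: int, max_seconds: int = MAX_AUDIO_SECONDS
) -> List[Tuple[int, int]]:
    """Return (start, end) offsets in seconds that cover the whole recording.

    Each chunk is at most `max_seconds` long. Hypothetical helper for
    illustration only.
    """
    if total_seconds < 0:
        raise ValueError("total_seconds must be non-negative")
    if max_seconds <= 0:
        raise ValueError("max_seconds must be positive")

    chunks: List[Tuple[int, int]] = []
    start = 0
    while start < total_seconds:
        # Clamp the final chunk to the end of the recording.
        end = min(start + max_seconds, total_seconds)
        chunks.append((start, end))
        start = end
    return chunks
```

A recording at or under the limit yields a single chunk, while a longer one is cut into consecutive, non-overlapping ranges. In practice a caller might prefer to cut at silences rather than at fixed offsets so that speaker turns are not split mid-sentence.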