By default, the agent picks the best model for each task. You can override that choice for a single generation or for an entire project, either from the prompt box settings or by telling the agent directly. When using the API, GET /models returns the current list of available models.
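A minimal sketch of calling GET /models from Python with only the standard library. It assumes a hypothetical base URL (`https://api.example.com/v1`), bearer-token authentication, and a JSON response that is an array of objects with an `id` field. This page specifies none of these details, so check the API reference for the real values.

```python
import json
import urllib.request

# Hypothetical base URL: replace with the real API endpoint.
BASE_URL = "https://api.example.com/v1"


def build_models_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build a GET /models request. Bearer auth is an assumption, not documented here."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/models",
        headers={"Authorization": f"Bearer {api_key}", "Accept": "application/json"},
        method="GET",
    )


def model_ids(models: list) -> list:
    """Extract model IDs, assuming each entry carries an "id" field (hypothetical schema)."""
    return [m["id"] for m in models if "id" in m]


def fetch_models(base_url: str, api_key: str) -> list:
    """Send the request and decode the JSON body (assumed to be a list of model objects)."""
    with urllib.request.urlopen(build_models_request(base_url, api_key), timeout=30) as resp:
        return json.load(resp)
```

With a real key, `model_ids(fetch_models(BASE_URL, api_key))` would return the IDs shown in the tables below, such as `seedream-5` or `kling-3.0`, if the response shape matches the assumption.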

Image

| Mode | Description |
|---|---|
| Text to image | Generate from a text prompt |
| Image to image | Edit or transform an existing image |
| Ingredients to image | Combine multiple reference images into a new image |

Available image models

| Model | ID | Modes | Aspect ratios | Max refs | Max variants |
|---|---|---|---|---|---|
| Nano Banana Pro | google-nano-banana-pro | text, image, references | 16:9, 9:16, 21:9, 1:1 | 5 | 1 |
| Nano Banana 2 | google-nano-banana-2 | text, image, references | 16:9, 9:16, 21:9, 1:1 | 14 | 1 |
| Nano Banana | google-nano-banana | text, image, references | 16:9, 9:16, 21:9, 1:1 | 3 | 1 |
| Seedream 5.0 Lite | seedream-5 | text, image, references | 16:9, 9:16, 21:9, 1:1 | 14 | 5 |
| Seedream 4.5 | seedream-4.5 | text, image, references | 16:9, 9:16, 21:9, 1:1 | 10 | 5 |
| Flux 2 Max | flux-2-max | text, image, references | 16:9, 9:16 | 8 | 1 |
| GPT Image 1.5 | gpt-image-1.5 | text, image, references | 16:9, 9:16 | 10 | 5 |
| Grok Imagine Pro | grok-imagine-image-pro | text, image | 16:9, 9:16, 1:1 | n/a | 1 |
| Riverflow 2.0 Pro | riverflow-2-pro | text, image, references | 16:9, 9:16, 21:9, 1:1 | 10 | 1 |
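If you select models in your own code rather than letting the agent pick, the table above can be encoded as data and filtered. This is a minimal sketch: the rows are transcribed from the table, while `ImageModel` and `candidates` are hypothetical helpers, not part of the product API. In practice the live list should come from GET /models, since the table can change.

```python
from dataclasses import dataclass

ALL_RATIOS = frozenset({"16:9", "9:16", "21:9", "1:1"})
LANDSCAPE_PORTRAIT = frozenset({"16:9", "9:16"})
WITH_REFS = frozenset({"text", "image", "references"})


@dataclass(frozen=True)
class ImageModel:
    """Hypothetical record mirroring one row of the image model table."""
    model_id: str
    modes: frozenset
    aspect_ratios: frozenset
    max_refs: int  # 0 when the model has no "references" mode
    max_variants: int


# Rows transcribed from the table; the live list comes from GET /models.
IMAGE_MODELS = [
    ImageModel("google-nano-banana-pro", WITH_REFS, ALL_RATIOS, 5, 1),
    ImageModel("google-nano-banana-2", WITH_REFS, ALL_RATIOS, 14, 1),
    ImageModel("google-nano-banana", WITH_REFS, ALL_RATIOS, 3, 1),
    ImageModel("seedream-5", WITH_REFS, ALL_RATIOS, 14, 5),
    ImageModel("seedream-4.5", WITH_REFS, ALL_RATIOS, 10, 5),
    ImageModel("flux-2-max", WITH_REFS, LANDSCAPE_PORTRAIT, 8, 1),
    ImageModel("gpt-image-1.5", WITH_REFS, LANDSCAPE_PORTRAIT, 10, 5),
    ImageModel("grok-imagine-image-pro", frozenset({"text", "image"}), LANDSCAPE_PORTRAIT | {"1:1"}, 0, 1),
    ImageModel("riverflow-2-pro", WITH_REFS, ALL_RATIOS, 10, 1),
]


def candidates(aspect_ratio: str, refs: int = 0, variants: int = 1) -> list:
    """Return IDs of models that accept the aspect ratio, reference count, and variant count."""
    out = []
    for m in IMAGE_MODELS:
        if aspect_ratio not in m.aspect_ratios:
            continue
        # Reference images require the "references" mode and a high enough limit.
        if refs > 0 and ("references" not in m.modes or refs > m.max_refs):
            continue
        if variants > m.max_variants:
            continue
        out.append(m.model_id)
    return out
```

For example, `candidates("21:9", refs=12)` keeps only the two models whose limit is 14 reference images, in table order.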

Video

| Mode | Description |
|---|---|
| Text to video | Generate from a text prompt |
| Image to video | Generate video from a starting image |
| Ingredients to video | Combine multiple reference images into a video |
| Video to video | Edit or transform an existing video |
| Audio to video | Generate video synced to speech or music |
| Multi-shot | Multiple shots in a single generation, each with its own direction |
| Lip-sync | Generate video synced to a voiceover |
| Start + end frame | Set the first and last frame; the video is generated between them |

Available video models

| Model | ID | Modes | Durations | Audio | Key features |
|---|---|---|---|---|---|
| Kling 3.0 Omni | kling-3.0-omni | image, video, references, multi-shot, end frame | 3–15s | Yes | 7 ref images, 1 ref video, V2V |
| Kling 3.0 | kling-3.0 | image, multi-shot, end frame | 3–15s | Yes | 1080p |
| Kling O1 Edit | kling-o1 | image, video, references, end frame | 3–10s | No | V2V editing, 7 ref images, preserves original sound |
| Kling 2.6 | kling-2.6 | text, image | 5, 10s | Yes | Negative prompt, 1080p |
| Veo 3.1 | google-veo-3.1 | text, image, end frame | 4, 6, 8s | Yes | Up to 1080p |
| Veo 3.1 Fast | google-veo-3.1-fast | text, image, end frame | 4, 6, 8s | Yes | Up to 1080p |
| Seedance 1.5 Pro | seedance-1.5-pro | text, image, end frame | 4–12s | Yes | All aspect ratios, 1080p |
| Wan 2.6 | wan-2.6 | text, image, audio, end frame, lip-sync | 5, 10, 15s | Always on | Audio input sync |
| LTX 2.3 Pro | ltx-2.3-pro | text, image, end frame | 6, 8, 10s | Yes | Camera motion (dolly, jib, tracking, static, focus shift), up to 4K |
| LTX 2.3 Fast | ltx-2.3-fast | text, image, end frame | 6–20s | Yes | Camera motion, up to 4K, longest durations |
| Sora 2 Pro | openai-sora-2-pro | text, image | 4, 8, 12s | Always on | Up to 1024p |
| Sora 2 | openai-sora-2 | text, image | 4, 8, 12s | Always on | 720p |
| Grok Imagine Video | grok-imagine-video | text, image, video | 1–15s | Always on | V2V, flexible durations, 720p |
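The Durations column uses two notations: a range such as 3–15s and a fixed list such as 4, 6, 8s. A minimal sketch of checking a requested length against either form, with the column transcribed into a dictionary. It assumes a range accepts any whole number of seconds between its endpoints, which the table does not state, and `duration_allowed` and `models_supporting` are hypothetical helpers, not part of the product API.

```python
# Durations column transcribed from the table (an en dash marks a range).
VIDEO_DURATIONS = {
    "kling-3.0-omni": "3–15s",
    "kling-3.0": "3–15s",
    "kling-o1": "3–10s",
    "kling-2.6": "5, 10s",
    "google-veo-3.1": "4, 6, 8s",
    "google-veo-3.1-fast": "4, 6, 8s",
    "seedance-1.5-pro": "4–12s",
    "wan-2.6": "5, 10, 15s",
    "ltx-2.3-pro": "6, 8, 10s",
    "ltx-2.3-fast": "6–20s",
    "openai-sora-2-pro": "4, 8, 12s",
    "openai-sora-2": "4, 8, 12s",
    "grok-imagine-video": "1–15s",
}

EN_DASH = "\u2013"


def duration_allowed(spec: str, seconds: int) -> bool:
    """Return True if `seconds` fits a Durations cell such as "3–15s" or "4, 6, 8s"."""
    body = spec.strip().removesuffix("s")
    if EN_DASH in body:
        # Range form: assumed to accept any whole second between the endpoints.
        low, high = (int(part) for part in body.split(EN_DASH))
        return low <= seconds <= high
    # List form: only the listed lengths are accepted.
    return seconds in {int(part) for part in body.split(",")}


def models_supporting(seconds: int) -> list:
    """IDs of video models whose Durations cell accepts the requested length."""
    return [model_id for model_id, spec in VIDEO_DURATIONS.items() if duration_allowed(spec, seconds)]
```

Under these assumptions, a 20-second request matches only `ltx-2.3-fast`, consistent with its "longest durations" note in the table.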

Audio

| Type | Description |
|---|---|
| Voiceover | Generate speech across 60+ languages |
| Music | Generate music from text prompts |
| Sound effects | Generate sound effects and ambient audio |

Voiceover

| Model | ID | Key features |
|---|---|---|
| ElevenLabs v3 | elevenlabs | 60+ languages, tone control, voice transform, adjustable speed |

Music

| Model | ID | Duration | Duration control |
|---|---|---|---|
| ElevenLabs Music | elevenlabs | 3s–10 min | Yes |
| MiniMax Music 2.5 | minimax | Varies by lyrics length | No |
| Google Lyria 3 Pro | google | Up to 3 min | No |

Sound effects

| Model | ID | Duration | Key features |
|---|---|---|---|
| ElevenLabs SFX | elevenlabs | 0.5–22s | Looping, prompt influence control |