Vision (Image Understanding)

Vision is built into models that accept multimodal input. Send images alongside text through the Chat Completions API.

How to Use

Pass images as content parts in the messages array:

python
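from openai import OpenAI

# Assumption: an OpenAI-compatible client. Set api_key and base_url
# for your provider; by default the SDK reads OPENAI_API_KEY from the environment.
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        # A list of content parts: one text part, then one image part
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
        ]
    }]
)
# The model's answer is returned as ordinary text
print(response.choices[0].message.content)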

Image Input Formats

Format   Example
URL      {"type": "image_url", "image_url": {"url": "https://..."}}
Base64   {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
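
For a local file, the Base64 row means building a data URL by hand. The sketch below shows one way to do that with the Python standard library. The helper names `image_to_data_url` and `image_part` are hypothetical, not part of any SDK, and the MIME type is guessed from the file extension, which is an assumption that may not hold for unusual files.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL (hypothetical helper)."""
    # Guess the MIME type from the file extension, e.g. ".png" -> "image/png"
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_part(url: str) -> dict:
    """Wrap an https or data URL in the content-part shape from the table."""
    return {"type": "image_url", "image_url": {"url": url}}
```

The result of `image_part(image_to_data_url("photo.png"))` can be placed in the `content` list exactly where a URL-based image part would go.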

Models with Vision

Model                   Max Images   Notes
gpt-4o                  20           Best overall vision
claude-sonnet-4         20           Strong document understanding
gemini-2.5-pro          16           1M context for long documents
meta/llama-4-maverick   8            Open-weight alternative
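
Because the per-request image limit differs by model, it can help to check the count before sending. The following is a minimal sketch under stated assumptions: `MAX_IMAGES` simply copies the limits from the table above, `build_vision_message` is a hypothetical helper, and how the API itself responds to an over-limit request is not covered here.

```python
# Limits copied from the table above; update them if the table changes.
MAX_IMAGES = {
    "gpt-4o": 20,
    "claude-sonnet-4": 20,
    "gemini-2.5-pro": 16,
    "meta/llama-4-maverick": 8,
}


def build_vision_message(model: str, prompt: str, image_urls: list[str]) -> dict:
    """Build one user message with a text part followed by image parts.

    Hypothetical helper: raises ValueError if the model is not in the table
    or if more images are supplied than the table allows.
    """
    limit = MAX_IMAGES.get(model)
    if limit is None:
        raise ValueError(f"No image limit known for model {model!r}")
    if len(image_urls) > limit:
        raise ValueError(
            f"{model} accepts at most {limit} images, got {len(image_urls)}"
        )
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    return {"role": "user", "content": content}
```

The returned dictionary can be passed as the single element of `messages` in a `client.chat.completions.create(...)` call.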

Full capability matrix