Vision (Image Understanding)

Vision is built into models that support multimodal input. Send images alongside text through the Chat Completions API.

How to Use

Pass images as content parts in the messages array:

python

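from openai import OpenAI

# Assumption: an OpenAI-compatible client; set base_url and api_key for your provider.
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        # Content is a list of parts: one text part plus one or more image parts.
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
        ]
    }]
)
# The model's answer comes back as ordinary text.
print(response.choices[0].message.content)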

Image Input Formats

| Format | Example |
| --- | --- |
| URL | `{"type": "image_url", "image_url": {"url": "https://..."}}` |
| Base64 | `{"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}` |
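The Base64 format embeds the image bytes directly in the `url` field as a data URL. A minimal sketch of building such a content part from a local file follows. The helper names `to_data_url` and `image_part_from_file` are hypothetical and not part of any SDK, and the MIME type is guessed from the file extension, which is an assumption about how your files are named.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(data: bytes, mime_type: str) -> str:
    """Encode raw image bytes as a data URL (hypothetical helper)."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_part_from_file(path: str) -> dict:
    """Build an image_url content part from a local file (hypothetical helper)."""
    # Guess the MIME type from the extension, e.g. ".png" -> "image/png".
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {path!r}")
    data = Path(path).read_bytes()
    return {"type": "image_url", "image_url": {"url": to_data_url(data, mime_type)}}
```

The returned dictionary can be placed in the `content` list next to a text part, in the same position as a URL-based image part.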

Models with Vision

| Model | Max Images | Notes |
| --- | --- | --- |
| `gpt-4o` | 20 | Best overall vision |
| `claude-sonnet-4` | 20 | Strong document understanding |
| `gemini-2.5-pro` | 16 | 1M context for long documents |
| `meta/llama-4-maverick` | 8 | Open-weight alternative |
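Because the image limit differs by model, it can help to check the count before sending a request. The sketch below copies the limits from the table above into a dictionary and builds a multi-image user message. The names `MAX_IMAGES` and `build_vision_message` are hypothetical, and treating "Max Images" as a per-request limit is an assumption.

```python
# Limits copied from the table above; assumed to apply per request.
MAX_IMAGES = {
    "gpt-4o": 20,
    "claude-sonnet-4": 20,
    "gemini-2.5-pro": 16,
    "meta/llama-4-maverick": 8,
}


def build_vision_message(model: str, prompt: str, image_urls: list[str]) -> dict:
    """Build a user message with one text part and several image parts.

    Hypothetical helper: raises ValueError if the model is not listed or
    the number of images exceeds the model's limit.
    """
    limit = MAX_IMAGES.get(model)
    if limit is None:
        raise ValueError(f"no image limit known for model {model!r}")
    if len(image_urls) > limit:
        raise ValueError(
            f"{model} accepts at most {limit} images, got {len(image_urls)}"
        )
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    return {"role": "user", "content": content}
```

The result is a single entry for the `messages` array, for example `messages=[build_vision_message("gpt-4o", "Compare these images.", urls)]`.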

→ Full capability matrix
