Google debuts Gemini Embedding 2 multimodal AI model
Google has released Gemini Embedding 2 in public preview, which it describes as its first natively multimodal embedding model for developers using the Gemini API and Vertex AI.
The model generates embeddings for text, images, video, audio, and documents in a single shared embedding space. This enables retrieval and classification across media types within the same system.
Embedding models convert content into numerical representations that software can compare for similarity. They are commonly used for semantic search, clustering, and classification, and in Retrieval-Augmented Generation workflows to help select relevant material from a data store.
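To make "numerical representations that software can compare" concrete, the sketch below scores two toy vectors against a query using cosine similarity, the standard comparison for semantic search. The vectors are made-up four-dimensional values for illustration, not output from any real model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings"; real models output hundreds or thousands of dimensions.
query = [0.1, 0.9, 0.2, 0.0]
doc_same_topic = [0.2, 0.8, 0.1, 0.1]
doc_other_topic = [0.9, 0.0, 0.1, 0.7]

print(cosine_similarity(query, doc_same_topic))   # close to 1.0: semantically similar
print(cosine_similarity(query, doc_other_topic))  # close to 0.0: unrelated
```

A retrieval or RAG system applies the same comparison at scale: embed the query, then return the stored items whose embeddings score highest against it.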
Gemini Embedding 2 expands Google's earlier text-only embeddings to handle multiple modalities. Google says it captures semantic intent across more than 100 languages.
Modalities supported
The model supports text prompts with a context window of up to 8,192 input tokens. For images, it can process up to six images per request and accepts PNG and JPEG formats.
For video, it supports up to 120 seconds of input and accepts MP4 and MOV. For audio, it ingests and embeds content directly, without requiring a transcription step.
The model also supports document inputs, embedding PDFs of up to six pages. This targets organisations that store content as reports, manuals, forms, or scanned documents that are not already structured for search or analytics.
In addition to single-modality inputs, Gemini Embedding 2 accepts interleaved input. Developers can submit combinations such as image and text in one request, producing a single embedding that represents information split across media types.
Output dimensions
Gemini Embedding 2 uses Matryoshka Representation Learning, which makes embeddings usable at different sizes. Google says the model can scale down from its default output of 3,072 dimensions, allowing developers to trade storage and compute costs against quality.
Google recommends 3,072, 1,536, and 768 dimensions for the highest-quality settings. Lower dimensions can reduce vector database index sizes and cut query costs for high-volume similarity search.
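With Matryoshka-style embeddings, a lower-dimensional vector is typically obtained by keeping the leading coordinates of the full vector and renormalising to unit length. A minimal sketch, using a random stand-in vector rather than real model output:

```python
import math
import random

def truncate_embedding(vec, dims):
    """Keep the first `dims` coordinates, then renormalise to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

random.seed(0)
full = [random.gauss(0, 1) for _ in range(3072)]  # stand-in for a 3,072-dim embedding
small = truncate_embedding(full, 768)

print(len(small))                                # 768
print(round(sum(x * x for x in small), 6))       # 1.0 — unit length after renormalising
```

Storing 768 dimensions instead of 3,072 cuts vector storage and per-query distance computation to a quarter, which is the cost-versus-quality trade-off the tiered sizes expose.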
Performance claims
Google positions Gemini Embedding 2 as an advance over its prior embedding models and says it sets a new standard for multimodal depth. It also highlights speech performance alongside text, image, and video tasks.
The release puts Google in more direct competition with other providers offering multimodal embeddings, as businesses expand search and analytics from text into mixed-media archives. Product teams are increasingly using embeddings to organise and query internal knowledge that includes screenshots, training videos, recorded meetings, and customer calls.
In practice, multimodal embeddings can reduce the need for separate pipelines per media type. A common alternative approach transcribes audio into text, extracts keyframes from video, then runs separate models for each format before combining results. A single model can simplify that workflow, although many deployments still add processing for metadata, compliance, and access control.
Ecosystem access
Gemini Embedding 2 is available through Google's Gemini API and Vertex AI during the public preview. Google says developers can also access it through tools and vector database ecosystems including LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, ChromaDB, and Vector Search.

These integrations matter because embedding models are rarely used in isolation. They typically sit behind a vector index that stores embeddings for a corpus and serves nearest-neighbour search queries for applications such as enterprise search, customer support assistants, and content moderation pipelines.
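The embed-then-index pattern can be sketched with a toy in-memory index that stores precomputed embeddings and answers top-k queries by cosine similarity. Real vector databases replace this brute-force scan with approximate nearest-neighbour structures such as HNSW graphs; the doc IDs and vectors here are illustrative placeholders.

```python
import math

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

class ToyVectorIndex:
    """Brute-force stand-in for a vector database: store (id, embedding) pairs, serve top-k queries."""

    def __init__(self):
        self._items = []  # list of (doc_id, embedding)

    def add(self, doc_id, embedding):
        self._items.append((doc_id, embedding))

    def search(self, query_embedding, k=3):
        """Return the k document IDs whose embeddings score highest against the query."""
        scored = [(_cosine(query_embedding, emb), doc_id) for doc_id, emb in self._items]
        scored.sort(reverse=True)
        return [doc_id for _, doc_id in scored[:k]]

# Mixed-media corpus: one shared embedding space, regardless of source modality.
index = ToyVectorIndex()
index.add("faq.txt", [0.9, 0.1, 0.0])
index.add("screenshot.png", [0.1, 0.9, 0.1])
index.add("meeting.mp4", [0.0, 0.2, 0.9])

print(index.search([0.85, 0.15, 0.05], k=2))  # ['faq.txt', 'screenshot.png']
```

An application such as an enterprise search tool would embed the user's query with the same model, call `search`, and feed the returned items to a downstream step such as a RAG prompt.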
Google says early-access partners are using the model for multimodal applications, but it did not name organisations or provide deployment details.
A product note from the Google DeepMind team emphasised that the model is aimed at developers building retrieval and classification systems across mixed data sources. "We can't wait to see what you build," the note concluded.