Patch Embeddings & Positions
The core of the vision encoder's ability to process images lies in the Patch Embedding layer. Unlike traditional CNNs that process images through successive downsampling, the vision component of this VLM (SigLIP) treats an image as a sequence of visual "tokens," making it compatible with Transformer architectures.
Overview
The SiglipVisionEmbeddings module transforms a raw input image — represented as a tensor of pixel values — into a sequence of vector embeddings. This process involves two primary steps:
- Patchification: Dividing the image into fixed-size square patches.
- Positional Encoding: Adding spatial information to each patch so the model understands where each "token" was located in the original image.
Configuration Parameters
The behavior of the patch embedding layer is governed by the SiglipVisionConfig. The following parameters are critical for determining the shape and number of visual tokens:
| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| image_size | int | 224 | The expected input resolution (Height and Width). |
| patch_size | int | 16 | The dimensions of each square patch. |
| num_channels | int | 3 | The number of input color channels (e.g., RGB). |
| hidden_size | int | 768 | The dimensionality of the resulting embedding vector. |
Visual Token Calculation
The total number of tokens generated from a single image is calculated as: $$\text{Num Patches} = \left( \frac{\text{image\_size}}{\text{patch\_size}} \right)^2$$ For the default configuration ($224 \div 16 = 14$ patches per side), the model produces $14^2 = 196$ visual tokens.
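The calculation above can be verified with a couple of lines of Python using the default values from the configuration table:

```python
# Visual token count for the default configuration (image_size=224, patch_size=16).
image_size, patch_size = 224, 16
patches_per_side = image_size // patch_size  # 224 // 16 = 14
num_patches = patches_per_side ** 2          # 14 ** 2 = 196
print(num_patches)  # 196
```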
The Embedding Process
Convolutional Projection
The model uses a 2D Convolutional layer (nn.Conv2d) to perform patch extraction and linear projection simultaneously. By setting the kernel_size and stride equal to the patch_size, the model ensures that patches are non-overlapping.
- Input: [Batch_Size, Channels, Height, Width]
- Output: [Batch_Size, Hidden_Size, Num_Patches_H, Num_Patches_W]
After the convolution, the output is flattened and transposed into a sequence of shape [Batch_Size, Num_Patches, Hidden_Size].
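The mechanics described above can be sketched with a plain nn.Conv2d. This is an illustrative standalone example (not the SiglipVisionEmbeddings source itself), using the default configuration values:

```python
import torch
import torch.nn as nn

# Sketch of the patchify-and-project step: a Conv2d whose kernel_size and
# stride both equal patch_size extracts non-overlapping patches and projects
# each one to hidden_size in a single operation.
num_channels, hidden_size, patch_size = 3, 768, 16
patch_embed = nn.Conv2d(
    in_channels=num_channels,
    out_channels=hidden_size,
    kernel_size=patch_size,
    stride=patch_size,  # stride == kernel_size -> non-overlapping patches
)

pixel_values = torch.randn(1, 3, 224, 224)       # [Batch, Channels, H, W]
feature_map = patch_embed(pixel_values)          # [1, 768, 14, 14]
tokens = feature_map.flatten(2).transpose(1, 2)  # [1, 196, 768]
print(tokens.shape)  # torch.Size([1, 196, 768])
```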
Learned Positional Embeddings
Because Transformers are permutation-invariant, they cannot inherently distinguish the order or position of tokens. To solve this, SiglipVisionEmbeddings maintains a position_embedding—a learned lookup table of shape [Num_Patches, Hidden_Size].
These embeddings are added element-wise to the patch projections, providing the model with the spatial coordinates of each patch.
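A minimal sketch of this addition, assuming the lookup table is realized as an nn.Embedding indexed by sequential position IDs (an illustration, not the library source):

```python
import torch
import torch.nn as nn

# A learned lookup table with one row per patch position.
num_patches, hidden_size = 196, 768
position_embedding = nn.Embedding(num_patches, hidden_size)

# Position IDs 0..195, one per patch, broadcast over the batch dimension.
position_ids = torch.arange(num_patches).unsqueeze(0)  # [1, 196]

# Element-wise addition of positional information to the patch projections.
patch_tokens = torch.randn(1, num_patches, hidden_size)
embeddings = patch_tokens + position_embedding(position_ids)
print(embeddings.shape)  # torch.Size([1, 196, 768])
```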
Usage Example
While typically handled internally by the PaliGemmaForConditionalGeneration class, you can interact with the embedding layer directly for inspection or custom pipelines.
```python
import torch
from modeling_siglip import SiglipVisionConfig, SiglipVisionEmbeddings

# 1. Setup Configuration
config = SiglipVisionConfig(
    image_size=224,
    patch_size=16,
    hidden_size=768
)

# 2. Initialize the Embedding Layer
embed_layer = SiglipVisionEmbeddings(config)

# 3. Simulate a preprocessed image batch [Batch, Channels, Height, Width]
# Note: Images should be normalized and resized to 224x224
pixel_values = torch.randn(1, 3, 224, 224)

# 4. Generate Visual Tokens
visual_embeddings = embed_layer(pixel_values)
print(visual_embeddings.shape)
# Expected Output: torch.Size([1, 196, 768])
```
API Reference
SiglipVisionEmbeddings
forward(pixel_values)
Processes a batch of images into a sequence of embeddings.
- Input: pixel_values (torch.FloatTensor): A tensor of shape (batch_size, num_channels, height, width).
- Returns: embeddings (torch.Tensor): A tensor of shape (batch_size, num_patches, hidden_size) representing the visual tokens ready for the Transformer blocks.