SigLIP Vision Encoder
The SigLIP Vision Encoder serves as the visual backbone of the PaliGemma architecture. It is responsible for converting raw pixel data into a sequence of continuous embeddings (visual tokens) that the language model can interpret. This implementation follows the Vision Transformer (ViT) approach, utilizing patch-based embeddings and a transformer-encoder architecture.
SiglipVisionConfig
The SiglipVisionConfig class defines the hyperparameters for the vision encoder. These settings determine the model's capacity and the granularity of the image processing.
| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| hidden_size | int | 768 | Dimensionality of the encoder layers and the pooling layer. |
| intermediate_size | int | 3072 | Dimensionality of the "intermediate" (feed-forward) layer. |
| num_hidden_layers | int | 12 | Number of hidden layers in the Transformer encoder. |
| num_attention_heads | int | 12 | Number of attention heads for each attention layer. |
| num_channels | int | 3 | The number of input channels (RGB). |
| image_size | int | 224 | The size (resolution) of each image. |
| patch_size | int | 16 | The size (resolution) of each patch. |
| layer_norm_eps | float | 1e-6 | The epsilon used by the layer normalization layers. |
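The table above can be captured in a small configuration class. Below is a dataclass sketch with the listed defaults; the repository's actual class may accept additional keyword arguments, so treat this as illustrative:

```python
from dataclasses import dataclass

@dataclass
class SiglipVisionConfig:
    # Defaults mirror the hyperparameter table (ViT-Base-style settings).
    hidden_size: int = 768
    intermediate_size: int = 3072
    num_hidden_layers: int = 12
    num_attention_heads: int = 12
    num_channels: int = 3
    image_size: int = 224
    patch_size: int = 16
    layer_norm_eps: float = 1e-6

# Override only what differs from the defaults.
config = SiglipVisionConfig(image_size=448)
print(config.image_size)   # 448
print(config.patch_size)   # 16 (default)
```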
SiglipVisionModel
The SiglipVisionModel is the primary class for extracting visual features. It orchestrates the embedding process and the subsequent transformer layers.
Forward Pass
```python
def forward(self, pixel_values: torch.FloatTensor) -> torch.FloatTensor:
```
Inputs:
- pixel_values (`torch.FloatTensor` of shape `[batch_size, channels, height, width]`): Preprocessed image tensors. Images are typically resized to the `image_size` defined in the config and normalized.
Outputs:
- last_hidden_state (`torch.FloatTensor` of shape `[batch_size, num_patches, hidden_size]`): The sequence of hidden states at the output of the last layer of the model.
Usage Example
```python
from modeling_siglip import SiglipVisionConfig, SiglipVisionModel
import torch

# Initialize configuration and model
config = SiglipVisionConfig(image_size=224, patch_size=16, hidden_size=768)
model = SiglipVisionModel(config)

# Simulated preprocessed image [Batch, Channels, Height, Width]
pixel_values = torch.randn(1, 3, 224, 224)

# Get visual embeddings
with torch.no_grad():
    visual_outputs = model(pixel_values)

print(visual_outputs.shape)
# Expected Output: torch.Size([1, 196, 768])
# (196 patches = (224/16)^2)
```
Internal Components
While the following components are managed internally by SiglipVisionModel, understanding their role is helpful for conceptualizing the vision pipeline.
SiglipVisionEmbeddings
This module handles the "patchification" of the image.
- Patch Embedding: A 2D convolution with kernel size and stride equal to `patch_size` divides the image into a grid of non-overlapping patches and projects each one into `hidden_size` dimensions.
- Flattening: The grid is flattened into a sequence of tokens.
- Position Embedding: Since Transformers are permutation-invariant, a learned `position_embedding` is added to the patch embeddings to retain spatial information.
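The patchification steps above can be sketched as a small module. Class and attribute names here (e.g. `PatchEmbeddings`) are illustrative, not necessarily the repository's exact implementation:

```python
import torch
import torch.nn as nn

class PatchEmbeddings(nn.Module):
    # Illustrative sketch of patch + position embedding.
    def __init__(self, hidden_size=768, num_channels=3, image_size=224, patch_size=16):
        super().__init__()
        # Conv2d with kernel == stride == patch_size extracts non-overlapping patches.
        self.patch_embedding = nn.Conv2d(
            num_channels, hidden_size, kernel_size=patch_size, stride=patch_size
        )
        num_patches = (image_size // patch_size) ** 2
        # One learned embedding vector per patch position.
        self.position_embedding = nn.Embedding(num_patches, hidden_size)
        self.register_buffer(
            "position_ids", torch.arange(num_patches).unsqueeze(0), persistent=False
        )

    def forward(self, pixel_values):
        x = self.patch_embedding(pixel_values)   # [B, hidden, H/p, W/p]
        x = x.flatten(2).transpose(1, 2)         # [B, num_patches, hidden]
        return x + self.position_embedding(self.position_ids)

emb = PatchEmbeddings()
out = emb(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 196, 768])
```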
SiglipVisionTransformer
A standard Transformer Encoder stack. Each layer consists of:
- Multi-Head Attention: Allows the model to attend to different parts of the image simultaneously.
- MLP (Feed-Forward): Two linear layers with a non-linear activation (typically `gelu`) in between.
- Layer Normalization: Applied before the attention and MLP blocks (Pre-Norm architecture).
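A minimal pre-norm encoder layer along these lines is sketched below, using `torch.nn.MultiheadAttention` as a stand-in for the repository's attention implementation:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    # Sketch of one pre-norm Transformer encoder layer.
    def __init__(self, hidden_size=768, intermediate_size=3072, num_heads=12, eps=1e-6):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_size, eps=eps)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_size, eps=eps)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.GELU(),
            nn.Linear(intermediate_size, hidden_size),
        )

    def forward(self, x):
        # Normalize *before* attention (Pre-Norm), then add the residual.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Same pre-norm + residual pattern for the MLP block.
        x = x + self.mlp(self.norm2(x))
        return x

layer = EncoderLayer()
x = torch.randn(1, 196, 768)
print(layer(x).shape)  # torch.Size([1, 196, 768])
```

Note that the shape is preserved end to end, which is what allows the layers to be stacked `num_hidden_layers` times.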
Performance Note
In the context of the full VLM, the number of visual tokens is calculated as:

$$\text{num\_tokens} = \left(\frac{\text{image\_size}}{\text{patch\_size}}\right)^2$$

For a $224 \times 224$ image with a $16 \times 16$ patch size, this results in 196 tokens, which are then projected and prepended to the text tokens for the language model.
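A quick sanity check of the token count; `num_visual_tokens` is a hypothetical helper, not part of the repository:

```python
def num_visual_tokens(image_size: int, patch_size: int) -> int:
    # The image is divided into (image_size / patch_size)^2 non-overlapping patches.
    return (image_size // patch_size) ** 2

print(num_visual_tokens(224, 16))  # 196
print(num_visual_tokens(448, 16))  # 784 -- token count grows quadratically with resolution
```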