SigLIP Vision Encoder
The SigLIP Vision Encoder serves as the visual backbone of the PaliGemma architecture. It is responsible for converting raw pixel data into a sequence of continuous embeddings (visual tokens) that the language model can interpret. This implementation follows the Vision Transformer (ViT) approach, utilizing patch-based embeddings and a transformer-encoder architecture.
SiglipVisionConfig
The SiglipVisionConfig class defines the hyperparameters for the vision encoder. These settings determine the model's capacity and the granularity of the image processing.
| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| hidden_size | int | 768 | Dimensionality of the encoder layers and the pooling layer. |
| intermediate_size | int | 3072 | Dimensionality of the "intermediate" (feed-forward) layer. |
| num_hidden_layers | int | 12 | Number of hidden layers in the Transformer encoder. |
| num_attention_heads | int | 12 | Number of attention heads for each attention layer. |
| num_channels | int | 3 | The number of input channels (RGB). |
| image_size | int | 224 | The size (resolution) of each image. |
| patch_size | int | 16 | The size (resolution) of each patch. |
| layer_norm_eps | float | 1e-6 | The epsilon used by the layer normalization layers. |
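The table above can be captured in a small configuration class. Below is a dataclass sketch with the listed defaults; the repository's actual class may accept additional keyword arguments, so treat this as illustrative:

```python
from dataclasses import dataclass

@dataclass
class SiglipVisionConfig:
    # Defaults mirror the hyperparameter table (ViT-Base-style settings).
    hidden_size: int = 768
    intermediate_size: int = 3072
    num_hidden_layers: int = 12
    num_attention_heads: int = 12
    num_channels: int = 3
    image_size: int = 224
    patch_size: int = 16
    layer_norm_eps: float = 1e-6

# Override only what differs from the defaults.
config = SiglipVisionConfig(image_size=448)
print(config.image_size)   # 448
print(config.patch_size)   # 16 (default)
```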
SiglipVisionModel
The SiglipVisionModel is the primary class for extracting visual features. It orchestrates the embedding process and the subsequent transformer layers.
Forward Pass
```python
def forward(self, pixel_values: torch.FloatTensor) -> torch.FloatTensor:
```
Inputs:
- pixel_values (`torch.FloatTensor` of shape `[batch_size, channels, height, width]`): Preprocessed image tensors. Images are typically resized to the `image_size` defined in the config and normalized.
Outputs:
- last_hidden_state (`torch.FloatTensor` of shape `[batch_size, num_patches, hidden_size]`): The sequence of hidden states at the output of the last layer of the model.
Usage Example
```python
from modeling_siglip import SiglipVisionConfig, SiglipVisionModel
import torch

# Initialize configuration and model
config = SiglipVisionConfig(image_size=224, patch_size=16, hidden_size=768)
model = SiglipVisionModel(config)

# Simulated preprocessed image [Batch, Channels, Height, Width]
pixel_values = torch.randn(1, 3, 224, 224)

# Get visual embeddings
with torch.no_grad():
    visual_outputs = model(pixel_values)

print(visual_outputs.shape)
# Expected Output: torch.Size([1, 196, 768])
# (196 patches = (224/16)^2)
```
Internal Components
While the following components are managed internally by SiglipVisionModel, understanding their role is helpful for conceptualizing the vision pipeline.
SiglipVisionEmbeddings
This module handles the "patchification" of the image.
- Patch Embedding: A 2D convolution with kernel size and stride equal to `patch_size` divides the image into a grid of non-overlapping patches and projects each one into `hidden_size` dimensions.
- Flattening: The grid is flattened into a sequence of tokens.
- Position Embedding: Since Transformers are permutation-invariant, a learned `position_embedding` is added to the patch embeddings to retain spatial information.
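The patchification steps above can be sketched as a small module. Class and attribute names here (e.g. `PatchEmbeddings`) are illustrative, not necessarily the repository's exact implementation:

```python
import torch
import torch.nn as nn

class PatchEmbeddings(nn.Module):
    # Illustrative sketch of patch + position embedding.
    def __init__(self, hidden_size=768, num_channels=3, image_size=224, patch_size=16):
        super().__init__()
        # Conv2d with kernel == stride == patch_size extracts non-overlapping patches.
        self.patch_embedding = nn.Conv2d(
            num_channels, hidden_size, kernel_size=patch_size, stride=patch_size
        )
        num_patches = (image_size // patch_size) ** 2
        # One learned embedding vector per patch position.
        self.position_embedding = nn.Embedding(num_patches, hidden_size)
        self.register_buffer(
            "position_ids", torch.arange(num_patches).unsqueeze(0), persistent=False
        )

    def forward(self, pixel_values):
        x = self.patch_embedding(pixel_values)   # [B, hidden, H/p, W/p]
        x = x.flatten(2).transpose(1, 2)         # [B, num_patches, hidden]
        return x + self.position_embedding(self.position_ids)

emb = PatchEmbeddings()
out = emb(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 196, 768])
```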
SiglipVisionTransformer
A standard Transformer Encoder stack. Each layer consists of:
- Multi-Head Attention: Allows the model to attend to different parts of the image simultaneously.
- MLP (Feed-Forward): Two linear layers with a non-linear activation (typically `gelu`) in between.
- Layer Normalization: Applied before the attention and MLP blocks (Pre-Norm architecture).
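A minimal pre-norm encoder layer along these lines is sketched below, using `torch.nn.MultiheadAttention` as a stand-in for the repository's attention implementation:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    # Sketch of one pre-norm Transformer encoder layer.
    def __init__(self, hidden_size=768, intermediate_size=3072, num_heads=12, eps=1e-6):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_size, eps=eps)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_size, eps=eps)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.GELU(),
            nn.Linear(intermediate_size, hidden_size),
        )

    def forward(self, x):
        # Normalize *before* attention (Pre-Norm), then add the residual.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Same pre-norm + residual pattern for the MLP block.
        x = x + self.mlp(self.norm2(x))
        return x

layer = EncoderLayer()
x = torch.randn(1, 196, 768)
print(layer(x).shape)  # torch.Size([1, 196, 768])
```

Note that the shape is preserved end to end, which is what allows the layers to be stacked `num_hidden_layers` times.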
Performance Note
In the context of the full VLM, the number of visual tokens is calculated as:

$$\text{num\_tokens} = \left(\frac{\text{image\_size}}{\text{patch\_size}}\right)^2$$

For a $224 \times 224$ image with a $16 \times 16$ patch size, this results in 196 tokens, which are then projected and prepended to the text tokens for the language model.
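A quick sanity check of the token count; `num_visual_tokens` is a hypothetical helper, not part of the repository:

```python
def num_visual_tokens(image_size: int, patch_size: int) -> int:
    # The image is divided into (image_size / patch_size)^2 non-overlapping patches.
    return (image_size // patch_size) ** 2

print(num_visual_tokens(224, 16))  # 196
print(num_visual_tokens(448, 16))  # 784 -- token count grows quadratically with resolution
```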