Multimodal Processor

The PaliGemmaProcessor is the primary interface for preparing data for the PaliGemma model. It synchronizes the preprocessing of both visual and textual inputs, ensuring that image pixels and text tokens are correctly formatted, normalized, and aligned before being passed to the model.

Initialization

The processor requires an existing tokenizer and configuration parameters derived from the vision backbone. During initialization, it automatically extends the tokenizer vocabulary with the <image> placeholder token and with the extra tokens used for object detection and segmentation.

from transformers import AutoTokenizer
from processing_paligemma import PaliGemmaProcessor

tokenizer = AutoTokenizer.from_pretrained("google/paligemma-3b-pt-224")
processor = PaliGemmaProcessor(
    tokenizer=tokenizer,
    num_image_tokens=256, # Number of patches from the vision encoder
    image_size=224        # Expected input resolution
)
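
The relationship between image_size and num_image_tokens, and the shape of the extra task tokens, can be sketched as follows. This is an illustrative sketch, not the module's code: the helper names are hypothetical, and the patch size of 14 is an assumption chosen because it is consistent with the values above (224 / 14 = 16 patches per side, 16 * 16 = 256).

```python
# Illustrative sketch; helper names and PATCH_SIZE are assumptions,
# not part of processing_paligemma.
PATCH_SIZE = 14  # assumed patch size of the vision backbone


def expected_num_image_tokens(image_size, patch_size=PATCH_SIZE):
    """Count the non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2


def build_extra_tokens():
    """Build the location and segmentation token strings."""
    loc_tokens = [f"<loc{i:04d}>" for i in range(1024)]  # <loc0000> to <loc1023>
    seg_tokens = [f"<seg{i:03d}>" for i in range(128)]   # <seg000> to <seg127>
    return loc_tokens + seg_tokens


extra_tokens = build_extra_tokens()
num_tokens_224 = expected_num_image_tokens(224)
```

With a Hugging Face tokenizer, a list like extra_tokens would typically be registered through tokenizer.add_tokens(extra_tokens); whether the processor does exactly this internally is not stated here.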

Main Interface: `__call__`

The processor is used as a callable. It takes prompts and images, each passed as a list, and returns a dictionary of tensors ready for model consumption.

Usage Example

from PIL import Image

# Load inputs
image = Image.open("sample_image.jpg")
prompt = "caption this image"

# Process inputs
model_inputs = processor(
    text=[prompt], 
    images=[image]
)

# Accessing processed tensors
input_ids = model_inputs["input_ids"]           # [Batch, Seq_Len]
pixel_values = model_inputs["pixel_values"]     # [Batch, Channel, Height, Width]
attention_mask = model_inputs["attention_mask"] # [Batch, Seq_Len]

API Reference

Parameters:

- text: List of prompt strings.
- images: List of PIL images.

Returns:

dict: A dictionary containing:
- pixel_values: Normalized and rescaled image tensors.
- input_ids: Tokenized prompt prefixed with a fixed number of <image> tokens and standard special tokens.
- attention_mask: Mask identifying valid tokens versus padding.
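
To make the role of attention_mask concrete, here is a minimal sketch of padding a batch of token-id sequences to a common length. The function name, the right-side padding, and the pad id of 0 are all assumptions for illustration; the processor's actual padding behavior is not specified here.

```python
# Illustrative sketch; pad_batch, right-side padding, and pad_id=0 are assumptions.
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to equal length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


ids, mask = pad_batch([[5, 6, 7], [8]])
```

Both returned lists have shape [Batch, Seq_Len], matching the shapes noted in the usage example.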

Preprocessing Logic

The PaliGemmaProcessor encapsulates several internal transformation steps to ensure compatibility with the model weights:

Image Processing

The processor handles the following transformations via internal utility functions:

- Resizing: Images are resized to the target image_size using bicubic resampling.
- Rescaling: Pixel values are scaled by 1/255.0 to the [0, 1] range.
- Normalization: Each channel is normalized with mean 0.5 and standard deviation 0.5, which maps values from [0, 1] to [-1, 1].
- Permutation: Channels are moved to the first dimension ([H, W, C] becomes [C, H, W]).
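
The rescaling, normalization, and permutation steps can be sketched with NumPy as below. This is a minimal sketch under stated assumptions: the function name is hypothetical, the input is assumed to be a uint8 array that has already been resized to the target resolution, and resizing itself is omitted.

```python
import numpy as np


# Illustrative sketch; preprocess_pixels is a hypothetical name and the
# input is assumed to be already resized to image_size x image_size.
def preprocess_pixels(image_hwc, mean=0.5, std=0.5):
    """Convert a uint8 [H, W, C] array into a normalized float [C, H, W] array."""
    x = image_hwc.astype(np.float32) / 255.0  # rescale to [0, 1]
    x = (x - mean) / std                      # normalize to [-1, 1]
    return x.transpose(2, 0, 1)               # [H, W, C] -> [C, H, W]


# Dummy image: first channel fully saturated, the other two channels black.
dummy = np.zeros((224, 224, 3), dtype=np.uint8)
dummy[..., 0] = 255
pixel_values = preprocess_pixels(dummy)
```

Stacking such arrays along a new leading axis would give the [Batch, Channel, Height, Width] layout shown for pixel_values.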

Text Formatting

The processor modifies the input text to match the specific template PaliGemma was trained on:

- Image Token Prepending: Adds a fixed sequence of <image> tokens to the start of the prompt.
- Special Tokens: Inserts the <bos> (beginning-of-sequence) token between the image tokens and the prompt text.
- Formatting: Appends a newline (\n) to the prompt to signal the end of the instruction.
- Task Tokens: The vocabulary includes <loc0000> to <loc1023> for detection and <seg000> to <seg127> for segmentation tasks.
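
The template described above can be sketched as a single string-building step. The function name is hypothetical, and the token order (image tokens, then <bos>, then the prompt, then the newline) is an assumption inferred from this description.

```python
# Illustrative sketch; build_prompt is a hypothetical helper, and the
# token order is inferred from the description of the template.
IMAGE_TOKEN = "<image>"
BOS_TOKEN = "<bos>"


def build_prompt(prompt, num_image_tokens,
                 image_token=IMAGE_TOKEN, bos_token=BOS_TOKEN):
    """Prefix the prompt with image placeholders and <bos>, then end with a newline."""
    return f"{image_token * num_image_tokens}{bos_token}{prompt}\n"


formatted = build_prompt("caption this image", num_image_tokens=256)
```

The resulting string is what would be passed to the tokenizer, so the number of <image> placeholders must match the number of embeddings the vision encoder produces.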
