Multimodal Processor
The PaliGemmaProcessor is the primary interface for preparing data for the PaliGemma model. It synchronizes the preprocessing of both visual and textual inputs, ensuring that image pixels and text tokens are correctly formatted, normalized, and aligned before being passed to the model.
Initialization
The processor requires an existing tokenizer and configuration parameters derived from the vision backbone. During initialization, it automatically extends the tokenizer vocabulary with special tokens required for multimodal tasks (e.g., <image>) and specialized tasks like object detection and segmentation.
from transformers import AutoTokenizer
from processing_paligemma import PaliGemmaProcessor
tokenizer = AutoTokenizer.from_pretrained("google/paligemma-3b-pt-224")
processor = PaliGemmaProcessor(
    tokenizer=tokenizer,
    num_image_tokens=256,  # Number of patches from the vision encoder
    image_size=224,  # Expected input resolution
)
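As a rough illustration of the two quantities involved in initialization, the sketch below builds the task tokens in the ranges this document lists and shows how a value like num_image_tokens=256 can follow from the input resolution. The helper names are hypothetical, and the patch size of 14 is an assumption about the vision backbone rather than something taken from processing_paligemma.

```python
def extra_task_tokens():
    """Build the location and segmentation tokens in the ranges listed in this document."""
    loc_tokens = [f"<loc{i:04d}>" for i in range(1024)]  # <loc0000> ... <loc1023>
    seg_tokens = [f"<seg{i:03d}>" for i in range(128)]   # <seg000> ... <seg127>
    return loc_tokens + seg_tokens


def patches_per_image(image_size, patch_size=14):
    """Number of non-overlapping square patches in a square image.

    patch_size=14 is an assumption made for this sketch.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side


tokens = extra_task_tokens()
num_image_tokens = patches_per_image(224)
```

With a Hugging Face tokenizer, a list like this would typically be registered through `tokenizer.add_tokens(...)`, and the `<image>` placeholder through `tokenizer.add_special_tokens(...)`; whether the processor does exactly this internally is not stated here.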
Main Interface: __call__
The processor is used as a callable. It takes a list of text prompts and a list of images and returns a dictionary of tensors ready to be passed to the model.
Usage Example
from PIL import Image
# Load inputs
image = Image.open("sample_image.jpg")
prompt = "caption this image"
# Process inputs
model_inputs = processor(
    text=[prompt],
    images=[image],
)
# Accessing processed tensors
input_ids = model_inputs["input_ids"] # [Batch, Seq_Len]
pixel_values = model_inputs["pixel_values"] # [Batch, Channel, Height, Width]
attention_mask = model_inputs["attention_mask"] # [Batch, Seq_Len]
API Reference
Parameters:
| Parameter | Type | Description |
| :--- | :--- | :--- |
| text | List[str] | A list of strings containing the text prompts. |
| images | List[Image.Image] | A list of PIL Images to be processed. |
| padding | str | Padding strategy (defaults to "longest"). |
| truncation | bool | Whether to truncate sequences longer than the model's max length. |
Returns:
dict: A dictionary containing:
- pixel_values: Normalized and rescaled image tensors.
- input_ids: Tokenized prompt prefixed with a fixed number of <image> tokens and standard special tokens.
- attention_mask: Mask identifying valid tokens versus padding.
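To make the relationship between padding="longest" and attention_mask concrete, here is a minimal sketch in plain Python. The function name, the pad id of 0, and right-side padding are all assumptions chosen for the example; the actual tokenizer may pad on the other side or use a different pad id.

```python
def pad_to_longest(batch, pad_id=0):
    """Pad every token-id list to the length of the longest one.

    Returns (input_ids, attention_mask), where the mask holds 1 for real
    tokens and 0 for padding. Right-side padding and pad_id=0 are
    assumptions made for this sketch.
    """
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append(list(ids) + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


ids, mask = pad_to_longest([[5, 6, 7], [8]])
```

Every row of input_ids ends up the same length, which is what allows the batch to be stacked into a single [Batch, Seq_Len] tensor.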
Preprocessing Logic
The PaliGemmaProcessor encapsulates several internal transformation steps to ensure compatibility with the model weights:
Image Processing
The processor handles the following transformations via internal utility functions:
- Resizing: Images are resized to the target image_size using bicubic resampling.
- Rescaling: Pixel values are scaled by 1/255.0 to the [0, 1] range.
- Normalization: Each channel is normalized with mean 0.5 and standard deviation 0.5, which maps the [0, 1] range to [-1, 1].
- Permutation: Channels are moved to the first dimension ([C, H, W]).
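The four image steps can be sketched with Pillow and NumPy as follows. This is an illustrative reimplementation, not the processor's internal utility code: the function name is hypothetical, and the conversion to RGB is an added assumption so that every input has three channels.

```python
import numpy as np
from PIL import Image


def preprocess_image(image, image_size=224, mean=0.5, std=0.5):
    """Resize, rescale, normalize, and permute one PIL image to [C, H, W]."""
    # 1. Resize to a square target resolution with bicubic resampling.
    #    (convert("RGB") is an assumption, added to guarantee 3 channels.)
    resized = image.convert("RGB").resize(
        (image_size, image_size), resample=Image.BICUBIC
    )
    # 2. Rescale 8-bit pixel values from [0, 255] to [0, 1].
    pixels = np.asarray(resized, dtype=np.float32) / 255.0
    # 3. Normalize: (x - 0.5) / 0.5 maps [0, 1] to [-1, 1].
    pixels = (pixels - mean) / std
    # 4. Move channels first: [H, W, C] -> [C, H, W].
    return pixels.transpose(2, 0, 1)


# A solid red test image stands in for a real photo.
red = Image.new("RGB", (64, 48), (255, 0, 0))
chw = preprocess_image(red)
batch = np.stack([chw])  # [Batch, Channel, Height, Width]
```

Stacking the per-image arrays yields the [Batch, Channel, Height, Width] layout shown for pixel_values in the usage example.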
Text Formatting
The processor modifies the input text to match the specific template PaliGemma was trained on:
- Image Token Prepending: Adds a fixed sequence of <image> tokens to the start of the prompt.
- Special Tokens: Inserts the <bos> (beginning of sequence) token.
- Formatting: Appends a newline (\n) to the prompt to signal the end of the instruction.
- Task Tokens: The extended vocabulary includes <loc0000> through <loc1023> for detection and <seg000> through <seg127> for segmentation tasks.
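Put together, the text formatting steps amount to simple string assembly before tokenization. The sketch below assumes the order image tokens, then <bos>, then the prompt, then the newline; the function name and its defaults are hypothetical and chosen only to mirror the description above.

```python
def build_prompt(prompt, num_image_tokens=256, image_token="<image>", bos_token="<bos>"):
    """Assemble the templated string: image placeholders, <bos>, prompt, newline.

    The ordering is an assumption based on the steps described in this section.
    """
    return f"{image_token * num_image_tokens}{bos_token}{prompt}\n"


# A tiny num_image_tokens keeps the example readable.
example = build_prompt("caption this image", num_image_tokens=2)
```

Because the number of <image> placeholders is fixed, the image portion of input_ids has the same length for every sample, and only the prompt portion varies.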