Image Preprocessing Pipeline

The image preprocessing pipeline transforms raw images into a numerical format compatible with the SigLIP vision encoder. This process ensures that input data is consistently sized, scaled, and normalized according to the model's training requirements.

Overview

The primary interface for image preprocessing is the PaliGemmaProcessor. It handles both text and vision data, but the image pipeline is concerned only with converting PIL images (PIL.Image.Image objects) into normalized PyTorch tensors of shape [Batch_Size, Channels, Height, Width].

Preprocessing Steps

The pipeline executes the following transformations in sequence:

1. Resizing: Images are resized to the target dimensions defined in the model configuration (e.g., 224x224) using bicubic interpolation.
2. Rescaling: Pixel values are rescaled from the integer range [0, 255] to the floating-point range [0, 1] by multiplying by 1/255.0.
3. Normalization: Each channel is normalized with a mean and standard deviation of 0.5, which maps the rescaled range [0, 1] to [-1, 1]. These values are not the per-channel statistics of the ImageNet dataset (approximately [0.485, 0.456, 0.406] and [0.229, 0.224, 0.225]).
   - Mean: [0.5, 0.5, 0.5]
   - Std: [0.5, 0.5, 0.5]
   - Formula: (pixel - mean) / std
4. Channel Transposition: Input images are typically in [Height, Width, Channels] format. The pipeline transposes them to [Channels, Height, Width], the channels-first layout expected by PyTorch's convolutional layers.
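
The four steps above can be sketched with Pillow and NumPy. This is a minimal illustration under stated assumptions, not the project's own process_images code: the function name preprocess_image, the convert("RGB") call, and the default size are hypothetical choices made for the example.

```python
import numpy as np
from PIL import Image

# Normalization constants taken from the list above.
MEAN = np.array([0.5, 0.5, 0.5], dtype=np.float32)
STD = np.array([0.5, 0.5, 0.5], dtype=np.float32)


def preprocess_image(image, size=(224, 224)):
    """Hypothetical helper: PIL image -> float32 array of shape [C, H, W]."""
    # 1. Resize with bicubic interpolation. PIL expects (width, height).
    #    convert("RGB") is an assumption so grayscale/RGBA inputs get 3 channels.
    resized = image.convert("RGB").resize(size, resample=Image.BICUBIC)
    # 2. Rescale integer pixels from [0, 255] to floats in [0, 1].
    pixels = np.asarray(resized).astype(np.float32) * (1.0 / 255.0)
    # 3. Normalize per channel: (pixel - mean) / std, giving values in [-1, 1].
    pixels = (pixels - MEAN) / STD
    # 4. Transpose [H, W, C] -> [C, H, W].
    return pixels.transpose(2, 0, 1)


# A synthetic image keeps the example independent of any file on disk.
demo = Image.new("RGB", (640, 480), color=(255, 0, 128))
out = preprocess_image(demo)
print(out.shape, out.dtype)  # (3, 224, 224) float32
```

With a mean and std of 0.5, a fully saturated channel (255) ends up near 1.0 and an empty channel (0) near -1.0, which is a quick way to sanity-check the output range.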

The PaliGemmaProcessor

The PaliGemmaProcessor is the high-level class used during inference to prepare inputs.

Usage Example

from PIL import Image
from processing_paligemma import PaliGemmaProcessor

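# Initialize the processor.
# `tokenizer` must be a tokenizer already loaded for the model; its construction is not shown here.
# num_image_tokens usually matches (image_size // patch_size)**2
processor = PaliGemmaProcessor(tokenizer, num_image_tokens=256, image_size=224)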

# Load an image
image = Image.open("path/to/image.jpg")

# Process text and image
# Returns a dictionary containing 'pixel_values', 'input_ids', and 'attention_mask'
inputs = processor(text=["Describe this image"], images=[image])

pixel_values = inputs["pixel_values"] 
print(pixel_values.shape) # Output: torch.Size([1, 3, 224, 224])
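
The relationship between num_image_tokens, the image size, and the patch size mentioned in the example's comment can be checked with a few lines. The helper name and the patch size of 14 are assumptions for illustration; 14 is used only because it reproduces the value 256 for a 224-pixel image, so confirm the actual patch size in the model configuration.

```python
def num_image_tokens(image_size: int, patch_size: int) -> int:
    """Hypothetical helper: number of patches in a square image."""
    # Patches along one side; integer division drops any partial patch.
    patches_per_side = image_size // patch_size
    # The patch grid is square, so the total is the side count squared.
    return patches_per_side ** 2


# patch_size=14 is an assumed value, not one stated in this document.
print(num_image_tokens(224, 14))  # 256
```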

API Reference

`process_images` (Internal Utility)

While generally called via the processor, the process_images function encapsulates the core transformation logic.

Returns:

List[np.ndarray]: A list of processed images, each with shape [Channels, Height, Width].

Technical Implementation Details

- Data Types: The pipeline casts pixel values to np.float32 during rescaling to maintain precision.
- Batching: The PaliGemmaProcessor.__call__ method automatically stacks processed images into a single torch.Tensor with a batch dimension.
- Padding/Truncation: The implementation is currently optimized for single-image, single-prompt inference.
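
The batching behavior described above can be illustrated with NumPy alone. This is a sketch rather than the processor's code: stack_images is a hypothetical helper, and the final conversion to a torch.Tensor (for example with torch.from_numpy) is omitted so that the snippet runs without PyTorch installed.

```python
import numpy as np


def stack_images(images):
    """Hypothetical helper: list of [C, H, W] arrays -> one [B, C, H, W] array."""
    # np.stack adds a new leading axis, which becomes the batch dimension.
    return np.stack(images, axis=0)


# One stand-in processed image, matching the single-image use case.
processed = [np.zeros((3, 224, 224), dtype=np.float32)]
batch = stack_images(processed)
print(batch.shape)  # (1, 3, 224, 224)
```

Stacking requires every image to have the same shape, which the resizing step guarantees.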
