Image Preprocessing Pipeline
The image preprocessing pipeline transforms raw images into a numerical format compatible with the SigLIP vision encoder. This process ensures that input data is consistently sized, scaled, and normalized according to the model's training requirements.
Overview
The primary interface for image preprocessing is the PaliGemmaProcessor, which handles both text and image inputs. Its image pipeline converts PIL.Image objects into normalized PyTorch tensors of shape [Batch_Size, Channels, Height, Width].
Preprocessing Steps
The pipeline executes the following transformations in sequence:
- Resizing: Images are resized to the target dimensions defined in the model configuration (e.g., 224x224) using bicubic interpolation.
- Rescaling: Pixel values are rescaled from the integer range [0, 255] to the floating-point range [0, 1] by multiplying by a factor of 1/255.0.
- Normalization: The pipeline normalizes each channel with a mean and standard deviation of 0.5, which maps the [0, 1] range to [-1, 1] and centers the input distribution around zero.
  - Mean: [0.5, 0.5, 0.5]
  - Std: [0.5, 0.5, 0.5]
  - Formula: (pixel - mean) / std
- Channel Transposition: Input images are typically in [Height, Width, Channels] format. The pipeline transposes them to [Channels, Height, Width] to meet the requirements of PyTorch's convolutional layers.
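The rescaling, normalization, and transposition steps can be traced on a tiny array. The following is a minimal sketch that uses only NumPy; the 2x2 image and its pixel values are made up for illustration and are not taken from the pipeline's code.

```python
import numpy as np

# Hypothetical 2x2 RGB image in [Height, Width, Channels] layout, uint8 values.
image = np.array(
    [[[0, 0, 0], [255, 255, 255]],
     [[51, 102, 204], [255, 0, 51]]],
    dtype=np.uint8,
)

# Rescaling: [0, 255] integers -> [0, 1] floats.
rescaled = image.astype(np.float32) * (1 / 255.0)

# Normalization: (pixel - mean) / std, broadcast over the last (channel) axis.
mean = np.array([0.5, 0.5, 0.5], dtype=np.float32)
std = np.array([0.5, 0.5, 0.5], dtype=np.float32)
normalized = (rescaled - mean) / std

# Channel transposition: [H, W, C] -> [C, H, W].
chw = normalized.transpose(2, 0, 1)

print(chw.shape)             # (3, 2, 2)
print(chw.min(), chw.max())  # approximately -1 and 1
```

With a mean and standard deviation of 0.5, the formula reduces to 2 * pixel - 1, so black pixels land near -1 and white pixels near 1.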
The PaliGemmaProcessor
The PaliGemmaProcessor is the high-level class used during inference to prepare inputs.
Usage Example
from PIL import Image
from transformers import AutoTokenizer
from processing_paligemma import PaliGemmaProcessor
# Load the tokenizer
# (assumption: a Hugging Face tokenizer stored alongside the model weights)
tokenizer = AutoTokenizer.from_pretrained("path/to/model")
# Initialize the processor
# num_image_tokens usually matches (image_size // patch_size)**2
processor = PaliGemmaProcessor(tokenizer, num_image_tokens=256, image_size=224)
# Load an image
image = Image.open("path/to/image.jpg")
# Process text and image
# Returns a dictionary containing 'pixel_values', 'input_ids', and 'attention_mask'
inputs = processor(text=["Describe this image"], images=[image])
pixel_values = inputs["pixel_values"]
print(pixel_values.shape) # Output: torch.Size([1, 3, 224, 224])
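The relationship between num_image_tokens and image_size noted in the comment above can be checked with a few lines. This is a sketch only: the helper name is hypothetical, and the patch size of 14 is an assumption inferred from the 224 / 256 pairing in the example rather than a value stated in this document.

```python
def num_image_tokens(image_size: int, patch_size: int) -> int:
    """Count the non-overlapping square patches in a square image."""
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2


# Assumed patch size of 14: 224 // 14 = 16 patches per side.
print(num_image_tokens(224, 14))  # 256
```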
API Reference
process_images (Internal Utility)
While generally called via the processor, the process_images function encapsulates the core transformation logic.
| Parameter | Type | Description |
| :--- | :--- | :--- |
| images | List[PIL.Image.Image] | A list of images to process. |
| size | Tuple[int, int] | Target (height, width) for resizing. |
| resample | Image.Resampling | Interpolation method (defaults to Bicubic). |
| rescale_factor | float | Scaling factor (defaults to 1/255.0). |
| image_mean | List[float] | Sequence of means for normalization. |
| image_std | List[float] | Sequence of standard deviations for normalization. |
Returns:
List[np.ndarray]: A list of processed images, each with shape [Channels, Height, Width].
Technical Implementation Details
- Data Types: The pipeline casts pixel values to np.float32 during rescaling to maintain precision.
- Batching: The PaliGemmaProcessor.__call__ method automatically stacks processed images into a single torch.Tensor with a batch dimension.
- Padding/Truncation: Currently, the implementation is optimized for single-image and single-prompt inference.
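The batching step can be illustrated as follows. This sketch uses NumPy only to stay dependency-light; the helper name stack_batch is hypothetical, and the final conversion to a torch.Tensor is mentioned in a comment rather than executed.

```python
from typing import List

import numpy as np


def stack_batch(processed: List[np.ndarray]) -> np.ndarray:
    """Stack [C, H, W] arrays into one [B, C, H, W] array."""
    return np.stack(processed, axis=0)


# A single stand-in for a processed image, matching single-image inference.
processed = [np.zeros((3, 224, 224), dtype=np.float32)]
batch = stack_batch(processed)
print(batch.shape)  # (1, 3, 224, 224)

# In a PyTorch pipeline, torch.from_numpy(batch) or torch.tensor(batch)
# would then produce the tensor passed to the model.
```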