Quick Start
This guide will help you set up your environment and run your first multimodal inference using the vlm_from_scratch implementation of PaliGemma.
1. Prerequisites
Ensure you have Python 3.8+ installed. You will need to install the following dependencies:
```shell
pip install torch torchvision transformers pillow safetensors fire numpy
```
2. Prepare Model Weights
This project is designed to work with official PaliGemma weights. You can download these from Hugging Face (e.g., google/paligemma-3b-pt-224).
The directory should contain:
- `*.safetensors` files
- `config.json`
- Tokenizer files
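Before loading, it can save time to sanity-check that a downloaded directory contains everything the loader needs. A minimal sketch (the helper name `check_weights_dir` is hypothetical, and exact tokenizer file names vary by export):

```python
from pathlib import Path

def check_weights_dir(model_path):
    # Return the list of missing items; an empty list means the
    # directory looks complete enough to attempt loading.
    p = Path(model_path)
    missing = []
    if not list(p.glob("*.safetensors")):
        missing.append("*.safetensors")
    if not (p / "config.json").is_file():
        missing.append("config.json")
    if not list(p.glob("tokenizer*")):
        missing.append("tokenizer files")
    return missing
```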
3. Running Inference via CLI
The easiest way to test the model is with the provided `inference.py` script, which uses `fire` to expose a command-line interface.
```shell
python inference.py \
  --model_path "/path/to/your/paligemma-weights" \
  --prompt "question: what is in this image?" \
  --image_file_path "path/to/image.jpg" \
  --max_tokens_to_generate 100 \
  --do_sample False
```
CLI Arguments
| Argument | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| model_path | str | Required | Path to the local directory containing weights and config. |
| prompt | str | Required | The text prompt/question for the model. |
| image_file_path | str | Required | Path to the input image file. |
| max_tokens_to_generate | int | 100 | Maximum number of new tokens to predict. |
| temperature | float | 0.8 | Softmax temperature for sampling. |
| top_p | float | 0.9 | Top-p (nucleus) sampling threshold. |
| do_sample | bool | False | Whether to use sampling (True) or greedy search (False). |
| only_cpu | bool | False | Force inference on CPU even if CUDA/MPS is available. |
4. Programmatic Usage
You can also integrate the model into your own Python scripts.
Loading the Model
Use the load_hf_model utility to initialize the architecture and load the weights into the correct device.
```python
from utils import load_hf_model

device = "cuda"  # or "cpu", "mps"
model, tokenizer = load_hf_model("path/to/model_folder", device)
model.eval()
```
Processing Inputs
The PaliGemmaProcessor handles both image resizing/normalization and text tokenization.
```python
from PIL import Image

from processing_paligemma import PaliGemmaProcessor

# Initialize the processor from the loaded model's vision config
num_image_tokens = model.config.vision_config.num_image_tokens
image_size = model.config.vision_config.image_size
processor = PaliGemmaProcessor(tokenizer, num_image_tokens, image_size)

# Process a single image and prompt
image = Image.open("example.jpg")
model_inputs = processor(text=["Describe this image"], images=[image])
```
Generating Text
Pass the processed inputs to the model. Use the KVCache class for efficient autoregressive decoding.
```python
import torch

from modeling_gemma import KVCache

kv_cache = KVCache()
outputs = model(
    input_ids=model_inputs["input_ids"].to(device),
    pixel_values=model_inputs["pixel_values"].to(device),
    attention_mask=model_inputs["attention_mask"].to(device),
    kv_cache=kv_cache,
)

# Greedily pick the single most likely next token
next_token_logits = outputs["logits"][:, -1, :]
next_token = torch.argmax(next_token_logits, dim=-1, keepdim=True)
print(tokenizer.decode(next_token[0]))
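The single step above extends naturally into a full greedy decoding loop. A sketch, assuming batch size 1 and the calling convention shown above (`generate_greedy` is a hypothetical helper, not part of the repository; whether `pixel_values` must be passed on every step depends on the implementation):

```python
import torch

def generate_greedy(model, model_inputs, kv_cache, eos_token_id, max_new_tokens, device="cpu"):
    # After the first forward pass populates the KV cache, each subsequent
    # step feeds only the newly predicted token.
    input_ids = model_inputs["input_ids"].to(device)
    pixel_values = model_inputs["pixel_values"].to(device)
    attention_mask = model_inputs["attention_mask"].to(device)
    generated = []
    for _ in range(max_new_tokens):
        outputs = model(
            input_ids=input_ids,
            pixel_values=pixel_values,
            attention_mask=attention_mask,
            kv_cache=kv_cache,
        )
        next_token = torch.argmax(outputs["logits"][:, -1, :], dim=-1, keepdim=True)
        if next_token.item() == eos_token_id:
            break
        generated.append(next_token.item())
        # Only the new token goes in next iteration; extend the mask by one.
        input_ids = next_token
        attention_mask = torch.cat(
            [attention_mask, torch.ones((1, 1), dtype=attention_mask.dtype, device=attention_mask.device)],
            dim=-1,
        )
    return generated
```

The collected token IDs can then be decoded in one call with `tokenizer.decode(generated)`.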
Note on Device Support
The project automatically detects the best available backend in the following order:
- CUDA (NVIDIA GPUs)
- MPS (Apple Silicon)
- CPU
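That fallback order can be sketched as a small helper (a sketch only; the project's actual selection logic lives in its own code, and `pick_device` is a hypothetical name):

```python
import torch

def pick_device(only_cpu=False):
    # Prefer CUDA, then Apple MPS, then fall back to CPU,
    # mirroring the --only_cpu CLI flag.
    if not only_cpu:
        if torch.cuda.is_available():
            return "cuda"
        if torch.backends.mps.is_available():
            return "mps"
    return "cpu"
```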