Quick Start
This guide will help you set up your environment and run your first multimodal inference using the vlm_from_scratch implementation of PaliGemma.
1. Prerequisites
Ensure you have Python 3.8+ installed. You will need to install the following dependencies:
```shell
pip install torch torchvision transformers pillow safetensors fire numpy
```
2. Prepare Model Weights
This project is designed to work with official PaliGemma weights. You can download these from Hugging Face (e.g., google/paligemma-3b-pt-224).
The directory should contain:
- `*.safetensors` files
- `config.json`
- Tokenizer files
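Before loading, it can save time to sanity-check that a downloaded directory contains everything the loader needs. A minimal sketch (the helper name `check_weights_dir` is hypothetical, and exact tokenizer file names vary by export):

```python
from pathlib import Path

def check_weights_dir(model_path):
    # Return the list of missing items; an empty list means the
    # directory looks complete enough to attempt loading.
    p = Path(model_path)
    missing = []
    if not list(p.glob("*.safetensors")):
        missing.append("*.safetensors")
    if not (p / "config.json").is_file():
        missing.append("config.json")
    if not list(p.glob("tokenizer*")):
        missing.append("tokenizer files")
    return missing
```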
3. Running Inference via CLI
The easiest way to test the model is with the provided `inference.py` script, which uses `fire` to expose a command-line interface.
```shell
python inference.py \
  --model_path "/path/to/your/paligemma-weights" \
  --prompt "question: what is in this image?" \
  --image_file_path "path/to/image.jpg" \
  --max_tokens_to_generate 100 \
  --do_sample False
```
CLI Arguments
| Argument | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| model_path | str | Required | Path to the local directory containing weights and config. |
| prompt | str | Required | The text prompt/question for the model. |
| image_file_path | str | Required | Path to the input image file. |
| max_tokens_to_generate | int | 100 | Maximum number of new tokens to predict. |
| temperature | float | 0.8 | Softmax temperature for sampling. |
| top_p | float | 0.9 | Top-p (nucleus) sampling threshold. |
| do_sample | bool | False | Whether to use sampling (True) or greedy search (False). |
| only_cpu | bool | False | Force inference on CPU even if CUDA/MPS is available. |
4. Programmatic Usage
You can also integrate the model into your own Python scripts.
Loading the Model
Use the load_hf_model utility to initialize the architecture and load the weights into the correct device.
```python
from utils import load_hf_model

device = "cuda"  # or "cpu", "mps"
model, tokenizer = load_hf_model("path/to/model_folder", device)
model.eval()
```
Processing Inputs
The PaliGemmaProcessor handles both image resizing/normalization and text tokenization.
```python
from PIL import Image

from processing_paligemma import PaliGemmaProcessor

# Initialize the processor from the loaded model's vision config
num_image_tokens = model.config.vision_config.num_image_tokens
image_size = model.config.vision_config.image_size
processor = PaliGemmaProcessor(tokenizer, num_image_tokens, image_size)

# Process a single image and prompt
image = Image.open("example.jpg")
model_inputs = processor(text=["Describe this image"], images=[image])
```
Generating Text
Pass the processed inputs to the model. Use the KVCache class for efficient autoregressive decoding.
```python
import torch

from modeling_gemma import KVCache

kv_cache = KVCache()
outputs = model(
    input_ids=model_inputs["input_ids"].to(device),
    pixel_values=model_inputs["pixel_values"].to(device),
    attention_mask=model_inputs["attention_mask"].to(device),
    kv_cache=kv_cache,
)

# Greedily pick the single most likely next token
next_token_logits = outputs["logits"][:, -1, :]
next_token = torch.argmax(next_token_logits, dim=-1, keepdim=True)
print(tokenizer.decode(next_token[0]))
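The single step above extends naturally into a full greedy decoding loop. A sketch, assuming batch size 1 and the calling convention shown above (`generate_greedy` is a hypothetical helper, not part of the repository; whether `pixel_values` must be passed on every step depends on the implementation):

```python
import torch

def generate_greedy(model, model_inputs, kv_cache, eos_token_id, max_new_tokens, device="cpu"):
    # After the first forward pass populates the KV cache, each subsequent
    # step feeds only the newly predicted token.
    input_ids = model_inputs["input_ids"].to(device)
    pixel_values = model_inputs["pixel_values"].to(device)
    attention_mask = model_inputs["attention_mask"].to(device)
    generated = []
    for _ in range(max_new_tokens):
        outputs = model(
            input_ids=input_ids,
            pixel_values=pixel_values,
            attention_mask=attention_mask,
            kv_cache=kv_cache,
        )
        next_token = torch.argmax(outputs["logits"][:, -1, :], dim=-1, keepdim=True)
        if next_token.item() == eos_token_id:
            break
        generated.append(next_token.item())
        # Only the new token goes in next iteration; extend the mask by one.
        input_ids = next_token
        attention_mask = torch.cat(
            [attention_mask, torch.ones((1, 1), dtype=attention_mask.dtype, device=attention_mask.device)],
            dim=-1,
        )
    return generated
```

The collected token IDs can then be decoded in one call with `tokenizer.decode(generated)`.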
Note on Device Support
The project automatically detects the best available backend in the following order:
- CUDA (NVIDIA GPUs)
- MPS (Apple Silicon)
- CPU
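That fallback order can be sketched as a small helper (a sketch only; the project's actual selection logic lives in its own code, and `pick_device` is a hypothetical name):

```python
import torch

def pick_device(only_cpu=False):
    # Prefer CUDA, then Apple MPS, then fall back to CPU,
    # mirroring the --only_cpu CLI flag.
    if not only_cpu:
        if torch.cuda.is_available():
            return "cuda"
        if torch.backends.mps.is_available():
            return "mps"
    return "cpu"
```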