# fVLM-1.7B (Foveated Vision-Language Model)
A vision-language model that uses foveated attention to compress each video frame into a single visual token, enabling efficient processing of long videos on a single GPU.
## Model Description
fVLM-1.7B is built on SmolLM2-1.7B-Instruct (language backbone) + DINOv2-small (vision encoder), connected via a foveated cross-attention mechanism that compresses each video frame into 1 visual token. This extreme compression enables processing 64+ frames within the same context window budget that traditional VLMs use for a single image.
## Architecture
| Component | Details |
|---|---|
| Language Model | SmolLM2-1.7B-Instruct |
| Vision Encoder | DINOv2-small |
| Attention | Deep query-guided foveated cross-attention |
| Visual Tokens | 1 token per frame (query-compressed) |
| Total Parameters | ~1.84B |
| Query Dimension | 384 |
| LLM Dimension | 2048 |
| Visual Scale | 0.14 |
## How Foveated Attention Works
Unlike standard VLMs that use many visual tokens per image (e.g., 576 for LLaVA), fVLM compresses each frame to a single visual token using a learned query mechanism:
- DINOv2 encodes each frame into patch features and caches K/V at every layer
- A query vector is propagated through all 12 DINO layers, attending to patch K/V at each layer (deep query attention)
- The single output token is projected to LLM dimension and prepended to the text sequence
- The LLM generates the next query from its hidden state, creating a feedback loop where the model learns where to look
This enables processing 64+ frames with the same memory as a few frames in traditional VLMs.
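The deep query attention loop can be sketched in a few lines. The function name, shapes, and residual update below are illustrative assumptions, not the repo's actual implementation:

```python
import torch
import torch.nn.functional as F

def deep_query_attend(query, kv_cache, out_proj):
    """Propagate a single query through cached per-layer patch K/V.

    query:    [B, 1, D_q]  learned (or LLM-generated) query vector
    kv_cache: list of (K, V) pairs, one per DINO layer, each [B, P, D_q]
    out_proj: nn.Linear mapping D_q -> D_llm
    Returns one visual token per frame: [B, 1, D_llm].
    """
    for K, V in kv_cache:  # one attention step per encoder layer
        attn = F.softmax(query @ K.transpose(-2, -1) / K.shape[-1] ** 0.5, dim=-1)
        query = query + attn @ V  # residual update (an assumption)
    return out_proj(query)  # project the compressed token into LLM space

# Toy shapes matching the card: 12 layers, query dim 384, LLM dim 2048
torch.manual_seed(0)
kv = [(torch.randn(1, 256, 384), torch.randn(1, 256, 384)) for _ in range(12)]
proj = torch.nn.Linear(384, 2048)
tok = deep_query_attend(torch.randn(1, 1, 384), kv, proj)
```

In autoregressive mode, the LLM's hidden state at each step would supply the next frame's `query`, closing the feedback loop.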
## Inference Modes

fVLM supports three forward modes with different speed/quality tradeoffs:

| Mode | Description | Use Case |
|---|---|---|
| `coarse_only` | Single static-query pass | Fastest; good for images and quick inference |
| `coarse_fine` | Two-pass parallel forward (soft attention) | Training mode; uses foveated attention |
| `autoregressive` | Sequential with KV cache (hard attention) | Iterative foveation for video |
## Benchmark Results

### fVLM-1.7B (Stage 3 DPO)
| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|---|---|---|---|
| MVBench (3800) | 30.8% | 29.9% | 29.9% |
| Video-MME (2700) | 30.5% | 28.2% | 30.4% |
| ScienceQA (2017) | 49.0% | 43.8% | 46.6% |
### fVLM-135M (Stage 3 DPO), for comparison
| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|---|---|---|---|
| MVBench | 27.4% | 28.0% | 27.9% |
| Video-MME | 26.2% | 29.5% | 28.7% |
| ScienceQA | 36.4% | 35.6% | 35.4% |
Scaling gain (1.7B vs 135M): +3.4pp MVBench, +4.3pp Video-MME, +12.6pp ScienceQA (coarse-only).
## Training
Trained with a 3-stage pipeline (alignment, SFT, DPO) on a single A100-80GB GPU. Total training time: ~16 hours.
### Stage 1: Visual Alignment (4.3h, 31,250 steps)
- Objective: Align DINOv2 visual features with the SmolLM2 text embedding space
- Data: OpenVid-1M (905K) + WebVid (19K) + 14% SmolTalk text retention
- Loss: Full-text cross-entropy (predict all tokens)
- LR: Converging schedule -- connector 1e-3 to 3e-5, backbone 1e-5 to 3e-5
- Batch size: 32
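The converging schedule can be sketched as a simple interpolation between the per-group start and end rates; the interpolation shape (linear) is an assumption here:

```python
def converging_lr(step, total_steps, start, end):
    """Interpolate from `start` to `end` over training.
    The linear shape is an illustrative assumption."""
    t = min(step / total_steps, 1.0)
    return start + t * (end - start)

TOTAL = 31_250
# Connector decays 1e-3 -> 3e-5 while the backbone warms 1e-5 -> 3e-5,
# so both parameter groups converge to the same LR by the end of Stage 1.
for step in (0, TOTAL // 2, TOTAL):
    conn = converging_lr(step, TOTAL, 1e-3, 3e-5)
    back = converging_lr(step, TOTAL, 1e-5, 3e-5)
    print(f"step {step:>6}: connector {conn:.2e}, backbone {back:.2e}")
```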
### Stage 2: Vision-Language SFT (9.5h, 31,250 steps)
- Objective: Supervised fine-tuning on vision-language tasks
- Data: Cauldron (2M images) + video datasets (~1.6M) + 14% SmolTalk text retention
- Loss: Answer-only cross-entropy (mask user/system tokens)
- LR: Flat 3e-5 all components with cosine decay
- Batch size: 32, gradient checkpointing enabled
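Answer-only cross-entropy simply zeroes out prompt positions before averaging. A minimal sketch of that masking (the helper below is illustrative, not the repo's code):

```python
import torch
import torch.nn.functional as F

def answer_only_loss(logits, labels, loss_mask):
    """Cross-entropy averaged over answer tokens only; user/system
    positions carry loss_mask == 0 (a sketch of the Stage 2 objective)."""
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), reduction="none"
    ).view(labels.shape)
    mask = loss_mask.float()
    return (per_token * mask).sum() / mask.sum().clamp(min=1)

# Toy batch: 4 positions, the first two are prompt tokens (masked out).
torch.manual_seed(0)
logits = torch.randn(1, 4, 10)
labels = torch.randint(0, 10, (1, 4))
loss_mask = torch.tensor([[0.0, 0.0, 1.0, 1.0]])
loss = answer_only_loss(logits, labels, loss_mask)
```

Because masked positions contribute nothing, changing a prompt-token label leaves the loss unchanged.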
### Stage 3: DPO Preference Optimization (1.9h, 2,593 steps)
- Objective: Align outputs with human preferences
- Data: RLAIF-V (83K preference pairs)
- Loss: DPO with beta=0.1
- LR: 5e-7 all components
- Batch size: 8, grad accumulation 4 (effective batch 32), gradient checkpointing enabled
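The standard DPO objective with beta = 0.1 looks like this (a sketch; the toy log-probabilities are illustrative numbers):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective: push the policy's chosen-vs-rejected
    log-ratio above the frozen reference model's."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()

# Sequence log-probs (illustrative): the policy already prefers `chosen`
# more strongly than the reference does, so the loss is below log 2.
loss = dpo_loss(
    torch.tensor([-1.0]), torch.tensor([-2.0]),  # policy log p
    torch.tensor([-1.5]), torch.tensor([-1.5]),  # reference log p
)
```

At zero margin the loss equals log 2; training drives it lower by widening the preference margin.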
## Bug Fixes in This Version
This release includes several important bug fixes over earlier checkpoints:
- **`eos_token`/`ignore_index` collision:** The EOS token ID collided with the `ignore_index` value used in the cross-entropy loss, so EOS targets were masked out and the model never learned to emit EOS properly. Fixed by switching to a non-colliding ignore index.
- **Stage 2 OOM skip-rate fix:** During Stage 2 SFT, out-of-memory errors on large batches were being silently skipped at a high rate, effectively reducing the amount of training data seen. Fixed by improving memory handling to lower the skip rate.
- **Benchmark letter-bias fix:** The benchmark evaluation code was biased toward certain answer letters in multiple-choice questions, inflating scores for some options. Fixed to ensure fair scoring across all answer choices.
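The first fix is easy to demonstrate. The token IDs below are hypothetical; the point is that `ignore_index` drops every position whose label equals it:

```python
import torch
import torch.nn.functional as F

# Hypothetical IDs for illustration. When ignore_index equals the EOS id,
# every EOS position is dropped from the loss, so the model gets no
# gradient toward emitting EOS.
torch.manual_seed(0)
EOS_ID = 2
logits = torch.randn(1, 5, 10)
labels = torch.tensor([[4, 7, EOS_ID, 1, EOS_ID]])

buggy = F.cross_entropy(logits.view(-1, 10), labels.view(-1), ignore_index=EOS_ID)
fixed = F.cross_entropy(logits.view(-1, 10), labels.view(-1), ignore_index=-100)
# `buggy` averages over only the 3 non-EOS targets; `fixed` supervises all 5.
```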
## Files

| File | Description |
|---|---|
| `checkpoint.pt` | Stage 3 (DPO) final checkpoint (step 2593), PyTorch format |
| `model.safetensors` | Model weights in safetensors format (previous version) |
| `model.py` | Full model architecture code |
| `train.py` | Training script (all 3 stages) |
| `data.py` | Data loading and preprocessing |
| `benchmark.py` | Benchmark evaluation code |
| `logger.py` | Logging utilities |
| `benchmark_results.json` | Full benchmark results with per-category breakdowns |
## Usage

### Setup

```python
import torch
from torchvision import transforms
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

# Download checkpoint
ckpt_path = hf_hub_download("sanps/fVLM-1.7B", "checkpoint.pt")

# Build model (model.py is included in the repo files)
from model import FoveatedVLM

model = FoveatedVLM(
    llm_name="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    dino_name="facebook/dinov2-small",
    query_dim=384,
    visual_scale=0.14,
    deep_query=True,
)

# Load weights
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model"] if "model" in ckpt else ckpt)
model = model.to("cuda").to(torch.bfloat16).eval()

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")

# Standard DINO preprocessing
frame_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```
### Image Input

**Important:** fVLM treats all inputs as video. Replicate a static image to 8 frames to match the training distribution.

```python
from PIL import Image

img = Image.open("photo.jpg").convert("RGB")
frame_tensor = frame_transform(img)  # [3, 224, 224]
frames = frame_tensor.unsqueeze(0).repeat(8, 1, 1, 1)  # [8, 3, 224, 224]
frames = frames.unsqueeze(0).to("cuda", dtype=torch.bfloat16)  # [1, 8, 3, 224, 224]
```
### Video Input

For video, sample up to 64 frames uniformly; no replication is needed.

```python
tensors = [frame_transform(f) for f in video_frames]
frames = torch.stack(tensors).unsqueeze(0).to("cuda", dtype=torch.bfloat16)
# frames shape: [1, T, 3, 224, 224] where T = number of frames (1-64)
```
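Uniform sampling can be done with a small index helper; the function below is a sketch, not the repo's actual sampling code:

```python
def uniform_indices(n_total, n_sample=64):
    """Evenly spaced frame indices across a clip (illustrative helper)."""
    if n_total <= n_sample:
        return list(range(n_total))  # short clips: keep every frame
    step = n_total / n_sample
    return [int(i * step) for i in range(n_sample)]

# e.g. a 300-frame clip is reduced to 64 evenly spaced frames:
idx = uniform_indices(300)
# video_frames = [all_frames[i] for i in idx]
```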
### Inference

```python
messages = [
    {"role": "user", "content": "Describe what is happening in this image."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer.encode(text, return_tensors="pt").to("cuda")
attention_mask = torch.ones_like(input_ids)
loss_mask = torch.ones_like(input_ids, dtype=torch.float32)

with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
    result = model(
        frames=frames,
        input_ids=input_ids,
        attention_mask=attention_mask,
        loss_mask=loss_mask,
        mode="coarse_fine",  # or "coarse_only" or "autoregressive"
    )

# result["logits"]: [B, S, V] text logits
# result["loss"]: scalar cross-entropy loss
```
## Citation

If you use this model, please cite:

```bibtex
@misc{fvlm2025,
  title={fVLM: Foveated Vision-Language Model},
  author={Sandeep Sampath Kumar},
  year={2025},
  url={https://huggingface.co/sanps/fVLM-1.7B}
}
```
## License

Apache 2.0