fVLM-1.7B (Foveated Vision-Language Model)

A vision-language model that uses foveated attention to compress each video frame into a single visual token, enabling efficient processing of long videos on a single GPU.

Model Description

fVLM-1.7B is built on SmolLM2-1.7B-Instruct (language backbone) + DINOv2-small (vision encoder), connected via a foveated cross-attention mechanism that compresses each video frame into 1 visual token. This extreme compression enables processing 64+ frames within the same context window budget that traditional VLMs use for a single image.

Architecture

| Component | Details |
|---|---|
| Language Model | SmolLM2-1.7B-Instruct |
| Vision Encoder | DINOv2-small |
| Attention | Deep query-guided foveated cross-attention |
| Visual Tokens | 1 token per frame (query-compressed) |
| Total Parameters | ~1.84B |
| Query Dimension | 384 |
| LLM Dimension | 2048 |
| Visual Scale | 0.14 |

How Foveated Attention Works

Unlike standard VLMs that use many visual tokens per image (e.g., 576 for LLaVA), fVLM compresses each frame to a single visual token using a learned query mechanism:

  1. DINOv2 encodes each frame into patch features and caches K/V at every layer
  2. A query vector is propagated through all 12 DINO layers, attending to patch K/V at each layer (deep query attention)
  3. The single output token is projected to LLM dimension and prepended to the text sequence
  4. The LLM generates the next query from its hidden state, creating a feedback loop where the model learns where to look

This enables processing 64+ frames with the same memory as a few frames in traditional VLMs.
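The deep query pass (step 2 above) can be sketched as a single query vector doing cross-attention over the cached patch K/V of each encoder layer. This is a minimal illustrative sketch; the projections, head count, and layer wiring in the repository's model.py may differ.

```python
import torch
import torch.nn.functional as F

def deep_query_attend(query, layer_kv, out_proj):
    """One query vector attends over cached patch K/V at every encoder layer.

    query:    [B, 1, D]   learned (or LLM-generated) query
    layer_kv: list of (K, V) pairs, each [B, N_patches, D]
    out_proj: projects the final D-dim token to the LLM width
    """
    for k, v in layer_kv:  # propagate the query through all layers
        attn = F.softmax(query @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        query = query + attn @ v  # residual update refines the query layer by layer
    return out_proj(query)  # [B, 1, llm_dim] -- the single visual token

# Toy shapes: 12 DINO layers, 256 patches, query_dim=384, llm_dim=2048
B, N, D, LLM_DIM = 2, 256, 384, 2048
kv = [(torch.randn(B, N, D), torch.randn(B, N, D)) for _ in range(12)]
token = deep_query_attend(torch.randn(B, 1, D), kv, torch.nn.Linear(D, LLM_DIM))
print(token.shape)  # torch.Size([2, 1, 2048])
```

The key property is that memory per frame is constant: no matter how many patches the encoder produces, exactly one token reaches the LLM.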

Inference Modes

fVLM supports three forward modes with different speed/quality tradeoffs:

| Mode | Description | Use Case |
|---|---|---|
| coarse_only | Single static-query pass | Fastest; good for images and quick inference |
| coarse_fine | Two-pass parallel forward (soft attention) | Training mode; uses foveated attention |
| autoregressive | Sequential with KV cache (hard attention) | Iterative foveation for video |

Benchmark Results

fVLM-1.7B (Stage 3 DPO)

| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|---|---|---|---|
| MVBench (3800) | 30.8% | 29.9% | 29.9% |
| Video-MME (2700) | 30.5% | 28.2% | 30.4% |
| ScienceQA (2017) | 49.0% | 43.8% | 46.6% |

fVLM-135M (Stage 3 DPO) — for comparison

| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|---|---|---|---|
| MVBench | 27.4% | 28.0% | 27.9% |
| Video-MME | 26.2% | 29.5% | 28.7% |
| ScienceQA | 36.4% | 35.6% | 35.4% |

Scaling gain (1.7B vs 135M): +3.4pp MVBench, +4.3pp Video-MME, +12.6pp ScienceQA (coarse-only).

Training

Trained with a 3-stage pipeline (alignment, SFT, DPO) on a single A100-80GB GPU. Total training time: ~16 hours.

Stage 1: Visual Alignment (4.3h, 31,250 steps)

  • Objective: Align DINOv2 visual features with the SmolLM2 text embedding space
  • Data: OpenVid-1M (905K) + WebVid (19K) + 14% SmolTalk text retention
  • Loss: Full-text cross-entropy (predict all tokens)
  • LR: Converging schedule (connector 1e-3 → 3e-5, backbone 1e-5 → 3e-5; both meet at 3e-5)
  • Batch size: 32
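The converging schedule can be sketched as a simple interpolation between the stated endpoints. The card gives only the start and end learning rates, so the linear shape below is an assumption.

```python
def converging_lr(step, total_steps, start_lr, end_lr):
    """Interpolate from start_lr to end_lr over the run (shape assumed linear)."""
    t = min(step / total_steps, 1.0)
    return start_lr + (end_lr - start_lr) * t

TOTAL = 31_250
connector_lr = converging_lr(0, TOTAL, 1e-3, 3e-5)      # connector starts high and decays...
backbone_lr = converging_lr(TOTAL, TOTAL, 1e-5, 3e-5)   # ...backbone warms up; both end at 3e-5
```

The intent of a converging schedule is that the randomly initialized connector moves fast early on while the pretrained backbone is barely perturbed, and both settle to the same rate by the end of alignment.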

Stage 2: Vision-Language SFT (9.5h, 31,250 steps)

  • Objective: Supervised fine-tuning on vision-language tasks
  • Data: Cauldron (2M images) + video datasets (~1.6M) + 14% SmolTalk text retention
  • Loss: Answer-only cross-entropy (mask user/system tokens)
  • LR: 3e-5 for all components, with cosine decay over training
  • Batch size: 32, gradient checkpointing enabled

Stage 3: DPO Preference Optimization (1.9h, 2,593 steps)

  • Objective: Align outputs with human preferences
  • Data: RLAIF-V (83K preference pairs)
  • Loss: DPO with beta=0.1
  • LR: 5e-7 for all components
  • Batch size: 8, grad accumulation 4 (effective batch 32), gradient checkpointing enabled
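The Stage 3 objective is the standard DPO loss with beta=0.1. A minimal sketch (the helper name and the toy log-probabilities are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss.

    Each argument is the summed log-probability of the chosen/rejected
    response under the trained policy (pi_*) or the frozen reference (ref_*).
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

# When the policy prefers the chosen response more strongly than the
# reference does, the loss drops below the indifference value -log(0.5) ~= 0.693.
loss = dpo_loss(torch.tensor([-5.0]), torch.tensor([-9.0]),
                torch.tensor([-6.0]), torch.tensor([-8.0]))
```

With beta=0.1 the implicit reward is weakly scaled, so the policy is only gently pushed away from the reference model, which is the usual choice when the SFT model is already close to the desired behavior.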

Bug Fixes in This Version

This release includes several important bug fixes over earlier checkpoints:

  1. eos_token / ignore_index collision: The EOS token ID was colliding with the ignore_index value used in cross-entropy loss, causing the model to never learn to produce EOS tokens properly. Fixed by using a non-colliding ignore index.

  2. Stage 2 OOM skip rate fix: During Stage 2 SFT training, out-of-memory errors on large batches were being silently skipped at a high rate, effectively reducing the training data seen. Fixed to properly handle memory management and reduce skip rate.

  3. Benchmark letter-bias fix: The benchmark evaluation code had a bias toward certain answer letters in multiple-choice questions, inflating scores for some options. Fixed to ensure fair evaluation across all answer choices.
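Fix 1 can be illustrated with a toy example. If the ignore value passed to cross-entropy equals the EOS token ID, every EOS target is silently masked out of the loss. The vocabulary size and token IDs below are illustrative, not the repository's actual values.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, eos_id = 10, 2
logits = torch.randn(4, vocab_size)
labels = torch.tensor([5, 7, eos_id, eos_id])  # last two targets are real EOS tokens

# Buggy: ignore_index == eos_id masks every EOS target, so the model
# never receives gradient toward emitting EOS.
buggy = F.cross_entropy(logits, labels, ignore_index=eos_id)

# Fixed: use a sentinel that can never appear as a real token ID.
fixed = F.cross_entropy(logits, labels, ignore_index=-100)
```

PyTorch's default `ignore_index` of -100 is a safe sentinel precisely because no tokenizer produces negative token IDs.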

Files

| File | Description |
|---|---|
| checkpoint.pt | Stage 3 (DPO) final checkpoint (step 2593), PyTorch format |
| model.safetensors | Model weights in safetensors format (previous version) |
| model.py | Full model architecture code |
| train.py | Training script (all 3 stages) |
| data.py | Data loading and preprocessing |
| benchmark.py | Benchmark evaluation code |
| logger.py | Logging utilities |
| benchmark_results.json | Full benchmark results with per-category breakdowns |

Usage

Setup

import torch
from torchvision import transforms
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

# Download checkpoint and architecture code (model.py ships in this repo;
# fetching it into the working directory lets the import below resolve)
ckpt_path = hf_hub_download("sanps/fVLM-1.7B", "checkpoint.pt")
hf_hub_download("sanps/fVLM-1.7B", "model.py", local_dir=".")

# Build model
from model import FoveatedVLM

model = FoveatedVLM(
    llm_name="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    dino_name="facebook/dinov2-small",
    query_dim=384,
    visual_scale=0.14,
    deep_query=True,
)

# Load weights
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model"] if "model" in ckpt else ckpt)
model = model.to("cuda").to(torch.bfloat16).eval()

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")

# Standard DINO preprocessing
frame_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

Image Input

Important: fVLM treats all inputs as video. Static images must be replicated to 8 frames to match the training distribution.

from PIL import Image

img = Image.open("photo.jpg").convert("RGB")
frame_tensor = frame_transform(img)                      # [3, 224, 224]
frames = frame_tensor.unsqueeze(0).repeat(8, 1, 1, 1)   # [8, 3, 224, 224]
frames = frames.unsqueeze(0).to("cuda", dtype=torch.bfloat16)  # [1, 8, 3, 224, 224]

Video Input

For video, sample up to 64 frames uniformly. No replication needed.

# video_frames: a list of PIL.Image frames (decode with your preferred library)
tensors = [frame_transform(f) for f in video_frames]
frames = torch.stack(tensors).unsqueeze(0).to("cuda", dtype=torch.bfloat16)
# frames shape: [1, T, 3, 224, 224] where T = number of frames (1-64)
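Uniform sampling can be done with evenly spaced indices into the decoded clip; the helper name below is illustrative.

```python
import numpy as np

def sample_frame_indices(n_frames, max_frames=64):
    """Evenly spaced frame indices spanning the whole clip."""
    if n_frames <= max_frames:
        return list(range(n_frames))  # short clips keep every frame
    return np.linspace(0, n_frames - 1, num=max_frames).round().astype(int).tolist()

indices = sample_frame_indices(300)  # 64 indices covering frames 0..299
```

Anchoring the first and last index to the clip boundaries ensures the sampled frames always span the full duration rather than truncating the end of the video.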

Inference

messages = [
    {"role": "user", "content": "Describe what is happening in this image."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer.encode(text, return_tensors="pt").to("cuda")
attention_mask = torch.ones_like(input_ids)
loss_mask = torch.ones_like(input_ids, dtype=torch.float32)

with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
    result = model(
        frames=frames,
        input_ids=input_ids,
        attention_mask=attention_mask,
        loss_mask=loss_mask,
        mode="coarse_fine",       # or "coarse_only" or "autoregressive"
    )
# result["logits"]: [B, S, V] text logits
# result["loss"]: scalar cross-entropy loss

Citation

If you use this model, please cite:

@misc{fvlm2025,
  title={fVLM: Foveated Vision-Language Model},
  author={Sandeep Sampath Kumar},
  year={2025},
  url={https://huggingface.co/sanps/fVLM-1.7B}
}

License

Apache 2.0
