fVLM-1.7B (Foveated Vision-Language Model)

A vision-language model that uses foveated attention to compress each video frame into a single visual token, enabling efficient processing of long videos on a single GPU.

Model Description

fVLM-1.7B is built on SmolLM2-1.7B-Instruct (language backbone) + DINOv2-small (vision encoder), connected via a foveated cross-attention mechanism that compresses each video frame into 1 visual token. This extreme compression enables processing 64+ frames within the same context window budget that traditional VLMs use for a single image.

Architecture

| Component | Details |
|---|---|
| Language Model | SmolLM2-1.7B-Instruct |
| Vision Encoder | DINOv2-small |
| Attention | Deep query-guided foveated cross-attention |
| Visual Tokens | 1 token per frame (query-compressed) |
| Total Parameters | ~1.84B |
| Query Dimension | 384 |
| LLM Dimension | 2048 |
| Visual Scale | 0.14 |

How Foveated Attention Works

Unlike standard VLMs that use many visual tokens per image (e.g., 576 for LLaVA), fVLM compresses each frame to a single visual token using a learned query mechanism:

  1. DINOv2 encodes each frame into patch features and caches K/V at every layer
  2. A query vector is propagated through all 12 DINO layers, attending to patch K/V at each layer (deep query attention)
  3. The single output token is projected to LLM dimension and prepended to the text sequence
  4. The LLM generates the next query from its hidden state, creating a feedback loop where the model learns where to look

This enables processing 64+ frames with the same memory as a few frames in traditional VLMs.
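The deep query pass (step 2 above) can be sketched as a single query vector doing cross-attention over the cached patch K/V of each encoder layer. This is a minimal illustrative sketch; the projections, head count, and layer wiring in the repository's model.py may differ.

```python
import torch
import torch.nn.functional as F

def deep_query_attend(query, layer_kv, out_proj):
    """One query vector attends over cached patch K/V at every encoder layer.

    query:    [B, 1, D]   learned (or LLM-generated) query
    layer_kv: list of (K, V) pairs, each [B, N_patches, D]
    out_proj: projects the final D-dim token to the LLM width
    """
    for k, v in layer_kv:  # propagate the query through all layers
        attn = F.softmax(query @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        query = query + attn @ v  # residual update refines the query layer by layer
    return out_proj(query)  # [B, 1, llm_dim] -- the single visual token

# Toy shapes: 12 DINO layers, 256 patches, query_dim=384, llm_dim=2048
B, N, D, LLM_DIM = 2, 256, 384, 2048
kv = [(torch.randn(B, N, D), torch.randn(B, N, D)) for _ in range(12)]
token = deep_query_attend(torch.randn(B, 1, D), kv, torch.nn.Linear(D, LLM_DIM))
print(token.shape)  # torch.Size([2, 1, 2048])
```

The key property is that memory per frame is constant: no matter how many patches the encoder produces, exactly one token reaches the LLM.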

Inference Modes

fVLM supports three forward modes with different speed/quality tradeoffs:

| Mode | Description | Use Case |
|---|---|---|
| coarse_only | Single static-query pass | Fastest; good for images and quick inference |
| coarse_fine | Two-pass parallel forward (soft attention) | Training mode; uses foveated attention |
| autoregressive | Sequential with KV cache (hard attention) | Iterative foveation for video |

Benchmark Results

fVLM-1.7B (Stage 3 DPO)

| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|---|---|---|---|
| MVBench (3800) | 30.8% | 29.9% | 29.9% |
| Video-MME (2700) | 30.5% | 28.2% | 30.4% |
| ScienceQA (2017) | 49.0% | 43.8% | 46.6% |

fVLM-135M (Stage 3 DPO) — for comparison

| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|---|---|---|---|
| MVBench | 27.4% | 28.0% | 27.9% |
| Video-MME | 26.2% | 29.5% | 28.7% |
| ScienceQA | 36.4% | 35.6% | 35.4% |

Scaling gain (1.7B vs 135M): +3.4pp MVBench, +4.3pp Video-MME, +12.6pp ScienceQA (coarse-only).

Training

Trained with a 3-stage pipeline (alignment, SFT, DPO) on a single A100-80GB GPU. Total training time: ~16 hours.

Stage 1: Visual Alignment (4.3h, 31,250 steps)

  • Objective: Align DINOv2 visual features with the SmolLM2 text embedding space
  • Data: OpenVid-1M (905K) + WebVid (19K) + 14% SmolTalk text retention
  • Loss: Full-text cross-entropy (predict all tokens)
  • LR: Converging schedule (connector 1e-3 → 3e-5, backbone 1e-5 → 3e-5; both meet at 3e-5)
  • Batch size: 32
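The converging schedule can be sketched as a simple interpolation between the stated endpoints. The card gives only the start and end learning rates, so the linear shape below is an assumption.

```python
def converging_lr(step, total_steps, start_lr, end_lr):
    """Interpolate from start_lr to end_lr over the run (shape assumed linear)."""
    t = min(step / total_steps, 1.0)
    return start_lr + (end_lr - start_lr) * t

TOTAL = 31_250
connector_lr = converging_lr(0, TOTAL, 1e-3, 3e-5)      # connector starts high and decays...
backbone_lr = converging_lr(TOTAL, TOTAL, 1e-5, 3e-5)   # ...backbone warms up; both end at 3e-5
```

The intent of a converging schedule is that the randomly initialized connector moves fast early on while the pretrained backbone is barely perturbed, and both settle to the same rate by the end of alignment.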

Stage 2: Vision-Language SFT (9.5h, 31,250 steps)

  • Objective: Supervised fine-tuning on vision-language tasks
  • Data: Cauldron (2M images) + video datasets (~1.6M) + 14% SmolTalk text retention
  • Loss: Answer-only cross-entropy (mask user/system tokens)
  • LR: 3e-5 for all components, with cosine decay over training
  • Batch size: 32, gradient checkpointing enabled

Stage 3: DPO Preference Optimization (1.9h, 2,593 steps)

  • Objective: Align outputs with human preferences
  • Data: RLAIF-V (83K preference pairs)
  • Loss: DPO with beta=0.1
  • LR: 5e-7 for all components
  • Batch size: 8, grad accumulation 4 (effective batch 32), gradient checkpointing enabled
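The Stage 3 objective is the standard DPO loss with beta=0.1. A minimal sketch (the helper name and the toy log-probabilities are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss.

    Each argument is the summed log-probability of the chosen/rejected
    response under the trained policy (pi_*) or the frozen reference (ref_*).
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

# When the policy prefers the chosen response more strongly than the
# reference does, the loss drops below the indifference value -log(0.5) ~= 0.693.
loss = dpo_loss(torch.tensor([-5.0]), torch.tensor([-9.0]),
                torch.tensor([-6.0]), torch.tensor([-8.0]))
```

With beta=0.1 the implicit reward is weakly scaled, so the policy is only gently pushed away from the reference model, which is the usual choice when the SFT model is already close to the desired behavior.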

Bug Fixes in This Version

This release includes several important bug fixes over earlier checkpoints:

  1. eos_token / ignore_index collision: The EOS token ID was colliding with the ignore_index value used in cross-entropy loss, causing the model to never learn to produce EOS tokens properly. Fixed by using a non-colliding ignore index.

  2. Stage 2 OOM skip rate fix: During Stage 2 SFT training, out-of-memory errors on large batches were being silently skipped at a high rate, effectively reducing the training data seen. Fixed to properly handle memory management and reduce skip rate.

  3. Benchmark letter-bias fix: The benchmark evaluation code had a bias toward certain answer letters in multiple-choice questions, inflating scores for some options. Fixed to ensure fair evaluation across all answer choices.
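Fix 1 can be illustrated with a toy example. If the ignore value passed to cross-entropy equals the EOS token ID, every EOS target is silently masked out of the loss. The vocabulary size and token IDs below are illustrative, not the repository's actual values.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, eos_id = 10, 2
logits = torch.randn(4, vocab_size)
labels = torch.tensor([5, 7, eos_id, eos_id])  # last two targets are real EOS tokens

# Buggy: ignore_index == eos_id masks every EOS target, so the model
# never receives gradient toward emitting EOS.
buggy = F.cross_entropy(logits, labels, ignore_index=eos_id)

# Fixed: use a sentinel that can never appear as a real token ID.
fixed = F.cross_entropy(logits, labels, ignore_index=-100)
```

PyTorch's default `ignore_index` of -100 is a safe sentinel precisely because no tokenizer produces negative token IDs.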

Files

| File | Description |
|---|---|
| checkpoint.pt | Stage 3 (DPO) final checkpoint (step 2593), PyTorch format |
| model.safetensors | Model weights in safetensors format (previous version) |
| model.py | Full model architecture code |
| train.py | Training script (all 3 stages) |
| data.py | Data loading and preprocessing |
| benchmark.py | Benchmark evaluation code |
| logger.py | Logging utilities |
| benchmark_results.json | Full benchmark results with per-category breakdowns |

Usage

Setup

import torch
from torchvision import transforms
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

# Download checkpoint and architecture code (model.py ships in this repo;
# fetching it into the working directory lets the import below resolve)
ckpt_path = hf_hub_download("sanps/fVLM-1.7B", "checkpoint.pt")
hf_hub_download("sanps/fVLM-1.7B", "model.py", local_dir=".")

# Build model
from model import FoveatedVLM

model = FoveatedVLM(
    llm_name="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    dino_name="facebook/dinov2-small",
    query_dim=384,
    visual_scale=0.14,
    deep_query=True,
)

# Load weights
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model"] if "model" in ckpt else ckpt)
model = model.to("cuda").to(torch.bfloat16).eval()

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")

# Standard DINO preprocessing
frame_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

Image Input

Important: fVLM treats all inputs as video. Static images must be replicated to 8 frames to match the training distribution.

from PIL import Image

img = Image.open("photo.jpg").convert("RGB")
frame_tensor = frame_transform(img)                      # [3, 224, 224]
frames = frame_tensor.unsqueeze(0).repeat(8, 1, 1, 1)   # [8, 3, 224, 224]
frames = frames.unsqueeze(0).to("cuda", dtype=torch.bfloat16)  # [1, 8, 3, 224, 224]

Video Input

For video, sample up to 64 frames uniformly. No replication needed.

# video_frames: a list of PIL.Image frames (decode with your preferred library)
tensors = [frame_transform(f) for f in video_frames]
frames = torch.stack(tensors).unsqueeze(0).to("cuda", dtype=torch.bfloat16)
# frames shape: [1, T, 3, 224, 224] where T = number of frames (1-64)
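Uniform sampling can be done with evenly spaced indices into the decoded clip; the helper name below is illustrative.

```python
import numpy as np

def sample_frame_indices(n_frames, max_frames=64):
    """Evenly spaced frame indices spanning the whole clip."""
    if n_frames <= max_frames:
        return list(range(n_frames))  # short clips keep every frame
    return np.linspace(0, n_frames - 1, num=max_frames).round().astype(int).tolist()

indices = sample_frame_indices(300)  # 64 indices covering frames 0..299
```

Anchoring the first and last index to the clip boundaries ensures the sampled frames always span the full duration rather than truncating the end of the video.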

Inference

messages = [
    {"role": "user", "content": "Describe what is happening in this image."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer.encode(text, return_tensors="pt").to("cuda")
attention_mask = torch.ones_like(input_ids)
loss_mask = torch.ones_like(input_ids, dtype=torch.float32)

with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
    result = model(
        frames=frames,
        input_ids=input_ids,
        attention_mask=attention_mask,
        loss_mask=loss_mask,
        mode="coarse_fine",       # or "coarse_only" or "autoregressive"
    )
# result["logits"]: [B, S, V] text logits
# result["loss"]: scalar cross-entropy loss

Citation

If you use this model, please cite:

@misc{fvlm2025,
  title={fVLM: Foveated Vision-Language Model},
  author={Sandeep Sampath Kumar},
  year={2025},
  url={https://huggingface.co/sanps/fVLM-1.7B}
}

License

Apache 2.0
