Understanding Multi-View Transformers
Paper: arXiv 2510.24907
This model provides pretrained probes for analyzing the internal representations of multi-view transformers, specifically DUSt3R
(final checkpoint trained at resolution 512 with a DPT output head).
The probes decode 3D pointmaps from intermediate transformer features, enabling layer-wise study of geometric reasoning.
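To illustrate the idea of probing, here is a toy sketch of a pointmap probe: a small head that maps per-patch transformer features to a 3D point and a confidence value for every pixel. All names and sizes below are illustrative assumptions, not this repository's actual DPT-style probe architecture.

```python
import torch
import torch.nn as nn

class ToyPointmapProbe(nn.Module):
    """Toy probe: per-patch features -> per-pixel (x, y, z) + confidence."""

    def __init__(self, feat_dim=1024, patch=16):
        super().__init__()
        self.patch = patch
        # Each patch token predicts patch*patch pixels, 3 coords + 1 confidence.
        self.head = nn.Linear(feat_dim, patch * patch * 4)

    def forward(self, feats, H, W):
        # feats: (B, N, C) patch tokens from one transformer layer.
        B, N, C = feats.shape
        out = self.head(feats)                       # (B, N, patch*patch*4)
        h, w = H // self.patch, W // self.patch
        out = out.view(B, h, w, self.patch, self.patch, 4)
        out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, 4)
        pts3d, conf = out[..., :3], out[..., 3]
        return {"pts3d": pts3d, "conf": conf.exp()}  # exp keeps confidence positive

probe = ToyPointmapProbe()
feats = torch.randn(2, (512 // 16) ** 2, 1024)       # fake features, 512x512 views
pred = probe(feats, 512, 512)
print(pred["pts3d"].shape, pred["conf"].shape)
# torch.Size([2, 512, 512, 3]) torch.Size([2, 512, 512])
```

The pretrained probes distributed here play this role at several transformer layers, which is what enables the layer-wise comparison.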
This work accompanies the paper "Understanding Multi-View Transformers" (ICCV 2025 E2E3D Workshop).
Inputs: images of shape (B, 3, H, W), normalized to [-1, 1].

Outputs per probed layer:
- pts3d: (B, H, W, 3) 3D pointmap
- conf: (B, H, W) confidence map

```python
import requests
import torch
from PIL import Image
import torchvision.transforms as T

from src.models.probes import PointmapProbes

# Load the DUSt3R backbone together with its pretrained pointmap probes.
model, probes = PointmapProbes.load_backbone_and_probe(
    "jgaubil/und3rstand-dust3r-512-dpt"
)
model.eval()
probes.eval()

# Example two-view input pair.
view1_path = "https://raw.githubusercontent.com/JulienGaubil/und3rstand/main/assets/samples/example_view1.jpg"
view2_path = "https://raw.githubusercontent.com/JulienGaubil/und3rstand/main/assets/samples/example_view2.jpg"

# Resize, center-crop, and normalize to [-1, 1].
transform = T.Compose([
    T.Resize(512),
    T.CenterCrop(512),
    T.ToTensor(),
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
view1_images = transform(
    Image.open(requests.get(view1_path, stream=True).raw).convert("RGB")
).unsqueeze(0)
view2_images = transform(
    Image.open(requests.get(view2_path, stream=True).raw).convert("RGB")
).unsqueeze(0)

# Extract intermediate features and decode a pointmap at every probed layer.
with torch.no_grad():
    feat_list = model(view1_images, view2_images)
    outputs = probes(feat_list)

for layer_id, (pred1, pred2) in zip(model.probed_layers.layer_ids, outputs):
    print(f"{layer_id}: pts3d={pred1['pts3d'].shape}, conf={pred1['conf'].shape}")
```
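The per-layer confidence maps can be used to keep only reliable predictions. Below is a hypothetical post-processing step, using random stand-in tensors in place of an actual `pred1`; the threshold value is an illustrative assumption, not a value from the paper.

```python
import torch

# Stand-ins for one probe output (pred1["pts3d"], pred1["conf"]).
pts3d = torch.randn(1, 512, 512, 3)   # per-pixel 3D points
conf = torch.rand(1, 512, 512) * 5.0  # per-pixel positive confidence

# Keep only pixels whose confidence exceeds a threshold (value is illustrative),
# yielding a sparse set of confident 3D points for this layer.
mask = conf > 1.5
points = pts3d[mask]                  # (K, 3) confident 3D points
print(points.shape, mask.float().mean().item())
```

Comparing such filtered point sets across layers is one simple way to see where geometric structure emerges in the network.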
```bibtex
@inproceedings{stary2025understanding,
  title={{Understanding Multi-View Transformers}},
  author={Star{\'y}, Michal and Gaubil, Julien and Tewari, Ayush and Sitzmann, Vincent},
  booktitle={ICCV 2025 E2E3D Workshop},
  year={2025}
}
```
Base model
naver/DUSt3R_ViTLarge_BaseDecoder_512_dpt