Understanding Multi-View Transformers
Paper: arXiv 2510.24907
This model provides pretrained probes for analyzing the internal representations of multi-view transformers, specifically DUSt3R
(final checkpoint trained at resolution 512 with a DPT output head).
The probes decode 3D pointmaps from intermediate transformer features, enabling layer-wise study of geometric reasoning.
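To illustrate the idea of probing, here is a toy sketch of a pointmap probe: a small head that maps per-patch transformer features to a 3D point and a confidence value for every pixel. All names and sizes below are illustrative assumptions, not this repository's actual DPT-style probe architecture.

```python
import torch
import torch.nn as nn

class ToyPointmapProbe(nn.Module):
    """Toy probe: per-patch features -> per-pixel (x, y, z) + confidence."""

    def __init__(self, feat_dim=1024, patch=16):
        super().__init__()
        self.patch = patch
        # Each patch token predicts patch*patch pixels, 3 coords + 1 confidence.
        self.head = nn.Linear(feat_dim, patch * patch * 4)

    def forward(self, feats, H, W):
        # feats: (B, N, C) patch tokens from one transformer layer.
        B, N, C = feats.shape
        out = self.head(feats)                       # (B, N, patch*patch*4)
        h, w = H // self.patch, W // self.patch
        out = out.view(B, h, w, self.patch, self.patch, 4)
        out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, 4)
        pts3d, conf = out[..., :3], out[..., 3]
        return {"pts3d": pts3d, "conf": conf.exp()}  # exp keeps confidence positive

probe = ToyPointmapProbe()
feats = torch.randn(2, (512 // 16) ** 2, 1024)       # fake features, 512x512 views
pred = probe(feats, 512, 512)
print(pred["pts3d"].shape, pred["conf"].shape)
# torch.Size([2, 512, 512, 3]) torch.Size([2, 512, 512])
```

The pretrained probes distributed here play this role at several transformer layers, which is what enables the layer-wise comparison.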
This work accompanies the paper "Understanding Multi-View Transformers" (ICCV 2025 E2E3D Workshop).
Inputs: images of shape (B, 3, H, W), normalized to [-1, 1].

Outputs per probed layer:
- pts3d: (B, H, W, 3) 3D pointmap
- conf: (B, H, W) confidence map

```python
import requests
import torch
from PIL import Image
import torchvision.transforms as T

from src.models.probes import PointmapProbes

# Load the DUSt3R backbone together with its pretrained pointmap probes.
model, probes = PointmapProbes.load_backbone_and_probe(
    "jgaubil/und3rstand-dust3r-512-dpt"
)
model.eval()
probes.eval()

# Example two-view input pair.
view1_path = "https://raw.githubusercontent.com/JulienGaubil/und3rstand/main/assets/samples/example_view1.jpg"
view2_path = "https://raw.githubusercontent.com/JulienGaubil/und3rstand/main/assets/samples/example_view2.jpg"

# Resize, center-crop, and normalize to [-1, 1].
transform = T.Compose([
    T.Resize(512),
    T.CenterCrop(512),
    T.ToTensor(),
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
view1_images = transform(
    Image.open(requests.get(view1_path, stream=True).raw).convert("RGB")
).unsqueeze(0)
view2_images = transform(
    Image.open(requests.get(view2_path, stream=True).raw).convert("RGB")
).unsqueeze(0)

# Extract intermediate features and decode a pointmap at every probed layer.
with torch.no_grad():
    feat_list = model(view1_images, view2_images)
    outputs = probes(feat_list)

for layer_id, (pred1, pred2) in zip(model.probed_layers.layer_ids, outputs):
    print(f"{layer_id}: pts3d={pred1['pts3d'].shape}, conf={pred1['conf'].shape}")
```
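The per-layer confidence maps can be used to keep only reliable predictions. Below is a hypothetical post-processing step, using random stand-in tensors in place of an actual `pred1`; the threshold value is an illustrative assumption, not a value from the paper.

```python
import torch

# Stand-ins for one probe output (pred1["pts3d"], pred1["conf"]).
pts3d = torch.randn(1, 512, 512, 3)   # per-pixel 3D points
conf = torch.rand(1, 512, 512) * 5.0  # per-pixel positive confidence

# Keep only pixels whose confidence exceeds a threshold (value is illustrative),
# yielding a sparse set of confident 3D points for this layer.
mask = conf > 1.5
points = pts3d[mask]                  # (K, 3) confident 3D points
print(points.shape, mask.float().mean().item())
```

Comparing such filtered point sets across layers is one simple way to see where geometric structure emerges in the network.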
```bibtex
@inproceedings{stary2025understanding,
  title={{Understanding Multi-View Transformers}},
  author={Star{\'y}, Michal and Gaubil, Julien and Tewari, Ayush and Sitzmann, Vincent},
  booktitle={ICCV 2025 E2E3D Workshop},
  year={2025}
}
```
Base model
naver/DUSt3R_ViTLarge_BaseDecoder_512_dpt