metadata
license: other
license_name: sapiens2-license
license_link: https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md
pipeline_tag: image-feature-extraction
library_name: sapiens
tags:
  - sapiens
  - sapiens2
  - vision-transformer
  - human-centric
  - pretrained-backbone
  - feature-extraction

Sapiens2-0.1B

Sapiens2 is a family of high-resolution vision transformers pretrained on 1 billion human images and designed for human-centric tasks such as pose estimation, body-part segmentation, surface normal estimation, and pointmap prediction.

This repository contains the pretrained 0.1B backbone (114M parameters). It outputs dense per-patch features that downstream task heads can consume during fine-tuning.

Model Details

  • Developed by: Meta
  • Model type: Vision Transformer (RoPE, GQA, SwiGLU, RMSNorm, QK-norm)
  • License: Sapiens2 License
  • Task: pretrain
  • Format: safetensors
  • File: sapiens2_0.1b_pretrain.safetensors

Quick Start

Clone the Sapiens2 repo and install it in editable mode from the repo root (pip install -e .).

import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from sapiens.backbones.standalone.sapiens2 import Sapiens2

# Build the model and load the pretrained checkpoint
model = Sapiens2(arch="sapiens2_0.1b", img_size=(1024, 768), patch_size=16).eval().cuda()  # img_size is (H, W)
ckpt_path = hf_hub_download(repo_id="facebook/sapiens2-pretrain-0.1b", filename="sapiens2_0.1b_pretrain.safetensors")
model.load_state_dict(load_file(ckpt_path))

# Forward pass on a single image (RGB; ImageNet normalization recommended)
x = torch.randn(1, 3, 1024, 768).cuda()
with torch.no_grad():
    features = model(x)[0]  # dense backbone features: (B, num_tokens, embed_dim)
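The Quick Start comment recommends ImageNet normalization but feeds random noise to the model. A minimal preprocessing sketch follows, assuming the input is an RGB `uint8` array already resized to 1024 × 768 (H × W). The helper name `preprocess` is hypothetical and not part of the Sapiens2 repo; the mean and standard deviation are the standard ImageNet statistics for inputs scaled to [0, 1].

```python
import numpy as np

# Standard ImageNet statistics for RGB inputs scaled to [0, 1].
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess(image_rgb_uint8: np.ndarray) -> np.ndarray:
    """Hypothetical helper: HxWx3 uint8 RGB -> 1x3xHxW float32, ImageNet-normalized."""
    if image_rgb_uint8.ndim != 3 or image_rgb_uint8.shape[2] != 3:
        raise ValueError("expected an HxWx3 RGB image")
    x = image_rgb_uint8.astype(np.float32) / 255.0   # scale to [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD           # per-channel normalization
    x = np.transpose(x, (2, 0, 1))                   # HWC -> CHW
    return x[np.newaxis, ...]                        # add the batch dimension


# Stand-in for a real image that has already been resized to 1024 x 768 (H x W).
dummy = np.full((1024, 768, 3), 128, dtype=np.uint8)
batch = preprocess(dummy)
print(batch.shape, batch.dtype)
```

The resulting array can be passed to the model with `torch.from_numpy(batch).cuda()` in place of the random tensor `x`.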

Model Card

| Field | Value |
|---|---|
| Architecture | Sapiens2 ViT (RoPE, GQA, SwiGLU, RMSNorm, QK-norm) |
| Parameters | 0.114 B |
| FLOPs | 0.342 T |
| Embedding dim | 768 |
| Layers | 12 |
| Attention heads | 12 |
| Pretraining resolution | 1024 × 768 (H × W) |
| Patch size | 16 |
| Pretraining data | 1B human images |

Sapiens2 Family

| Model | Params | FLOPs | Embed dim | Layers | Heads |
|---|---|---|---|---|---|
| Sapiens2-0.1B (this) | 0.114 B | 0.342 T | 768 | 12 | 12 |
| Sapiens2-0.4B | 0.398 B | 1.260 T | 1024 | 24 | 16 |
| Sapiens2-0.8B | 0.818 B | 2.592 T | 1280 | 32 | 16 |
| Sapiens2-1B | 1.462 B | 4.715 T | 1536 | 40 | 24 |
| Sapiens2-5B | 5.071 B | 15.722 T | 2432 | 56 | 32 |

See the Sapiens2 Collection for all variants and downstream task checkpoints (pose, segmentation, normals, pointmaps).
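When choosing a variant programmatically, the family table can be transcribed into a small lookup structure. The sketch below is illustrative only: `Variant`, `FAMILY`, and `largest_within` are hypothetical names, not part of the Sapiens2 repo, and the per-head width assumes the embedding dimension is split evenly across the listed attention heads, which the card does not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    name: str
    params_b: float   # parameters, in billions
    flops_t: float    # FLOPs, in tera-FLOPs (as listed in the table)
    embed_dim: int
    layers: int
    heads: int

    @property
    def head_dim(self) -> int:
        # Assumption: embed_dim is divided evenly among the attention heads.
        return self.embed_dim // self.heads


# Values copied from the family table above.
FAMILY = [
    Variant("Sapiens2-0.1B", 0.114, 0.342, 768, 12, 12),
    Variant("Sapiens2-0.4B", 0.398, 1.260, 1024, 24, 16),
    Variant("Sapiens2-0.8B", 0.818, 2.592, 1280, 32, 16),
    Variant("Sapiens2-1B", 1.462, 4.715, 1536, 40, 24),
    Variant("Sapiens2-5B", 5.071, 15.722, 2432, 56, 32),
]


def largest_within(budget_b: float) -> Optional[Variant]:
    """Return the largest variant whose parameter count fits the budget, if any."""
    fitting = [v for v in FAMILY if v.params_b <= budget_b]
    return max(fitting, key=lambda v: v.params_b) if fitting else None


choice = largest_within(1.0)
print(choice.name if choice else "no variant fits")
for v in FAMILY:
    print(v.name, v.head_dim)
```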

Intended Use

  • Feature extraction for human-centric downstream tasks
  • Initialization for fine-tuning task heads (pose, segmentation, normals, pointmap)
  • Research on human-centric vision
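Dense task heads usually operate on a 2D feature map rather than a flat token sequence. The sketch below shows one way to reshape the `(B, num_tokens, embed_dim)` output from the Quick Start into `(B, embed_dim, H/16, W/16)`. It assumes the output contains only patch tokens in row-major order, with no class or register tokens; the card does not confirm this, so check the actual output shape before relying on it. The function name `tokens_to_feature_map` is hypothetical, and NumPy is used only to keep the example lightweight (the same `reshape` and axis permutation apply to a torch tensor).

```python
import numpy as np


def tokens_to_feature_map(tokens, img_size=(1024, 768), patch_size=16):
    """Hypothetical helper: (B, N, C) patch tokens -> (B, C, H/ps, W/ps) feature map.

    Assumes N equals the number of patches and tokens are in row-major order.
    """
    b, n, c = tokens.shape
    grid_h, grid_w = img_size[0] // patch_size, img_size[1] // patch_size
    if n != grid_h * grid_w:
        raise ValueError(
            f"expected {grid_h * grid_w} patch tokens, got {n}; "
            "the output may include extra (non-patch) tokens"
        )
    # (B, N, C) -> (B, grid_h, grid_w, C) -> (B, C, grid_h, grid_w)
    return tokens.reshape(b, grid_h, grid_w, c).transpose(0, 3, 1, 2)


# Stand-in for the backbone output at 1024 x 768 with patch size 16.
tokens = np.zeros((1, 64 * 48, 768), dtype=np.float32)
fmap = tokens_to_feature_map(tokens)
print(fmap.shape)  # (1, 768, 64, 48)
```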

License

Released under the Sapiens2 License.

Citation

@inproceedings{khirodkar2026sapiens2,
  title={Sapiens2},
  author={Khirodkar, Rawal and Wen, He and Martinez, Julieta and Dong, Yuan and Zhaoen, Su and Saito, Shunsuke},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}