# MODA-Fashion-DeepFashion2
A fashion image-to-image retrieval model fine-tuned on cross-domain shop↔consumer pairs.

MODA-Fashion-DeepFashion2 is a ViT-B-16-SigLIP checkpoint with a fine-tuned vision encoder. It reaches 66.52% Fine Recall@1 on LookBench, +2.68 over FashionSigLIP, using only 13,557 training triplets and no distillation.
## Highlights
- +2.68 Fine R@1 over FashionSigLIP on LookBench Overall
- +9.37 on AIGen-StreetLook — the hardest cross-domain subset
- Trained on only 13,557 DeepFashion2 triplets (no LookBench data)
- Same architecture as FashionSigLIP — drop-in replacement
- No ensemble or distillation needed
## LookBench Results
| Model | Params | Dim | Fine R@1 | Coarse R@1 | nDCG@5 |
|---|---|---|---|---|---|
| FashionSigLIP | 203M | 768 | 63.84 | 83.67 | 49.63 |
| MODA-Fashion-DeepFashion2 | 203M | 768 | 66.52 | 85.67 | 52.46 |
### Per-subset Fine Recall@1
| Subset | Queries | FashionSigLIP | Ours | Delta |
|---|---|---|---|---|
| RealStudioFlat | 1,011 | 66.96 | 69.63 | +2.67 |
| AIGen-Studio | 193 | 76.68 | 77.20 | +0.52 |
| RealStreetLook | 981 | 56.37 | 58.41 | +2.04 |
| AIGen-StreetLook | 160 | 74.38 | 83.75 | +9.37 |
| Overall | 2,345 | 63.84 | 66.52 | +2.68 |
## Model Spec
| Property | Value |
|---|---|
| Architecture | ViT-B/16-SigLIP (full CLIP: vision + text) |
| Parameters | 203.2M |
| Embedding Dimension | 768 |
| Output | L2-normalized float32 vector |
| Model Size (safetensors) | ~775 MB |
| Model Size (pytorch .bin) | ~775 MB |
| Input Resolution | 224 × 224 |
| Framework | OpenCLIP |
| Precision | float32 |
## Inference — Quick Start

A standalone `inference.py` is included in this directory.

```bash
# Single image → 768-d embedding
python inference.py --image query.jpg

# Two images → embeddings + cosine similarity
python inference.py --image img1.jpg img2.jpg --similarity

# Run on GPU/MPS
python inference.py --image query.jpg --device cuda
```
## Python API
```python
import open_clip
import torch
import torch.nn.functional as F
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP",
    pretrained="path/to/moda-fashion-deepfashion2/open_clip_model.safetensors",
)
model.eval()

image = preprocess(Image.open("query.jpg")).unsqueeze(0)
with torch.no_grad():
    features = model.encode_image(image)
    features = F.normalize(features, p=2, dim=-1)  # [1, 768]
```
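To score two images against each other (mirroring the CLI `--similarity` flag), the same pipeline can be applied to both and combined with a dot product, since the embeddings are L2-normalized. A minimal sketch continuing from the snippet above; the `embed` helper and file names are illustrative, not part of the released code:

```python
def embed(path):
    # Preprocess, encode, and L2-normalize one image to a [1, 768] vector.
    image = preprocess(Image.open(path)).unsqueeze(0)
    with torch.no_grad():
        features = model.encode_image(image)
    return F.normalize(features, p=2, dim=-1)

query, candidate = embed("img1.jpg"), embed("img2.jpg")
similarity = (query @ candidate.T).item()  # cosine similarity in [-1, 1]
print(f"cosine similarity: {similarity:.4f}")
```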
## Requirements

```
open_clip_torch>=2.20.0
torch>=2.0
Pillow
safetensors
```
## Training Details
- Base model: Marqo-FashionSigLIP (ViT-B-16-SigLIP, webli pretrained)
- Fine-tuned components: Vision encoder only (image tower)
- Training data: DeepFashion2 cross-domain shop↔consumer image pairs
- Triplets: 13,557 train + 714 validation
- Loss: InfoNCE + L2 weight drift regularization (see the sketch after this list)
- Temperature: 0.07
- Alignment weight: 0.3
- Optimizer: AdamW, learning rate 2e-6, batch size 24
- Epochs: 4 (best at epoch 3, val triplet accuracy = 99.6%)
- BBox cropping: Uses DeepFashion2 bounding box annotations for item-level crops
- Hardware: Apple M-series (MPS)
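The loss listed above combines a symmetric InfoNCE term over matched shop↔consumer embeddings with an L2 penalty that keeps the fine-tuned vision tower close to the base weights. The sketch below is one reading of that recipe, not the training code from this repository; in particular, applying the 0.3 alignment weight to the drift term and the helper names (`info_nce`, `weight_drift`, `base_state`) are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(shop_emb, consumer_emb, temperature=0.07):
    # Symmetric contrastive loss: matching shop/consumer crops share a row index.
    shop_emb = F.normalize(shop_emb, dim=-1)
    consumer_emb = F.normalize(consumer_emb, dim=-1)
    logits = shop_emb @ consumer_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

def weight_drift(model, base_state):
    # L2 distance between current trainable weights and the frozen base checkpoint.
    return sum((p - base_state[name]).pow(2).sum()
               for name, p in model.named_parameters() if p.requires_grad)

def total_loss(shop_emb, consumer_emb, model, base_state, alignment_weight=0.3):
    # Assumption: the alignment weight of 0.3 scales the weight-drift penalty.
    return info_nce(shop_emb, consumer_emb) + alignment_weight * weight_drift(model, base_state)
```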
## Why It Works
The key insight is cross-domain contrastive learning. DeepFashion2 contains pairs of the same product photographed in two very different conditions:
- Shop images: Clean studio photos (white background, centered)
- Consumer images: In-the-wild photos (varied backgrounds, angles, lighting)
Training the vision encoder to match these pairs teaches the model to look past domain differences and focus on the product's intrinsic visual features — exactly what LookBench tests.
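In retrieval terms, that means a single consumer (in-the-wild) query embedding can be ranked against a pre-computed gallery of shop embeddings with one matrix product. A minimal sketch reusing the hypothetical `embed` helper from the Python API section; the gallery file names are illustrative:

```python
import torch

gallery_paths = ["shop_001.jpg", "shop_002.jpg", "shop_003.jpg"]  # illustrative
gallery = torch.cat([embed(p) for p in gallery_paths], dim=0)     # [N, 768], L2-normalized

query = embed("consumer_query.jpg")                               # [1, 768]
scores = (query @ gallery.T).squeeze(0)                           # cosine similarities
top = scores.topk(k=min(3, len(gallery_paths)))
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{gallery_paths[idx]}: {score:.4f}")
```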
## Related Models
| Model | Dim | Fine R@1 | Best for |
|---|---|---|---|
| MODA-Fashion-Distilled | 768 | 67.63 | Best overall quality |
| MODA-Fashion-Matryoshka | 64-768 | 67.42 (256d) | Flexible dim, 3x smaller index |
| MODA-Fashion-Vision-FP16 | 768 | 67.42 | Smallest (186 MB), edge/mobile |
| MODA-Fashion-Distilled-512d | 512 | 67.63 | Compact index, highest nDCG@5 |
| MODA-Fashion-DeepFashion2 (this model) | 768 | 66.52 | Simplest recipe, no distillation |
## License
MIT
## Citation

If you use this model, please cite:

```bibtex
@software{moda2026,
  title  = {MODA: Open-source benchmark and models for fashion search},
  author = {Hopit AI},
  year   = {2026},
  url    = {https://github.com/hopit-ai/Moda}
}
```