MODA-Fashion-DeepFashion2

Fashion image-to-image retrieval model fine-tuned on cross-domain shop↔consumer pairs.

MODA-Fashion-DeepFashion2 is a ViT-B-16-SigLIP model whose vision encoder was fine-tuned on cross-domain shop↔consumer pairs from DeepFashion2. It reaches 66.52% Fine Recall@1 on LookBench, beating FashionSigLIP by +2.68 points with just 13.5K training triplets and no distillation.

Highlights

  • +2.68 Fine R@1 over FashionSigLIP on LookBench Overall
  • +9.37 Fine R@1 on AIGen-StreetLook, the largest per-subset gain
  • Trained on only 13,557 DeepFashion2 triplets (no LookBench data)
  • Same architecture as FashionSigLIP — drop-in replacement
  • No ensemble or distillation needed

LookBench Results

Model                      | Params | Dim | Fine R@1 | Coarse R@1 | nDCG@5
FashionSigLIP              | 203M   | 768 | 63.84    | 83.67      | 49.63
MODA-Fashion-DeepFashion2  | 203M   | 768 | 66.52    | 85.67      | 52.46

Per-subset Fine Recall@1

Subset            | Queries | FashionSigLIP | Ours  | Delta
RealStudioFlat    | 1,011   | 66.96         | 69.63 | +2.67
AIGen-Studio      | 193     | 76.68         | 77.20 | +0.52
RealStreetLook    | 981     | 56.37         | 58.41 | +2.04
AIGen-StreetLook  | 160     | 74.38         | 83.75 | +9.37
Overall           | 2,345   | 63.84         | 66.52 | +2.68

Model Spec

Property                   | Value
Architecture               | ViT-B/16-SigLIP (full CLIP: vision + text)
Parameters                 | 203.2M
Embedding Dimension        | 768
Output                     | L2-normalized float32 vector
Model Size (safetensors)   | ~775 MB
Model Size (pytorch .bin)  | ~775 MB
Input Resolution           | 224 × 224
Framework                  | OpenCLIP
Precision                  | float32

Inference — Quick Start

A standalone inference.py is included in this directory.

# Single image → 768-d embedding
python inference.py --image query.jpg

# Two images → embeddings + cosine similarity
python inference.py --image img1.jpg img2.jpg --similarity

# Run on GPU or Apple Silicon (MPS)
python inference.py --image query.jpg --device cuda   # or --device mps

Python API

import open_clip
import torch
import torch.nn.functional as F
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP",
    pretrained="path/to/moda-fashion-deepfashion2/open_clip_model.safetensors",
)
model.eval()

image = preprocess(Image.open("query.jpg")).unsqueeze(0)
with torch.no_grad():
    features = model.encode_image(image)
    features = F.normalize(features, p=2, dim=-1)  # [1, 768]
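
Since the embeddings are L2-normalized, cosine similarity is just a dot product. The continuation below ranks a small gallery against the query embedding computed above; the gallery file names are placeholders, not files shipped with the model.

# Rank a small gallery by cosine similarity to the query embedding from above.
# Gallery file names are placeholders.
gallery_paths = ["shop_001.jpg", "shop_002.jpg", "shop_003.jpg"]
gallery = torch.stack([preprocess(Image.open(p)) for p in gallery_paths])  # [N, 3, 224, 224]

with torch.no_grad():
    gallery_feats = F.normalize(model.encode_image(gallery), p=2, dim=-1)  # [N, 768]

scores = (features @ gallery_feats.T).squeeze(0)  # cosine similarities (both sides unit-norm)
for idx in scores.argsort(descending=True).tolist():
    print(f"{gallery_paths[idx]}: {scores[idx].item():.4f}")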

Requirements

open_clip_torch>=2.20.0
torch>=2.0
Pillow
safetensors
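
Assuming a standard Python environment with a working PyTorch backend for your device, the dependencies above can be installed with pip:

pip install "open_clip_torch>=2.20.0" "torch>=2.0" Pillow safetensors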

Training Details

  • Base model: Marqo-FashionSigLIP (ViT-B-16-SigLIP, webli pretrained)
  • Fine-tuned components: Vision encoder only (image tower)
  • Training data: DeepFashion2 cross-domain shop↔consumer image pairs
  • Triplets: 13,557 train + 714 validation
  • Loss: InfoNCE + L2 weight drift regularization (see the sketch after this list)
  • Temperature: 0.07
  • Alignment weight: 0.3
  • Optimizer: AdamW, LR=2e-6, batch=24
  • Epochs: 4 (best at epoch 3, val triplet accuracy = 99.6%)
  • BBox cropping: Uses DeepFashion2 bounding box annotations for item-level crops
  • Hardware: Apple M-series (MPS)
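
The loss above can be read as a symmetric InfoNCE term over in-batch shop↔consumer pairs plus an L2 penalty that keeps the vision weights close to the base checkpoint. The sketch below reflects that reading; the function name, how the drift term is averaged, and the assumption that the 0.3 alignment weight scales it are illustrative, not the actual training code.

import torch
import torch.nn.functional as F

def contrastive_drift_loss(shop_emb, consumer_emb, model, base_vision_state,
                           temperature=0.07, alignment_weight=0.3):
    """shop_emb, consumer_emb: [B, 768] L2-normalized embeddings of paired item crops."""
    # Symmetric InfoNCE: each shop crop should retrieve its consumer counterpart
    # (and vice versa) against the other pairs in the batch.
    logits = shop_emb @ consumer_emb.t() / temperature          # [B, B] similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    info_nce = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

    # L2 weight-drift penalty: keep the fine-tuned vision tower close to the base
    # FashionSigLIP weights (base_vision_state is the base image tower's state_dict).
    drift_terms = [
        (p - base_vision_state[name].to(p.device)).pow(2).mean()
        for name, p in model.visual.named_parameters()
    ]
    drift = torch.stack(drift_terms).mean()

    return info_nce + alignment_weight * drift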

Why It Works

The key insight is cross-domain contrastive learning. DeepFashion2 contains pairs of the same product photographed in two very different conditions:

  • Shop images: Clean studio photos (white background, centered)
  • Consumer images: In-the-wild photos (varied backgrounds, angles, lighting)

Training the vision encoder to match these pairs teaches the model to look past domain differences and focus on the product's intrinsic visual features — exactly what LookBench tests.
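
As a concrete illustration of how such pairs can be assembled from the item-level bbox crops mentioned under Training Details, the sketch below groups DeepFashion2 crops by (pair_id, style) and by domain. Field names (source, pair_id, bounding_box, style) follow the public DeepFashion2 annotation format; the directory layout, function name, and grouping logic are assumptions, not the exact pipeline that produced the 13,557 triplets.

import json
from collections import defaultdict
from pathlib import Path
from PIL import Image

def load_crops(image_dir, anno_dir):
    """Return {(pair_id, style): {"shop": [...], "user": [...]}} of cropped PIL images."""
    groups = defaultdict(lambda: {"shop": [], "user": []})
    for anno_path in Path(anno_dir).glob("*.json"):
        anno = json.loads(anno_path.read_text())
        image = Image.open(Path(image_dir) / f"{anno_path.stem}.jpg").convert("RGB")
        domain = anno["source"]                       # "shop" or "user" (consumer)
        for key, item in anno.items():
            if not key.startswith("item"):
                continue
            if item.get("style", 0) == 0:
                continue                              # style 0 items have no cross-domain match
            x1, y1, x2, y2 = item["bounding_box"]     # item-level crop, as used in training
            groups[(anno["pair_id"], item["style"])][domain].append(image.crop((x1, y1, x2, y2)))
    return groups

# Each group with at least one shop and one consumer crop yields a positive pair;
# crops from other groups in the batch act as in-batch negatives for InfoNCE.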

Related Models

Model                                   | Dim    | Fine R@1     | Best for
MODA-Fashion-Distilled                  | 768    | 67.63        | Best overall quality
MODA-Fashion-Matryoshka                 | 64-768 | 67.42 (256d) | Flexible dim, 3x smaller index
MODA-Fashion-Vision-FP16                | 768    | 67.42        | Smallest (186 MB), edge/mobile
MODA-Fashion-Distilled-512d             | 512    | 67.63        | Compact index, highest nDCG@5
MODA-Fashion-DeepFashion2 (this model)  | 768    | 66.52        | Simplest recipe, no distillation

License

MIT

Citation

If you use this model, please cite:

@software{moda2026,
  title  = {MODA: Open-source benchmark and models for fashion search},
  author = {Hopit AI},
  year   = {2026},
  url    = {https://github.com/hopit-ai/Moda}
}