MODA-Fashion-DeepFashion2

Fashion image-to-image retrieval model fine-tuned on cross-domain shop↔consumer pairs.

MODA-Fashion-DeepFashion2 is a ViT-B-16-SigLIP model whose vision encoder was fine-tuned on cross-domain shop↔consumer pairs from DeepFashion2. It reaches 66.52% Fine Recall@1 on LookBench, beating FashionSigLIP by +2.68 points with just 13.5K training triplets and no distillation.

Highlights

  • +2.68 Fine R@1 over FashionSigLIP on LookBench Overall
  • +9.37 Fine R@1 on AIGen-StreetLook, the largest per-subset gain
  • Trained on only 13,557 DeepFashion2 triplets (no LookBench data)
  • Same architecture as FashionSigLIP — drop-in replacement
  • No ensemble or distillation needed

LookBench Results

Model                      | Params | Dim | Fine R@1 | Coarse R@1 | nDCG@5
FashionSigLIP              | 203M   | 768 | 63.84    | 83.67      | 49.63
MODA-Fashion-DeepFashion2  | 203M   | 768 | 66.52    | 85.67      | 52.46

Per-subset Fine Recall@1

Subset            | Queries | FashionSigLIP | Ours  | Delta
RealStudioFlat    | 1,011   | 66.96         | 69.63 | +2.67
AIGen-Studio      | 193     | 76.68         | 77.20 | +0.52
RealStreetLook    | 981     | 56.37         | 58.41 | +2.04
AIGen-StreetLook  | 160     | 74.38         | 83.75 | +9.37
Overall           | 2,345   | 63.84         | 66.52 | +2.68

Model Spec

Property                   | Value
Architecture               | ViT-B/16-SigLIP (full CLIP: vision + text)
Parameters                 | 203.2M
Embedding Dimension        | 768
Output                     | L2-normalized float32 vector
Model Size (safetensors)   | ~775 MB
Model Size (pytorch .bin)  | ~775 MB
Input Resolution           | 224 × 224
Framework                  | OpenCLIP
Precision                  | float32

Inference — Quick Start

A standalone inference.py is included in this directory.

# Single image → 768-d embedding
python inference.py --image query.jpg

# Two images → embeddings + cosine similarity
python inference.py --image img1.jpg img2.jpg --similarity

# Run on GPU or Apple Silicon (MPS)
python inference.py --image query.jpg --device cuda   # or --device mps

Python API

import open_clip
import torch
import torch.nn.functional as F
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP",
    pretrained="path/to/moda-fashion-deepfashion2/open_clip_model.safetensors",
)
model.eval()

image = preprocess(Image.open("query.jpg")).unsqueeze(0)
with torch.no_grad():
    features = model.encode_image(image)
    features = F.normalize(features, p=2, dim=-1)  # [1, 768]
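
Since the embeddings are L2-normalized, cosine similarity is just a dot product. The continuation below ranks a small gallery against the query embedding computed above; the gallery file names are placeholders, not files shipped with the model.

# Rank a small gallery by cosine similarity to the query embedding from above.
# Gallery file names are placeholders.
gallery_paths = ["shop_001.jpg", "shop_002.jpg", "shop_003.jpg"]
gallery = torch.stack([preprocess(Image.open(p)) for p in gallery_paths])  # [N, 3, 224, 224]

with torch.no_grad():
    gallery_feats = F.normalize(model.encode_image(gallery), p=2, dim=-1)  # [N, 768]

scores = (features @ gallery_feats.T).squeeze(0)  # cosine similarities (both sides unit-norm)
for idx in scores.argsort(descending=True).tolist():
    print(f"{gallery_paths[idx]}: {scores[idx].item():.4f}")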

Requirements

open_clip_torch>=2.20.0
torch>=2.0
Pillow
safetensors
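
Assuming a standard Python environment with a working PyTorch backend for your device, the dependencies above can be installed with pip:

pip install "open_clip_torch>=2.20.0" "torch>=2.0" Pillow safetensors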

Training Details

  • Base model: Marqo-FashionSigLIP (ViT-B-16-SigLIP, webli pretrained)
  • Fine-tuned components: Vision encoder only (image tower)
  • Training data: DeepFashion2 cross-domain shop↔consumer image pairs
  • Triplets: 13,557 train + 714 validation
  • Loss: InfoNCE + L2 weight drift regularization (see the sketch after this list)
  • Temperature: 0.07
  • Alignment weight: 0.3
  • Optimizer: AdamW, LR=2e-6, batch=24
  • Epochs: 4 (best at epoch 3, val triplet accuracy = 99.6%)
  • BBox cropping: Uses DeepFashion2 bounding box annotations for item-level crops
  • Hardware: Apple M-series (MPS)
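
The loss above can be read as a symmetric InfoNCE term over in-batch shop↔consumer pairs plus an L2 penalty that keeps the vision weights close to the base checkpoint. The sketch below reflects that reading; the function name, how the drift term is averaged, and the assumption that the 0.3 alignment weight scales it are illustrative, not the actual training code.

import torch
import torch.nn.functional as F

def contrastive_drift_loss(shop_emb, consumer_emb, model, base_vision_state,
                           temperature=0.07, alignment_weight=0.3):
    """shop_emb, consumer_emb: [B, 768] L2-normalized embeddings of paired item crops."""
    # Symmetric InfoNCE: each shop crop should retrieve its consumer counterpart
    # (and vice versa) against the other pairs in the batch.
    logits = shop_emb @ consumer_emb.t() / temperature          # [B, B] similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    info_nce = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

    # L2 weight-drift penalty: keep the fine-tuned vision tower close to the base
    # FashionSigLIP weights (base_vision_state is the base image tower's state_dict).
    drift_terms = [
        (p - base_vision_state[name].to(p.device)).pow(2).mean()
        for name, p in model.visual.named_parameters()
    ]
    drift = torch.stack(drift_terms).mean()

    return info_nce + alignment_weight * drift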

Why It Works

The key insight is cross-domain contrastive learning. DeepFashion2 contains pairs of the same product photographed in two very different conditions:

  • Shop images: Clean studio photos (white background, centered)
  • Consumer images: In-the-wild photos (varied backgrounds, angles, lighting)

Training the vision encoder to match these pairs teaches the model to look past domain differences and focus on the product's intrinsic visual features — exactly what LookBench tests.
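
As a concrete illustration of how such pairs can be assembled from the item-level bbox crops mentioned under Training Details, the sketch below groups DeepFashion2 crops by (pair_id, style) and by domain. Field names (source, pair_id, bounding_box, style) follow the public DeepFashion2 annotation format; the directory layout, function name, and grouping logic are assumptions, not the exact pipeline that produced the 13,557 triplets.

import json
from collections import defaultdict
from pathlib import Path
from PIL import Image

def load_crops(image_dir, anno_dir):
    """Return {(pair_id, style): {"shop": [...], "user": [...]}} of cropped PIL images."""
    groups = defaultdict(lambda: {"shop": [], "user": []})
    for anno_path in Path(anno_dir).glob("*.json"):
        anno = json.loads(anno_path.read_text())
        image = Image.open(Path(image_dir) / f"{anno_path.stem}.jpg").convert("RGB")
        domain = anno["source"]                       # "shop" or "user" (consumer)
        for key, item in anno.items():
            if not key.startswith("item"):
                continue
            if item.get("style", 0) == 0:
                continue                              # style 0 items have no cross-domain match
            x1, y1, x2, y2 = item["bounding_box"]     # item-level crop, as used in training
            groups[(anno["pair_id"], item["style"])][domain].append(image.crop((x1, y1, x2, y2)))
    return groups

# Each group with at least one shop and one consumer crop yields a positive pair;
# crops from other groups in the batch act as in-batch negatives for InfoNCE.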

Related Models

Model                                   | Dim    | Fine R@1     | Best for
MODA-Fashion-Distilled                  | 768    | 67.63        | Best overall quality
MODA-Fashion-Matryoshka                 | 64-768 | 67.42 (256d) | Flexible dim, 3x smaller index
MODA-Fashion-Vision-FP16                | 768    | 67.42        | Smallest (186 MB), edge/mobile
MODA-Fashion-Distilled-512d             | 512    | 67.63        | Compact index, highest nDCG@5
MODA-Fashion-DeepFashion2 (this model)  | 768    | 66.52        | Simplest recipe, no distillation

License

MIT

Citation

If you use this model, please cite:

@software{moda2026,
  title  = {MODA: Open-source benchmark and models for fashion search},
  author = {Hopit AI},
  year   = {2026},
  url    = {https://github.com/hopit-ai/Moda}
}