---
tags:
- fashion
- image-retrieval
- image-to-image
- siglip
- lookbench
- embedding
- deepfashion2
library_name: open_clip
pipeline_tag: image-feature-extraction
license: mit
language:
- en
metrics:
- recall
- ndcg
datasets:
- srpone/look-bench
- DeepFashion2
---

# MODA-Fashion-DeepFashion2

**Fashion image-to-image retrieval fine-tuned on cross-domain shop↔consumer pairs.**

MODA-Fashion-DeepFashion2 is a ViT-B-16-SigLIP model with a fine-tuned vision encoder. It reaches **66.52% Fine Recall@1** on [LookBench](https://huggingface.co/datasets/srpone/look-bench), 2.68 points above FashionSigLIP, using only 13,557 training triplets and no distillation.

## Highlights

- **+2.68 points Fine R@1** over FashionSigLIP on LookBench Overall
- **+9.37 points on AIGen-StreetLook**, the largest gain of any subset
- Trained on only 13,557 DeepFashion2 triplets (no LookBench data)
- Same architecture as FashionSigLIP — drop-in replacement
- No ensemble or distillation needed

## LookBench Results

| Model | Params | Dim | Fine R@1 | Coarse R@1 | nDCG@5 |
|---|---:|---:|---:|---:|---:|
| FashionSigLIP | 203M | 768 | 63.84 | 83.67 | 49.63 |
| **MODA-Fashion-DeepFashion2** | **203M** | **768** | **66.52** | **85.67** | **52.46** |

### Per-subset Fine Recall@1

| Subset | Queries | FashionSigLIP | Ours | Delta |
|---|---:|---:|---:|---:|
| RealStudioFlat | 1,011 | 66.96 | **69.63** | +2.67 |
| AIGen-Studio | 193 | 76.68 | **77.20** | +0.52 |
| RealStreetLook | 981 | 56.37 | **58.41** | +2.04 |
| AIGen-StreetLook | 160 | 74.38 | **83.75** | **+9.37** |
| **Overall** | **2,345** | **63.84** | **66.52** | **+2.68** |
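
The Overall row appears to be the query-weighted mean of the four subsets. The sketch below checks that arithmetic using only the numbers in the table. Because the per-subset values are already rounded to two decimals, the result is an approximation, and the delta is taken from the rounded totals to match how the table reports it.

```python
# Query counts and Fine Recall@1 values copied from the per-subset table:
# subset -> (queries, FashionSigLIP, ours)
subsets = {
    "RealStudioFlat": (1011, 66.96, 69.63),
    "AIGen-Studio": (193, 76.68, 77.20),
    "RealStreetLook": (981, 56.37, 58.41),
    "AIGen-StreetLook": (160, 74.38, 83.75),
}


def weighted_recall(column: int) -> float:
    """Query-weighted mean of one column (1 = FashionSigLIP, 2 = ours)."""
    total_queries = sum(row[0] for row in subsets.values())
    weighted_sum = sum(row[0] * row[column] for row in subsets.values())
    return weighted_sum / total_queries


baseline = round(weighted_recall(1), 2)
ours = round(weighted_recall(2), 2)

print(f"FashionSigLIP overall: {baseline:.2f}")  # 63.84
print(f"Ours overall:          {ours:.2f}")      # 66.52
print(f"Delta:                 {ours - baseline:+.2f}")  # +2.68
```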

## Model Spec

| Property | Value |
|---|---|
| **Architecture** | ViT-B/16-SigLIP (full CLIP: vision + text) |
| **Parameters** | 203.2M |
| **Embedding Dimension** | 768 |
| **Output** | L2-normalized float32 vector |
| **Model Size (safetensors)** | ~775 MB |
| **Model Size (pytorch .bin)** | ~775 MB |
| **Input Resolution** | 224 × 224 |
| **Framework** | OpenCLIP |
| **Precision** | float32 |
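
The embedding dimension and precision determine the raw storage cost of a gallery index. The sketch below assumes a flat, uncompressed float32 matrix, which is an assumption: this card does not describe any particular index. The catalog size is hypothetical.

```python
BYTES_PER_FLOAT32 = 4


def index_size_mb(num_images: int, dim: int = 768) -> float:
    """Raw size in MB (10**6 bytes) of a flat float32 embedding matrix."""
    return num_images * dim * BYTES_PER_FLOAT32 / 1e6


# Hypothetical catalog size, chosen only for illustration.
catalog_size = 1_000_000
print(f"{index_size_mb(catalog_size):.0f} MB")  # 3072 MB
```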

## Inference — Quick Start

A standalone `inference.py` is included in this directory.

```bash
# Single image → 768-d embedding
python inference.py --image query.jpg

# Two images → embeddings + cosine similarity
python inference.py --image img1.jpg img2.jpg --similarity

# Run on GPU/MPS
python inference.py --image query.jpg --device cuda
```

### Python API

```python
import open_clip
import torch
import torch.nn.functional as F
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP",
    pretrained="path/to/moda-fashion-deepfashion2/open_clip_model.safetensors",
)
model.eval()

image = preprocess(Image.open("query.jpg")).unsqueeze(0)
with torch.no_grad():
    features = model.encode_image(image)
    features = F.normalize(features, p=2, dim=-1)  # [1, 768]
```
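
Because the embeddings are L2-normalized, the dot product of two vectors equals their cosine similarity, so retrieval reduces to ranking gallery vectors by dot product with the query. The following is a minimal sketch in plain Python so it runs without the model. The toy 3-d vectors and item names are hypothetical stand-ins for real 768-d embeddings, and `rank_gallery` is an illustrative helper, not part of `inference.py`.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length, matching the model's normalized output."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_gallery(query, gallery, k=5):
    """Return the top-k (item_id, cosine_similarity) pairs for one query.

    Assumes query and gallery vectors are already L2-normalized, so the
    dot product is the cosine similarity.
    """
    scores = [
        (item_id, sum(q * g for q, g in zip(query, vec)))
        for item_id, vec in gallery.items()
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Hypothetical toy data standing in for real embeddings.
gallery = {
    "dress_a": l2_normalize([1.0, 0.1, 0.0]),
    "dress_b": l2_normalize([0.0, 1.0, 0.2]),
    "jacket_c": l2_normalize([0.1, 0.0, 1.0]),
}
query = l2_normalize([0.9, 0.2, 0.1])

for item_id, score in rank_gallery(query, gallery, k=2):
    print(f"{item_id}: {score:.3f}")
```

With real embeddings stacked into a `[N, 768]` tensor, the same ranking can be done in one matrix product followed by `torch.topk`.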

### Requirements

```
open_clip_torch>=2.20.0
torch>=2.0
Pillow
safetensors
```

## Training Details

- **Base model**: Marqo-FashionSigLIP (ViT-B-16-SigLIP, WebLI-pretrained)
- **Fine-tuned components**: Vision encoder only (image tower)
- **Training data**: DeepFashion2 cross-domain shop↔consumer image pairs
- **Triplets**: 13,557 train + 714 validation
- **Loss**: InfoNCE + L2 weight drift regularization
- **Temperature**: 0.07
- **Alignment weight**: 0.3
- **Optimizer**: AdamW, LR=2e-6, batch=24
- **Epochs**: 4 (best checkpoint at epoch 3, validation triplet accuracy = 99.6%)
- **BBox cropping**: Uses DeepFashion2 bounding box annotations for item-level crops
- **Hardware**: Apple M-series (MPS)
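
The loss listed above can be sketched as follows. This is an illustration under stated assumptions, not the training code. It assumes in-batch negatives, with the matching positive on the diagonal of the similarity matrix. It also assumes that the 0.3 "alignment weight" is the coefficient on the L2 drift term, which this card does not confirm. The function names are hypothetical.

```python
import math


def info_nce(sim_matrix, temperature=0.07):
    """Mean InfoNCE loss over a batch.

    sim_matrix[i][j] is the cosine similarity between anchor i and
    candidate j; the positive for anchor i is on the diagonal (j == i).
    """
    losses = []
    for i, row in enumerate(sim_matrix):
        logits = [s / temperature for s in row]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(
            sum(math.exp(z - max_logit) for z in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


def l2_drift(weights, base_weights):
    """Squared L2 distance between current and base-model weights."""
    return sum((w - b) ** 2 for w, b in zip(weights, base_weights))


def total_loss(sim_matrix, weights, base_weights,
               temperature=0.07, reg_weight=0.3):
    # ASSUMPTION: reg_weight stands in for the "alignment weight" (0.3);
    # the card does not say which term that weight multiplies.
    return (info_nce(sim_matrix, temperature)
            + reg_weight * l2_drift(weights, base_weights))


# Hypothetical toy batch: two anchors, each most similar to its own positive.
sims = [[0.9, 0.2], [0.1, 0.8]]
base = [0.5, -0.1]
current = [0.5, -0.2]
print(total_loss(sims, current, base))
```

The drift term penalizes moving far from the base model's weights, a common way to limit forgetting when fine-tuning on a small dataset.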

## Why It Works

The key insight is **cross-domain contrastive learning**. DeepFashion2 contains pairs of the *same product* photographed in two very different conditions:
- **Shop images**: Clean studio photos (white background, centered)
- **Consumer images**: In-the-wild photos (varied backgrounds, angles, lighting)

Training the vision encoder to match these pairs teaches the model to look past domain differences and focus on the product's intrinsic visual features — exactly what LookBench tests.
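
One way to check whether shop and consumer images of the same product end up close together is triplet accuracy, the validation metric named under Training Details. The sketch below uses the common definition: the share of (anchor, positive, negative) triplets where the anchor is more similar to the positive than to the negative. Whether training used exactly this definition is an assumption, and the 2-d vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_accuracy(triplets):
    """Fraction of (anchor, positive, negative) triplets ranked correctly."""
    correct = sum(
        1 for anchor, pos, neg in triplets
        if cosine(anchor, pos) > cosine(anchor, neg)
    )
    return correct / len(triplets)


# Hypothetical embeddings: the first triplet is ranked correctly,
# the second is not (the negative is closer to the anchor).
triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),
    ([0.0, 1.0], [1.0, 0.2], [0.1, 1.0]),
]
print(triplet_accuracy(triplets))  # 0.5
```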

## Related Models

| Model | Dim | Fine R@1 | Best for |
|---|---:|---:|---|
| [MODA-Fashion-Distilled](https://huggingface.co/HopitAI/moda-fashion-distilled) | 768 | 67.63 | Best overall quality |
| [MODA-Fashion-Matryoshka](https://huggingface.co/HopitAI/moda-fashion-matryoshka) | 64–768 | 67.42 (256d) | Flexible dim, 3× smaller index |
| [MODA-Fashion-Vision-FP16](https://huggingface.co/HopitAI/moda-fashion-vision-fp16) | 768 | 67.42 | Smallest (186 MB), edge/mobile |
| [MODA-Fashion-Distilled-512d](https://huggingface.co/HopitAI/moda-fashion-distilled-512d) | 512 | 67.63 | Compact index, highest nDCG@5 |
| **MODA-Fashion-DeepFashion2 (this model)** | 768 | 66.52 | Simplest recipe, no distillation |


## License

MIT

## Citation

If you use this model, please cite:

```bibtex
@software{moda2026,
  title  = {MODA: Open-source benchmark and models for fashion search},
  author = {Hopit AI},
  year   = {2026},
  url    = {https://github.com/hopit-ai/Moda}
}
```