---
tags:
- fashion
- image-retrieval
- image-to-image
- siglip
- lookbench
- embedding
- deepfashion2
library_name: open_clip
pipeline_tag: image-feature-extraction
license: mit
language:
- en
metrics:
- recall
- ndcg
datasets:
- srpone/look-bench
- DeepFashion2
---

# MODA-Fashion-DeepFashion2

**Fashion image-to-image retrieval fine-tuned on cross-domain shop↔consumer pairs.**

MODA-Fashion-DeepFashion2 is a ViT-B-16-SigLIP model with a fine-tuned vision encoder. It achieves **66.52% Fine Recall@1** on [LookBench](https://huggingface.co/datasets/srpone/look-bench), 2.68 percentage points above FashionSigLIP, using only 13,557 training triplets and no distillation.

## Highlights

- **+2.68 points Fine R@1** over FashionSigLIP on LookBench overall
- **+9.37 points on AIGen-StreetLook**, the largest gain of any subset
- Trained on only 13,557 DeepFashion2 triplets (no LookBench data)
- Same architecture as FashionSigLIP, so it works as a drop-in replacement
- No ensemble or distillation needed

## LookBench Results

| Model | Params | Dim | Fine R@1 (%) | Coarse R@1 (%) | nDCG@5 (%) |
|---|---:|---:|---:|---:|---:|
| FashionSigLIP | 203M | 768 | 63.84 | 83.67 | 49.63 |
| **MODA-Fashion-DeepFashion2** | **203M** | **768** | **66.52** | **85.67** | **52.46** |

### Per-subset Fine Recall@1

| Subset | Queries | FashionSigLIP | Ours | Delta |
|---|---:|---:|---:|---:|
| RealStudioFlat | 1,011 | 66.96 | **69.63** | +2.67 |
| AIGen-Studio | 193 | 76.68 | **77.20** | +0.52 |
| RealStreetLook | 981 | 56.37 | **58.41** | +2.04 |
| AIGen-StreetLook | 160 | 74.38 | **83.75** | **+9.37** |
| **Overall** | **2,345** | **63.84** | **66.52** | **+2.68** |

## Model Spec

| Property | Value |
|---|---|
| **Architecture** | ViT-B/16-SigLIP (full CLIP: vision + text) |
| **Parameters** | 203.2M |
| **Embedding Dimension** | 768 |
| **Output** | L2-normalized float32 vector |
| **Model Size (safetensors)** | ~775 MB |
| **Model Size (pytorch .bin)** | ~775 MB |
| **Input Resolution** | 224 × 224 |
| **Framework** | OpenCLIP |
| **Precision** | float32 |

## Inference: Quick Start

A standalone `inference.py` is included in this directory.

```bash
# Single image → 768-d embedding
python inference.py --image query.jpg

# Two images → embeddings + cosine similarity
python inference.py --image img1.jpg img2.jpg --similarity

# Run on GPU/MPS
python inference.py --image query.jpg --device cuda
```

### Python API

```python
import open_clip
import torch
import torch.nn.functional as F
from PIL import Image

# Build the ViT-B-16-SigLIP architecture and load the fine-tuned weights
# from a local checkpoint. The third return value is the eval-time preprocessing.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP",
    pretrained="path/to/moda-fashion-deepfashion2/open_clip_model.safetensors",
)
model.eval()

# Preprocess one image and add a batch dimension.
image = preprocess(Image.open("query.jpg")).unsqueeze(0)

with torch.no_grad():
    features = model.encode_image(image)
    # L2-normalize so that dot products are cosine similarities.
    features = F.normalize(features, p=2, dim=-1)  # [1, 768]
```

### Requirements

```
open_clip_torch>=2.20.0
torch>=2.0
Pillow
safetensors
```

## Training Details

- **Base model**: Marqo-FashionSigLIP (ViT-B-16-SigLIP, webli pretrained)
- **Fine-tuned components**: Vision encoder only (image tower)
- **Training data**: DeepFashion2 cross-domain shop↔consumer image pairs
- **Triplets**: 13,557 train + 714 validation
- **Loss**: InfoNCE + L2 weight drift regularization
- **Temperature**: 0.07
- **Alignment weight**: 0.3
- **Optimizer**: AdamW, LR=2e-6, batch=24
- **Epochs**: 4 (best at epoch 3, val triplet accuracy = 99.6%)
- **BBox cropping**: Uses DeepFashion2 bounding box annotations for item-level crops
- **Hardware**: Apple M-series (MPS)

## Why It Works

The key insight is **cross-domain contrastive learning**. DeepFashion2 contains pairs of the *same product* photographed in two very different conditions:

- **Shop images**: Clean studio photos (white background, centered)
- **Consumer images**: In-the-wild photos (varied backgrounds, angles, lighting)

Training the vision encoder to match these pairs teaches the model to look past domain differences and focus on the product's intrinsic visual features, which is exactly what LookBench tests.

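
To make the objective listed under Training Details more concrete, here is a minimal sketch of an InfoNCE loss combined with an L2 weight drift penalty. It is not the actual training code. It assumes a one-directional InfoNCE with in-batch negatives, and it assumes the 0.3 "alignment weight" scales the drift term; neither is confirmed by this card. The function names (`info_nce`, `weight_drift`, `total_loss`) are hypothetical, and plain Python lists stand in for tensors so the arithmetic stays visible.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.07):
    """Mean cross-entropy of each anchor against every positive in the batch.

    The true match for anchors[i] is positives[i]; all other positives
    in the batch act as negatives.
    """
    anchors = [l2_normalize(a) for a in anchors]
    positives = [l2_normalize(p) for p in positives]
    total = 0.0
    for i, a in enumerate(anchors):
        # Cosine similarity to every candidate, sharpened by the temperature.
        logits = [sum(x * y for x, y in zip(a, p)) / temperature for p in positives]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def weight_drift(weights, base_weights):
    """Squared L2 distance between fine-tuned and base-model weights."""
    return sum((w - b) ** 2 for w, b in zip(weights, base_weights))


def total_loss(anchors, positives, weights, base_weights,
               temperature=0.07, drift_weight=0.3):
    # ASSUMPTION: drift_weight corresponds to the card's "alignment weight".
    return (info_nce(anchors, positives, temperature)
            + drift_weight * weight_drift(weights, base_weights))


# Toy batch: two shop embeddings and their consumer-photo counterparts.
shop = [[1.0, 0.0], [0.0, 1.0]]
consumer_aligned = [[0.9, 0.1], [0.1, 0.9]]
consumer_swapped = [[0.1, 0.9], [0.9, 0.1]]

print(info_nce(shop, consumer_aligned))
print(info_nce(shop, consumer_swapped))
```

In this toy batch the correctly paired embeddings give a much lower loss than the swapped ones, which is the signal that pulls shop and consumer views of the same product together. The drift term penalizes moving far from the base weights, which is one plausible reason a small dataset can be used without overwriting what the base model already knows.
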
## Related Models

| Model | Dim | Fine R@1 | Best for |
|---|---:|---:|---|
| [MODA-Fashion-Distilled](https://huggingface.co/HopitAI/moda-fashion-distilled) | 768 | 67.63 | Best overall quality |
| [MODA-Fashion-Matryoshka](https://huggingface.co/HopitAI/moda-fashion-matryoshka) | 64-768 | 67.42 (256d) | Flexible dim, 3x smaller index |
| [MODA-Fashion-Vision-FP16](https://huggingface.co/HopitAI/moda-fashion-vision-fp16) | 768 | 67.42 | Smallest (186 MB), edge/mobile |
| [MODA-Fashion-Distilled-512d](https://huggingface.co/HopitAI/moda-fashion-distilled-512d) | 512 | 67.63 | Compact index, highest nDCG@5 |
| **MODA-Fashion-DeepFashion2 (this model)** | 768 | 66.52 | Simplest recipe, no distillation |

## License

MIT

## Citation

If you use this model, please cite:

```
@software{moda2026,
  title  = {MODA: Open-source benchmark and models for fashion search},
  author = {Hopit AI},
  year   = {2026},
  url    = {https://github.com/hopit-ai/Moda}
}
```