Garment Image Quality Scorer + Feature Extractor

1. What Was Done

Trained a dual-head model on a MobileNetV3-Small backbone. The two heads are:

Quality head: Predicts an image quality score in [0, 1] for garment photos
Embedding head: Produces a 128-dim feature vector for garment matching

This model helps the Vestir app select the best representative image when a user uploads multiple photos of the same garment, and group similar garments together.
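
One way the two outputs could be combined is sketched below: images are grouped greedily by cosine similarity of their embeddings, and the highest-scoring image in each group is kept as the representative. The function name `group_and_pick`, the greedy strategy, and the `threshold` value are illustrative assumptions, not the app's actual logic.

```python
import numpy as np

def group_and_pick(embeddings, qualities, threshold=0.8):
    """Group images by cosine similarity, then pick the best image per group.

    embeddings: (N, D) array with L2-normalized rows (as the embedding head emits).
    qualities:  (N,) array of quality scores in [0, 1].
    threshold:  assumed similarity cut-off; would need tuning on real data.
    Returns a list of (member_indices, best_index) tuples.
    """
    embeddings = np.asarray(embeddings, dtype=np.float64)
    qualities = np.asarray(qualities, dtype=np.float64)
    groups = []  # each group is a list of image indices
    for i, emb in enumerate(embeddings):
        for members in groups:
            # For unit vectors the dot product equals cosine similarity.
            # Each image is compared against the group's first member only.
            if float(emb @ embeddings[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            groups.append([i])  # no similar group found, start a new one
    return [(m, max(m, key=lambda j: qualities[j])) for m in groups]
```

For example, three photos where the first two show the same garment would yield two groups, with the sharper of the first two chosen as its group's representative.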

2. Starting Model

Backbone: torchvision.models.mobilenet_v3_small (ImageNet-1K pretrained, 576-dim features)
Quality head: 576 -> 128 -> 1 (Sigmoid) - predicts [0, 1] quality score
Embedding head: 576 -> 256 -> 128 (L2-normalized) - for similarity matching
Total params: ~1.2M

3. Training Dataset

Source: ashraq/fashion-product-images-small (HuggingFace)
Synthetic quality labels: Computed from image properties (sharpness via Laplacian variance, brightness balance, contrast via std-dev)
Augmentation: Created degraded versions (blur, darkness, noise) with lower quality scores
Train: 3,000 samples (originals + degraded), Val: 500 samples
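
A minimal sketch of how such synthetic labels could be computed with NumPy alone. The reference values (`sharp_ref`, `contrast_ref`), the weights, and the function names are illustrative assumptions; the source names only the three image properties.

```python
import numpy as np

def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian; higher values mean sharper images."""
    g = gray.astype(np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())

def quality_score(gray, sharp_ref=500.0, contrast_ref=64.0,
                  weights=(0.5, 0.25, 0.25)):
    """Combine sharpness, brightness balance, and contrast into a [0, 1] label.

    gray: 2-D uint8 array (grayscale image).
    sharp_ref, contrast_ref, weights: assumed values, not from the source.
    """
    sharpness = min(laplacian_variance(gray) / sharp_ref, 1.0)
    # 1.0 when mean brightness is mid-gray, 0.0 when fully black or white
    brightness = 1.0 - abs(float(gray.mean()) / 255.0 - 0.5) * 2.0
    contrast = min(float(gray.std()) / contrast_ref, 1.0)
    w_s, w_b, w_c = weights
    return w_s * sharpness + w_b * brightness + w_c * contrast
```

With these assumed settings, a flat, dark image scores close to 0 while a well-exposed, detailed image scores close to 1.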

4. Validation / Testing

Primary metric: Spearman rank correlation between predicted quality scores and the synthetic label scores
Secondary metric: Mean Absolute Error (MAE) of quality predictions
Spearman correlation: 0.9972
MAE: 0.0091
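
For reference, both metrics can be computed as in the sketch below, which uses plain NumPy and handles tied values with average ranks. The function names are illustrative; `scipy.stats.spearmanr` is a common alternative for the correlation.

```python
import numpy as np

def average_ranks(x):
    """1-based ranks; tied values share the mean of the ranks they span."""
    x = np.asarray(x, dtype=np.float64)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=np.float64)
    ranks[order] = np.arange(1, len(x) + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    sums = np.bincount(inverse, weights=ranks)
    return (sums / counts)[inverse]

def spearman(pred, target):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(pred), average_ranks(target))[0, 1])

def mean_absolute_error(pred, target):
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return float(np.abs(pred - target).mean())
```

Spearman measures whether images are ordered correctly, which is what matters for picking the best photo; MAE measures how far the predicted scores are from the labels in absolute terms.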

5. Baseline Performance

No prior quality scoring model existed in the app. Previously, the app used the first uploaded image regardless of its quality.

6. What Changed to Improve

Learned quality assessment instead of arbitrary first-image selection
Dual-head architecture shares one backbone between quality scoring and similarity matching
Synthetic quality labels based on measurable image properties
Degradation augmentation teaches the model to distinguish good from bad images
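
The degradation augmentation could look roughly like the sketch below, which produces blurred, darkened, or noisy copies of a grayscale image and lowers the label accordingly. The kernel size, darkening factor, noise level, and especially the `LABEL_SCALE` penalties are assumptions; the source says only that degraded versions received lower quality scores.

```python
import numpy as np

# Hypothetical label penalties per degradation type (not from the source).
LABEL_SCALE = {"blur": 0.5, "dark": 0.6, "noise": 0.7}

def box_blur(gray, k=5):
    """Simple k x k mean filter with edge padding."""
    pad = k // 2
    padded = np.pad(gray.astype(np.float64), pad, mode="edge")
    out = np.zeros(gray.shape, dtype=np.float64)
    h, w = gray.shape
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def degrade(gray, kind, rng=None):
    """Return a degraded uint8 copy of a 2-D uint8 image."""
    if kind == "blur":
        out = box_blur(gray)
    elif kind == "dark":
        out = gray.astype(np.float64) * 0.4
    elif kind == "noise":
        if rng is None:
            rng = np.random.default_rng()
        out = gray + rng.normal(0.0, 25.0, size=gray.shape)
    else:
        raise ValueError(f"unknown degradation: {kind}")
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

def degraded_sample(gray, label, kind, rng=None):
    """Pair a degraded image with a proportionally lowered quality label."""
    return degrade(gray, kind, rng=rng), label * LABEL_SCALE[kind]
```

An alternative to scaling the label would be to recompute it from the degraded image's own sharpness, brightness, and contrast.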

7. Training Progress

Step	Loss (MSE)	Learning rate
50	0.0090	9.72e-04
500	0.0024	3.56e-06
1000	0.0018	9.45e-04
1500	0.0014	1.22e-04
2000	0.0012	7.93e-04
2317 (final)	0.0011	8.11e-06

Model Files

model.onnx: Full precision ONNX (4.5 MB)
model_int8.onnx: INT8 quantized (1.3 MB) - for browser deployment
pytorch_model.pt: PyTorch state dict

Training Details

Hardware: NVIDIA GTX 1050 Ti (4GB VRAM)
Optimizer: AdamW (lr=1e-3, wd=0.01)
Loss: MSE on quality scores
Image size: 224x224
Batch size: 32
Training time: ~5 minutes
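
The optimizer and loss settings above correspond to a training loop along these lines. This is a sketch under assumptions: the model is taken to return a `(quality, embedding)` tuple, the function names are illustrative, and no learning-rate scheduler is shown because the source does not name one. Since the only loss listed is MSE on quality scores, the embedding output is left unused here.

```python
import torch
from torch import nn

def make_optimizer(model):
    """AdamW with the hyperparameters listed above."""
    return torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

def train_one_epoch(model, loader, optimizer, device="cpu"):
    """Run one pass over `loader`, which yields (images, quality_targets)."""
    criterion = nn.MSELoss()
    model.train()
    total_loss, n_seen = 0.0, 0
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        quality, _ = model(images)           # embedding output is ignored
        loss = criterion(quality, targets)   # MSE on quality scores
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * images.size(0)
        n_seen += images.size(0)
    return total_loss / n_seen               # mean loss per sample
```

In practice `loader` would be a `torch.utils.data.DataLoader` with `batch_size=32` over 224x224 image tensors.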
