NeoAraBERT-MSA-Synonym-Matryoshka-V1
The diacritics-aware Arabic sentence-embedding model. Built on top of U4RASD/NeoAraBERT_MSA — the only Arabic encoder backbone whose tokenizer natively represents diacritics — this model is fine-tuned to produce sentence embeddings that preserve lexical-synonym sensitivity in MSA and Classical Arabic.
🎯 Positioning. This is a complementary Arabic embedding model, not a universal replacement. Use it when diacritization matters (classical Arabic, religious texts, Arabic learning content, dictionary/thesaurus apps, lexical-synonym retrieval). For general Arabic STS / retrieval / RAG, models like Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2 currently score higher on STS17/STS22.
⚡ At a glance
| Benchmark | Metric | This model | Best AraBERT-vocab sentence transformer* | Δ |
|---|---|---|---|---|
| Muradif held-out (1,555 unseen synsets, 11,103 rows) | Accuracy | 80.09% | 30.98% (full set) | +49 pts |
| Muradif held-out | Mean cosine margin | 0.102 | ~0.012 | 8.5× |
| Muradif full test (38,554 rows, partly seen — see split note) | Accuracy | 80.64% | 30.98% | +49.7 pts |
| STS17 ar-ar | Spearman × 100 | 69.34 | ~85 | -16 |
| STS22-v2 ar | Spearman × 100 | 45.83 | ~64 | -18 |
* Arabic-Triplet-Matryoshka-V2, Arabert-all-nli-triplet-Matryoshka, GATE-AraBert-V1 — all collapse to ~30% on Muradif because their AraBERTv2 vocab fragments diacritized words into many unrelated subwords. NeoAraBERT_MSA preserves the diacritic signal.
✅ Where this model wins (real value)
- Muradif / synonym-in-context at 80% (vs 30% for AraBERT-vocab models). This is the only public Arabic sentence-transformer that handles diacritized synonym disambiguation well. If you're building MSA dictionary / thesaurus apps, Quran or Hadith retrieval, classical Arabic literature search, or any system where diacritics carry meaning, this gap is decisive.
- Diacritics-aware: the only Arabic embedding model on HF whose backbone tokenizer represents diacritics as first-class tokens. Other Arabic models destroy that signal by tokenization.
- First sentence-embedding head on the NeoAraBERT family — useful research baseline; fork-friendly.
- Matryoshka representation at dims [768, 512, 256, 128, 64] for storage / latency trade-offs.
⚠️ Where this model loses (be honest)
- General Arabic STS17 ar-ar: 69 vs the ~85 of Arabic-Triplet-Matryoshka-V2. For short-sentence general similarity, prefer the existing Arabic-Triplet-Matryoshka line.
- STS22-v2 ar: 46 vs ~64. Long-document news similarity also trails.
- Cross-lingual: not supported. STS17 en-ar Spearman ~18. NeoAraBERT_MSA was pre-trained on Arabic only — no English subspace.
- Dialectal Arabic untested. Built on the MSA variant. For dialectal use, prefer training on top of U4RASD/NeoAraBERT_DA or U4RASD/NeoAraBERT (Mix).
- Not a drop-in SentenceTransformer. The custom Arabic morph tokenizer doesn't match the modern AutoProcessor interface; you must encode via raw HF + mean pooling (snippet below). Requires xformers and fast-disambig.
- Muradif training partially overlaps test (70% of anchor synsets seen). See the "Train / eval split" section — held-out is the honest 80.09%; full-set 80.64% is reported for comparability with the NeoAraBERT paper.
Quick start
pip install torch transformers xformers fast-disambig
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
MODEL = "Omartificial-Intelligence-Space/NeoAraBERT-MSA-Synonym-Matryoshka-V1"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL, trust_remote_code=True).to(device).eval()
def mean_pool(hidden, mask):
m = mask.unsqueeze(-1).to(hidden.dtype)
return (hidden * m).sum(1) / m.sum(1).clamp(min=1e-9)
@torch.no_grad()
def encode(texts, batch_size=32, max_len=128):
embs = []
for i in range(0, len(texts), batch_size):
batch = [str(t) for t in texts[i:i+batch_size]]
enc = tokenizer(batch, padding=True, truncation=True,
max_length=max_len, return_tensors="pt").to(device)
out = model(**enc)
embs.append(mean_pool(out.last_hidden_state, enc["attention_mask"]).cpu())
return F.normalize(torch.cat(embs, dim=0), dim=-1)
sentences = ["صلاة الجمعة في المسجد", "الصلاة في الجامع", "السباحة في البحر"]
e = encode(sentences)
print((e @ e.T).numpy())
# anchor↔synonym similarity > anchor↔irrelevant
Matryoshka — using shorter embeddings
The model was trained with Matryoshka loss at [768, 512, 256, 128, 64]. Truncate L2-normalized embeddings to any of these sizes and re-normalize:
def truncate(emb, dim):
return F.normalize(emb[:, :dim], dim=-1)
e_64 = truncate(e, 64) # 12× smaller, still useful
e_256 = truncate(e, 256) # great trade-off
Phase-by-phase Arabic-only training progression
| Stage | Data added | STS17 ar-ar | STS22-v2 ar | Muradif (full) | Muradif margin |
|---|---|---|---|---|---|
| U4RASD/NeoAraBERT_MSA (raw, no FT, mean pool) | — | 25.33 | 52.75 | 70.32% | 0.006 |
| Phase 1 — silma triplets | 30K | 55.23 | 39.57 | 64.44% | 0.011 |
| Phase 2 — + Muradif synonym + Mix Triplet | 27K + 14K | 66.13 | 42.45 | 80.35% | 0.103 |
| Phase 4 (this checkpoint) — + Arabic-NLi 200K + MSMARCO-ar 60K + Muradif | 327K | 69.34 | 45.83 | 80.64% | 0.102 |
(A Phase 3 added cross-lingual Opus ar-en but hurt Arabic-only metrics, so it was dropped from the final lineage.)
Train / eval split — honest disclosure
The Muradif dataset is published as a single test split (38,554 rows). To train on Muradif-style synonym triplets while still measuring honest generalization, we split by anchor_word (synset), not by row:
| Split | Unique anchor synsets | Triplets | Used for |
|---|---|---|---|
| Train | 3,631 (70%) | 27,451 | Fine-tuning loss (Phase 2 + Phase 4) |
| Held-out | 1,555 (30%) | 11,103 | Dev metric during training (never in train loss) |
This means 70% of the anchor synsets and their 27,451 rows were seen during training.
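The synset-level split described above can be reproduced in a few lines of plain Python. A minimal sketch, assuming each triplet row is a dict with an `anchor_word` field (the field name is illustrative, not necessarily the published dataset schema):

```python
import random

def split_by_synset(rows, train_frac=0.70, seed=42):
    """Split triplet rows by anchor synset, not by row, so that no
    held-out anchor_word ever contributes to the training loss."""
    synsets = sorted({r["anchor_word"] for r in rows})
    rng = random.Random(seed)
    rng.shuffle(synsets)
    cut = int(len(synsets) * train_frac)
    train_synsets = set(synsets[:cut])
    train = [r for r in rows if r["anchor_word"] in train_synsets]
    held_out = [r for r in rows if r["anchor_word"] not in train_synsets]
    return train, held_out

# Toy data: 10 synsets with 3 context rows each.
rows = [{"anchor_word": f"w{i}", "text": f"row{i}-{j}"}
        for i in range(10) for j in range(3)]
train, held = split_by_synset(rows)
# No synset leaks across the split.
assert {r["anchor_word"] for r in train}.isdisjoint(
    {r["anchor_word"] for r in held})
```

Splitting by synset rather than by row is what makes the held-out 80.09% an honest generalization number: a row-level split would leave every anchor word visible during training.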
| Eval scope | Phase 2 | Phase 4 (this) |
|---|---|---|
| Held-out 11,103 rows (anchor synsets never seen) | 79.73% | 80.09% |
| Full 38,554 rows (includes the 27,451 trained-on rows) | 80.35% | 80.64% |
| Δ between full and held-out | +0.62 | +0.55 |
The full-set score is only ~0.5 points above the held-out score. A memorizing model would show a much larger gap (5-15 points). This 0.5-pt gap means the model generalized the synonym signal — it learned to recognize lexical synonymy, not just memorize specific (context, anchor) pairs.
The honest headline metric is 80.09% (held-out, unseen synsets). The 80.64% full-set number is included for direct comparability with the NeoAraBERT paper's protocol (which reports 86.32% for the raw checkpoint on the full test set).
Why we couldn't fully avoid this overlap
Muradif is built from the Arabic Ontology (SinaLab). The natural alternative — using Arabic Ontology directly to build training triplets — would still risk leakage because the underlying synonym sets are the same. There's no public, separately-curated Arabic-synonym-in-context corpus large enough at training scale. The synset-level split is the cleanest currently feasible approach.
vs other Arabic sentence-transformer models on Muradif (zero-shot, same pipeline)
| Model | Muradif Acc (full) | Margin |
|---|---|---|
| Arabic-Triplet-Matryoshka-V2 | 30.50% | 0.012 |
| Arabert-all-nli-triplet-Matryoshka | 29.68% | 0.009 |
| GATE-AraBert-V1 | 30.98% | 0.012 |
| Raw NeoAraBERT_MSA (mean pool, in-house) | 70.32% | 0.006 |
| NeoAraBERT-MSA-Synonym-Matryoshka-V1 | 80.64% (full) / 80.09% (held-out) | 0.102 |
The 49-point gap to AraBERT-vocab models is structural, not a training advantage: their tokenizer fragments diacritized synonyms into unrelated subwords. No fine-tuning recipe will close that gap on those backbones — it's a vocabulary problem.
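The two Muradif metrics used throughout this card (triplet accuracy and mean cosine margin) are straightforward to compute from anchor/positive/negative embeddings. A minimal sketch; the toy data below is synthetic, not Muradif:

```python
import torch
import torch.nn.functional as F

def muradif_metrics(anchor, positive, negative):
    """Triplet accuracy: fraction of rows where cos(a, p) > cos(a, n).
    Mean cosine margin: average of cos(a, p) - cos(a, n)."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negative, dim=-1)
    pos_sim = (a * p).sum(-1)
    neg_sim = (a * n).sum(-1)
    margin = pos_sim - neg_sim
    return (margin > 0).float().mean().item(), margin.mean().item()

# Toy check: positives near their anchors, negatives unrelated.
torch.manual_seed(0)
a = torch.randn(100, 768)
p = a + 0.1 * torch.randn_like(a)   # near-duplicates of the anchors
n = torch.randn(100, 768)           # random, unrelated vectors
acc, margin = muradif_metrics(a, p, n)
print(acc, margin)  # positives win for every row here
```

By this definition, a 0.102 margin means the positive synonym sits on average 0.102 cosine points closer to the anchor than the distractor — roughly 8.5× the separation the AraBERT-vocab models achieve.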
Training recipe
Four progressive phases on top of U4RASD/NeoAraBERT_MSA. Each phase initializes from the previous best checkpoint.
Loss
- MultipleNegativesRankingLoss (MNR) with in-batch negatives + hard negatives, scale = 20
- MatryoshkaLoss averaged over dims [768, 512, 256, 128, 64]
- Mean pooling, L2-normalized cosine
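The loss above can be sketched in plain PyTorch. This is an illustration, not the actual training code: hard negatives are omitted and only in-batch negatives are shown. MNR scores each anchor against every positive in the batch with scaled cosine similarity (the true pair is the softmax target), and the Matryoshka wrapper averages that loss over truncated embedding prefixes:

```python
import torch
import torch.nn.functional as F

MATRYOSHKA_DIMS = [768, 512, 256, 128, 64]
SCALE = 20.0  # cosine-similarity temperature used by MNR

def mnr_loss(anchors, positives, scale=SCALE):
    """MultipleNegativesRankingLoss: every other positive in the
    batch serves as an in-batch negative for a given anchor."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = scale * (a @ p.T)            # [B, B] scaled cosine sims
    labels = torch.arange(a.size(0))      # diagonal = true pair
    return F.cross_entropy(logits, labels)

def matryoshka_mnr_loss(anchors, positives, dims=MATRYOSHKA_DIMS):
    """Average the MNR loss over truncated embedding prefixes."""
    losses = [mnr_loss(anchors[:, :d], positives[:, :d]) for d in dims]
    return torch.stack(losses).mean()

# Toy batch of 8 anchor/positive pairs.
torch.manual_seed(0)
a = torch.randn(8, 768, requires_grad=True)
p = a.detach() + 0.05 * torch.randn(8, 768)  # positives near anchors
loss = matryoshka_mnr_loss(a, p)
loss.backward()
print(loss.item())
```

Averaging across prefixes is what lets the published checkpoint be truncated to 64 dims and still rank synonyms correctly.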
Optimizer
- AdamW (β₁=0.9, β₂=0.95, ε=1e-8, weight decay 0.01)
- Cosine LR schedule with 10% warmup
- bf16 mixed precision via torch.autocast
- Gradient clipping at 1.0
- Max sequence length 128
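The optimizer settings above map onto standard PyTorch components. A minimal sketch with a placeholder model; the real encoder, dataloader, and training loop are omitted:

```python
import math
import torch

model = torch.nn.Linear(768, 768)  # placeholder for the actual encoder
total_steps, warmup_steps = 5117, int(0.10 * 5117)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.01,
)

def lr_lambda(step):
    # Linear warmup for the first 10% of steps, then cosine decay to 0.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# One illustrative step: bf16 autocast forward, clip grads at 1.0.
x = torch.randn(4, 768)
with torch.autocast("cpu", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
print(optimizer.param_groups[0]["lr"])
```

On GPU, the autocast device string would be "cuda" instead of "cpu"; everything else is identical.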
Phase 4 hyperparameters (final checkpoint)
| Setting | Value |
|---|---|
| Init from | Phase 2 best |
| LR | 1e-5 |
| Batch size | 64 |
| Epochs | 1 |
| Steps | 5,117 |
| Hardware | 1 × NVIDIA RTX 5090 (32 GB) |
| Wall-clock | ~30 min |
Limitations
- Single-vector contrastive model. Pushing Muradif beyond ~80-85% likely requires token-level losses or late-interaction (ColBERT-style) approaches.
- Custom dependencies. Requires xformers (NeoBERT attention) and fast-disambig (Arabic morphological tokenizer). Inference works on CPU but is significantly slower; GPU recommended.
- Numerical sensitivity. Trained in bf16 autocast; for absolute reproducibility of evaluation numbers, encode with the same protocol used here (mean pool over last_hidden_state × attention_mask, L2-normalized cosine).
- Cannot be loaded via SentenceTransformer(...) directly because the custom Arabic morphological tokenizer doesn't match the AutoProcessor interface in current sentence-transformers versions. Use the raw HF + mean-pool snippet above.
- Muradif train/test partial overlap (70% of synsets). See the "Train / eval split" section. Use the held-out 80.09% as the honest comparable number.
Acknowledgments
Built on top of NeoAraBERT_MSA by the Arab Center for Research and Policy Studies (ACRPS) Unit for Research In Arabic Social and Digital Spaces (U4RASD) and the American University of Beirut.
Citation
If you use this model, please cite the NeoAraBERT paper (the backbone):
@inproceedings{abou-chakra-etal-2026-neoarabert,
title = "{NeoAraBERT}: A Modern Foundation Model for Arabic Embeddings with
Diacritics-Aware Tokenization and POS-Targeted Masking",
author = "Abou Chakra, Chadi and Hamoud, Hadi and Rakan Al Mraikhat, Osama and
Abu Obaida, Qusai and Ballout, Mohamad and Zaraket, Fadi A.",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2026",
year = "2026",
url = "https://acr.ps/neoarabert"
}
And reference this fine-tuned variant:
@misc{neoarabert-msa-synonym-matryoshka-v1,
title = {NeoAraBERT-MSA-Synonym-Matryoshka-V1: A Diacritics-Aware Arabic Sentence-Embedding Model with Synonym Sensitivity},
author = {Omartificial-Intelligence-Space},
year = {2026},
url = {https://huggingface.co/Omartificial-Intelligence-Space/NeoAraBERT-MSA-Synonym-Matryoshka-V1}
}
License
CC BY-SA 4.0 — matching the upstream NeoAraBERT_MSA license.