NeoAraBERT-MSA-Synonym-Matryoshka-V1

A diacritics-aware Arabic sentence-embedding model. Built on top of U4RASD/NeoAraBERT_MSA — the only Arabic encoder backbone whose tokenizer natively represents diacritics — it is fine-tuned to produce sentence embeddings that preserve lexical-synonym sensitivity in MSA and Classical Arabic.

🎯 Positioning. This is a complementary Arabic embedding model, not a universal replacement. Use it when diacritization matters (classical Arabic, religious texts, Arabic learning content, dictionary/thesaurus apps, lexical-synonym retrieval). For general Arabic STS / retrieval / RAG, models like Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2 currently score higher on STS17/STS22.


⚡ At a glance

| Benchmark | Metric | This model | Best AraBERT-vocab sentence transformer* | Δ |
|---|---|---|---|---|
| Muradif held-out (1,555 unseen synsets, 11,103 rows) | Accuracy | 80.09% | 30.98% (full set) | +49 pts |
| Muradif held-out | Mean cosine margin | 0.102 | ~0.012 | 8.5× |
| Muradif full test (38,554 rows, partly seen — see split note) | Accuracy | 80.64% | 30.98% | +49.7 pts |
| STS17 ar-ar | Spearman × 100 | 69.34 | ~85 | -16 |
| STS22-v2 ar | Spearman × 100 | 45.83 | ~64 | -18 |

* Arabic-Triplet-Matryoshka-V2, Arabert-all-nli-triplet-Matryoshka, GATE-AraBert-V1 — all collapse to ~30% on Muradif because their AraBERTv2 vocab fragments diacritized words into many unrelated subwords. NeoAraBERT_MSA preserves the diacritic signal.


✅ Where this model wins (real value)

  1. Muradif / synonym-in-context at 80% (vs 30% for AraBERT-vocab models). This is the only public Arabic sentence-transformer that handles diacritized synonym disambiguation well. If you're building MSA dictionary / thesaurus apps, Quran or Hadith retrieval, classical Arabic literature search, or any system where diacritics carry meaning, this gap is decisive.
  2. Diacritics-aware: the only Arabic embedding model on HF whose backbone tokenizer represents diacritics as first-class tokens. Other Arabic models destroy that signal at tokenization time.
  3. First sentence-embedding head on the NeoAraBERT family — useful research baseline; fork-friendly.
  4. Matryoshka representation at dims [768, 512, 256, 128, 64] for storage / latency trade-offs.

⚠️ Where this model loses (be honest)

  1. General Arabic STS17 ar-ar: 69 vs the ~85 of Arabic-Triplet-Matryoshka-V2. For short-sentence general similarity, prefer the existing Arabic-Triplet-Matryoshka line.
  2. STS22-v2 ar: 46 vs ~64. Long-document news similarity also trails.
  3. Cross-lingual: not supported. STS17 en-ar Spearman ~18. NeoAraBERT_MSA was pre-trained on Arabic only — no English subspace.
  4. Dialectal Arabic untested. Built on the MSA variant. For dialectal use, prefer training on top of U4RASD/NeoAraBERT_DA or U4RASD/NeoAraBERT (Mix).
  5. Not a drop-in SentenceTransformer. The custom Arabic morphological tokenizer doesn't match the modern AutoProcessor interface, so you must encode via raw HF + mean pooling (snippet below). Requires xformers and fast-disambig.
  6. Muradif training partially overlaps test (70% of anchor synsets seen). See "Train/eval split" section — held-out is the honest 80.09%; full-set 80.64% is reported for comparability with the NeoAraBERT paper.

Quick start

pip install torch transformers xformers fast-disambig

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL = "Omartificial-Intelligence-Space/NeoAraBERT-MSA-Synonym-Matryoshka-V1"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL, trust_remote_code=True).to(device).eval()

def mean_pool(hidden, mask):
    m = mask.unsqueeze(-1).to(hidden.dtype)
    return (hidden * m).sum(1) / m.sum(1).clamp(min=1e-9)

@torch.no_grad()
def encode(texts, batch_size=32, max_len=128):
    embs = []
    for i in range(0, len(texts), batch_size):
        batch = [str(t) for t in texts[i:i+batch_size]]
        enc = tokenizer(batch, padding=True, truncation=True,
                        max_length=max_len, return_tensors="pt").to(device)
        out = model(**enc)
        embs.append(mean_pool(out.last_hidden_state, enc["attention_mask"]).cpu())
    return F.normalize(torch.cat(embs, dim=0), dim=-1)

sentences = ["صلاة الجمعة في المسجد", "الصلاة في الجامع", "السباحة في البحر"]
e = encode(sentences)
print((e @ e.T).numpy())
# anchor↔synonym similarity > anchor↔irrelevant
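
As a quick sanity check on the snippet above, the synonym pair should score higher than the unrelated pair:

sim = e @ e.T
# sentences 0 and 1 both describe prayer at the mosque; sentence 2 is about swimming in the sea
assert sim[0, 1] > sim[0, 2]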

Matryoshka — using shorter embeddings

The model was trained with Matryoshka loss at [768, 512, 256, 128, 64]. Truncate L2-normalized embeddings to any of these sizes and re-normalize:

def truncate(emb, dim):
    return F.normalize(emb[:, :dim], dim=-1)

e_64  = truncate(e, 64)   # 12× smaller, still useful
e_256 = truncate(e, 256)  # great trade-off
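
One way to use the smaller dims (an illustrative sketch, not an API shipped with this model): search a corpus with the 64-d prefixes first, then re-rank the shortlist with the full 768-d vectors. query_emb and corpus_emb below are L2-normalized outputs of encode() from the quick start.

def coarse_then_rerank(query_emb, corpus_emb, top_k=100, coarse_dim=64):
    """Coarse retrieval on a truncated prefix, exact re-ranking on full vectors."""
    q64, c64 = truncate(query_emb, coarse_dim), truncate(corpus_emb, coarse_dim)
    coarse = (q64 @ c64.T).squeeze(0)                        # (corpus_size,)
    shortlist = coarse.topk(min(top_k, corpus_emb.size(0))).indices
    full = (query_emb @ corpus_emb[shortlist].T).squeeze(0)  # exact cosine on the shortlist
    order = full.argsort(descending=True)
    return shortlist[order], full[order]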

Phase-by-phase Arabic-only training progression

| Stage | Data added | STS17 ar-ar | STS22-v2 ar | Muradif (full) | Muradif margin |
|---|---|---|---|---|---|
| U4RASD/NeoAraBERT_MSA (raw, no FT, mean pool) | — | 25.33 | 52.75 | 70.32% | 0.006 |
| Phase 1 | silma triplets (30K) | 55.23 | 39.57 | 64.44% | 0.011 |
| Phase 2 | + Muradif synonym + Mix Triplet (27K + 14K) | 66.13 | 42.45 | 80.35% | 0.103 |
| Phase 4 (this checkpoint) | + Arabic-NLi 200K + MSMARCO-ar 60K + Muradif (327K total) | 69.34 | 45.83 | 80.64% | 0.102 |

(A Phase 3 added cross-lingual Opus ar-en data, but it hurt Arabic-only metrics and was dropped from the final lineage.)


Train / eval split — honest disclosure

The Muradif dataset is published as a single test split (38,554 rows). To train on Muradif-style synonym triplets while still measuring honest generalization, we split by anchor_word (synset), not by row:

| Split | Unique anchor synsets | Triplets | Used for |
|---|---|---|---|
| Train | 3,631 (70%) | 27,451 | Fine-tuning loss (Phase 2 + Phase 4) |
| Held-out | 1,555 (30%) | 11,103 | Dev metric during training (never in train loss) |

This means 70% of the anchor synsets and their 27,451 rows were seen during training.
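
For reference, a minimal sketch of this synset-level split (the anchor_word key follows the description above; the seed and the rest of the code are illustrative, not the exact script used for this release):

import random
from collections import defaultdict

def split_by_synset(rows, held_out_frac=0.30, seed=42):
    """Group Muradif triplets by anchor synset, then split at the synset level
    so no held-out anchor_word contributes to the training loss."""
    by_synset = defaultdict(list)
    for row in rows:
        by_synset[row["anchor_word"]].append(row)
    synsets = sorted(by_synset)
    random.Random(seed).shuffle(synsets)
    held = set(synsets[:int(len(synsets) * held_out_frac)])
    train_rows = [r for s, rs in by_synset.items() if s not in held for r in rs]
    held_rows = [r for s, rs in by_synset.items() if s in held for r in rs]
    return train_rows, held_rows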

| Eval scope | Phase 2 | Phase 4 (this) |
|---|---|---|
| Held-out: 11,103 rows (anchor synsets never seen) | 79.73% | 80.09% |
| Full: 38,554 rows (includes the 27,451 trained-on rows) | 80.35% | 80.64% |
| Δ between full and held-out | +0.62 | +0.55 |

The full-set score is only ~0.5 points above the held-out score. A memorizing model would show a much larger gap (5-15 points). This 0.5-pt gap means the model generalized the synonym signal — it learned to recognize lexical synonymy, not just memorize specific (context, anchor) pairs.

The honest headline metric is 80.09% (held-out, unseen synsets). The 80.64% full-set number is included for direct comparability with the NeoAraBERT paper's protocol (which reports 86.32% for the raw checkpoint on the full test set).

Why we couldn't fully avoid this overlap

Muradif is built from the Arabic Ontology (SinaLab). The natural alternative — using Arabic Ontology directly to build training triplets — would still risk leakage because the underlying synonym sets are the same. There's no public, separately-curated Arabic-synonym-in-context corpus large enough at training scale. The synset-level split is the cleanest currently feasible approach.


Comparison with other Arabic sentence-transformer models on Muradif (zero-shot baselines, same encoding pipeline)

| Model | Muradif Acc (full) | Margin |
|---|---|---|
| Arabic-Triplet-Matryoshka-V2 | 30.50% | 0.012 |
| Arabert-all-nli-triplet-Matryoshka | 29.68% | 0.009 |
| GATE-AraBert-V1 | 30.98% | 0.012 |
| Raw NeoAraBERT_MSA (mean pool, in-house) | 70.32% | 0.006 |
| NeoAraBERT-MSA-Synonym-Matryoshka-V1 | 80.64% (full) / 80.09% (held-out) | 0.102 |

The 49-point gap to AraBERT-vocab models is structural, not a training advantage: their tokenizer fragments diacritized synonyms into unrelated subwords. No fine-tuning recipe will close that gap on those backbones — it's a vocabulary problem.
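
For reproducibility, the accuracy and margin numbers in this card follow a triplet protocol of roughly this shape (a sketch: it assumes each Muradif row provides (anchor, synonym, irrelevant) sentences and reuses encode() from the quick start; the exact evaluation script is not bundled with this card):

import torch

@torch.no_grad()
def muradif_metrics(triplets, encode_fn):
    """triplets: list of (anchor, positive, negative) strings.
    Accuracy = share of rows where cos(anchor, positive) > cos(anchor, negative).
    Margin   = mean of cos(anchor, positive) - cos(anchor, negative)."""
    a = encode_fn([t[0] for t in triplets])
    p = encode_fn([t[1] for t in triplets])
    n = encode_fn([t[2] for t in triplets])
    pos = (a * p).sum(-1)   # embeddings are already L2-normalized, so dot product = cosine
    neg = (a * n).sum(-1)
    return (pos > neg).float().mean().item(), (pos - neg).mean().item()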


Training recipe

Progressive fine-tuning phases on top of U4RASD/NeoAraBERT_MSA, each initialized from the best checkpoint of the previous retained phase (the final lineage is Phase 1 → Phase 2 → Phase 4; Phase 3 was dropped, see above).

Loss

  • MultipleNegativesRankingLoss (MNR) with in-batch negatives + hard negatives, scale = 20
  • MatryoshkaLoss averaged over dims [768, 512, 256, 128, 64]
  • Mean pooling, L2-normalized cosine
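
A minimal PyTorch sketch of this objective (the training code itself is not released with this card; the scale, dims, and in-batch-negative setup follow the bullets above, everything else is illustrative):

import torch
import torch.nn.functional as F

def mnr_matryoshka_loss(anchor_emb, positive_emb, scale=20.0,
                        dims=(768, 512, 256, 128, 64)):
    """MultipleNegativesRankingLoss with in-batch negatives, averaged over
    Matryoshka prefix sizes. Row i of positive_emb is the positive for row i
    of anchor_emb; every other row in the batch serves as a negative.
    Hard negatives, if available, can be appended as extra logit columns."""
    labels = torch.arange(anchor_emb.size(0), device=anchor_emb.device)
    losses = []
    for d in dims:
        a = F.normalize(anchor_emb[:, :d], dim=-1)
        p = F.normalize(positive_emb[:, :d], dim=-1)
        logits = scale * a @ p.T              # temperature-scaled cosine similarities
        losses.append(F.cross_entropy(logits, labels))
    return torch.stack(losses).mean()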

Optimizer

  • AdamW (β₁=0.9, β₂=0.95, ε=1e-8, weight decay 0.01)
  • Cosine LR schedule with 10% warmup
  • bf16 mixed precision via torch.autocast
  • Gradient clipping at 1.0
  • Max sequence length 128
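
The corresponding optimizer and schedule setup, again as an illustrative sketch rather than the released training script (model and num_training_steps are placeholders):

import torch
from transformers import get_cosine_schedule_with_warmup

def build_optimizer(model, lr=1e-5, num_training_steps=5117):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  betas=(0.9, 0.95), eps=1e-8, weight_decay=0.01)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * num_training_steps),   # 10% warmup
        num_training_steps=num_training_steps)
    return optimizer, scheduler

# per step: bf16 autocast, backward, clip at 1.0, then optimizer + scheduler step
#   with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
#       loss = mnr_matryoshka_loss(anchor_emb, positive_emb)
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()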

Phase 4 hyperparameters (final checkpoint)

  • Init: Phase 2 best checkpoint
  • Learning rate: 1e-5
  • Batch size: 64
  • Epochs: 1
  • Steps: 5,117
  • Hardware: 1 × NVIDIA RTX 5090 (32 GB)
  • Wall-clock: ~30 min

Limitations

  • Single-vector contrastive model. Pushing Muradif beyond ~80-85% likely requires token-level losses or late-interaction (ColBERT-style) approaches.
  • Custom dependencies. Requires xformers (NeoBERT attention) and fast-disambig (Arabic morphological tokenizer). Inference works on CPU but is significantly slower; GPU recommended.
  • Numerical sensitivity. Trained in bf16 autocast; for absolute reproducibility of evaluation numbers, encode with the same protocol used here (mean pool over last_hidden_state × attention_mask, L2-normalized cosine).
  • Cannot be loaded via SentenceTransformer(...) directly because the custom Arabic morphological tokenizer doesn't match the AutoProcessor interface in current sentence-transformers versions. Use the raw HF + mean-pool snippet above.
  • Muradif train/test partial overlap (70% of synsets). See "Train / eval split" section. Use the held-out 80.09% as the honest comparable number.

Acknowledgments

Built on top of NeoAraBERT_MSA by the Arab Center for Research and Policy Studies (ACRPS) Unit for Research In Arabic Social and Digital Spaces (U4RASD) and the American University of Beirut.

Citation

If you use this model, please cite the NeoAraBERT paper (the backbone):

@inproceedings{abou-chakra-etal-2026-neoarabert,
  title  = "{NeoAraBERT}: A Modern Foundation Model for Arabic Embeddings with
            Diacritics-Aware Tokenization and POS-Targeted Masking",
  author = "Abou Chakra, Chadi and Hamoud, Hadi and Rakan Al Mraikhat, Osama and
            Abu Obaida, Qusai and Ballout, Mohamad and Zaraket, Fadi A.",
  booktitle = "Findings of the Association for Computational Linguistics: ACL 2026",
  year   = "2026",
  url    = "https://acr.ps/neoarabert"
}

And reference this fine-tuned variant:

@misc{neoarabert-msa-synonym-matryoshka-v1,
  title  = {NeoAraBERT-MSA-Synonym-Matryoshka-V1: A Diacritics-Aware Arabic Sentence-Embedding Model with Synonym Sensitivity},
  author = {Omartificial-Intelligence-Space},
  year   = {2026},
  url    = {https://huggingface.co/Omartificial-Intelligence-Space/NeoAraBERT-MSA-Synonym-Matryoshka-V1}
}

License

CC BY-SA 4.0 — matching the upstream NeoAraBERT_MSA license.
