Euler-Legal-Embedding-V1

Short Description

Euler-Legal-Embedding-V1 is a specialized embedding model for the legal domain, fine-tuned from Qwen/Qwen3-Embedding-8B. It performs strongly on the legal retrieval and reasoning tasks in the MTEB benchmark.

Model Details

Base Model: Qwen/Qwen3-Embedding-8B
Model Size: ~8B
Embedding Dimension: 4096 (default for Qwen3-Embedding-8B)
Max Input Tokens: 1536
Pooling: Last-token pooling (standard for Qwen3-Embedding)
Training Data: Legal-domain dataset (final-data-new-anonymized-grok4-filtered.jsonl)

Usage

sentence-transformers support

The model works with sentence-transformers once the library is installed:

pip install -U sentence-transformers

You can use the model like this:

from sentence_transformers import SentenceTransformer
import torch

# Load the model
# trust_remote_code=True may be unnecessary on transformers >= 4.51, which supports Qwen3 natively
model = SentenceTransformer(
    "Mira190/Euler-Legal-Embedding-V1",
    trust_remote_code=True,
    model_kwargs={
        "torch_dtype": torch.bfloat16,
        "attn_implementation": "flash_attention_2",  # Optional, requires flash-attn installed
    },
)

model.max_seq_length = 1536

sentences = [
    "The plaintiff filed a motion for summary judgment.",
    "The court granted the motion based on lack of genuine dispute of material fact."
]

# No specific prompt is required for this version
embeddings = model.encode(
    sentences,
    normalize_embeddings=True,
    batch_size=16,
    show_progress_bar=True,
)

print(embeddings.shape)
# Output: (2, 4096)
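
Because `normalize_embeddings=True` returns unit-length vectors, the dot product of two embeddings equals their cosine similarity, so retrieval reduces to sorting dot products. The sketch below shows this on small hand-written vectors standing in for real embeddings. `l2_normalize` and `rank_documents` are hypothetical helpers written for illustration, not part of sentence-transformers.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (the effect of normalize_embeddings=True)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def rank_documents(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted by descending cosine similarity.

    Assumes every vector is already L2-normalized, so dot product == cosine.
    """
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional vectors standing in for 4096-dimensional embeddings.
query = l2_normalize([1.0, 0.0, 1.0])
docs = [
    l2_normalize([0.0, 1.0, 0.0]),  # orthogonal to the query
    l2_normalize([1.0, 0.0, 0.9]),  # nearly parallel to the query
    l2_normalize([1.0, 1.0, 1.0]),  # partially aligned
]

ranking = rank_documents(query, docs)
print([index for index, _ in ranking])  # [1, 2, 0]
```

With real embeddings, a single matrix product between the query vector and the array returned by `model.encode` computes the same scores in one step.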

Transformers support

You can also use the model directly with the transformers library:

import torch
from transformers import AutoModel, AutoTokenizer

model_id = "Mira190/Euler-Legal-Embedding-V1"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"  # requires the accelerate package
)

sentences = ["This is a legal document.", "This is another legal document."]

# Tokenize sentences
inputs = tokenizer(
    sentences, 
    return_tensors="pt", 
    padding=True, 
    truncation=True, 
    max_length=1536
)

# Move inputs to the same device as the model
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)
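    hidden = outputs.last_hidden_state
    # Last-token pooling: pick each sequence's final real token via the attention mask,
    # since plain [:, -1] is only correct for left-padded batches.
    mask = inputs["attention_mask"].to(hidden.device)
    last_idx = mask.shape[1] - 1 - mask.flip(dims=[1]).argmax(dim=1)
    embeddings = hidden[torch.arange(hidden.shape[0], device=hidden.device), last_idx]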
    
    # Normalize embeddings
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

print(embeddings.shape)
# Output: torch.Size([2, 4096])
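
Last-token pooling has to pick the final real token of each sequence, and where that token sits depends on the tokenizer's padding side. The framework-free sketch below illustrates the index computation on toy attention masks. `last_token_index` is a hypothetical helper and the masks are made up for illustration.

```python
def last_token_index(attention_mask):
    """Return the position of the last real (mask == 1) token in one sequence."""
    for position in range(len(attention_mask) - 1, -1, -1):
        if attention_mask[position] == 1:
            return position
    raise ValueError("attention mask contains no real tokens")

# The same 3-token sequence padded to length 5 in two ways.
right_padded = [1, 1, 1, 0, 0]  # real tokens first, padding at the end
left_padded = [0, 0, 1, 1, 1]   # padding first, real tokens at the end

print(last_token_index(right_padded))  # 2: position -1 would be a padding token
print(last_token_index(left_padded))   # 4: position -1 is the last real token
```

Taking position -1 for every row is therefore only safe when the batch is left-padded or contains no padding at all.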

Training Details

The model was fine-tuned using LoRA (Low-Rank Adaptation) via the Swift framework.

Framework: Swift
Loss Function: InfoNCE (Temperature: 0.03)
Batch Size: 4 (per device)
Learning Rate: 2e-5
LoRA Config: Rank 8, Alpha 32, Dropout 0.05
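
To make the loss setting concrete, here is a minimal pure-Python sketch of InfoNCE with in-batch negatives at temperature 0.03. It illustrates the general technique and is not the Swift implementation; the function name `info_nce` and the toy similarity matrix are assumptions.

```python
import math

def info_nce(similarities, temperature=0.03):
    """Mean InfoNCE loss over a batch.

    similarities[i][j] is the cosine similarity between query i and document j.
    The positive document for query i is assumed to sit at index i; every other
    document in the row acts as a negative.
    """
    losses = []
    for i, row in enumerate(similarities):
        logits = [score / temperature for score in row]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(value - peak) for value in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy batch of 2 queries: positives on the diagonal score higher than negatives.
sims = [
    [0.9, 0.3],
    [0.2, 0.8],
]
loss_sharp = info_nce(sims, temperature=0.03)
loss_soft = info_nce(sims, temperature=1.0)
print(loss_sharp < loss_soft)  # True
```

A low temperature sharpens the softmax, so pairs that are already ranked correctly contribute almost no loss and the hardest negatives dominate the training signal.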

Citation

If you find this model useful, please consider citing:

@misc{euler2025legal,
      title={Euler-Legal-Embedding: Advanced Legal Representation Learning}, 
      author={LawRank Team},
      year={2025},
      publisher={Hugging Face}
}
