Gemma 4 E4B Legal GRPO — LoRA Adapter

A LoRA adapter that fine-tunes Google's Gemma 4 E4B for legal analysis using GRPO (Group Relative Policy Optimization) reinforcement learning.

Model Details

| Property | Value |
|---|---|
| Base model | google/gemma-4-E4B-it (4B effective params) |
| Quantized base | unsloth/gemma-4-E4B-it-unsloth-bnb-4bit |
| Method | GRPO (RL-based alignment) |
| LoRA rank | 16 (alpha=16, RSLoRA) |
| Trainable params | 42.4M (0.53% of base) |
| Adapter tensors | 588 (language_model only) |
| Adapter size | 140 MB |
| Training hardware | NVIDIA RTX PRO 6000 Blackwell (95 GB, Colab G4) |
| Training time | ~1.3 hours (250 steps, 1000 prompts) |
| License | Apache 2.0 |

Model Description

Legal-domain LoRA adapter trained with GRPO reinforcement learning on 45K+ legal/reasoning examples. Optimized for legal document analysis, statute interpretation, evidence classification, and case law reasoning with proper Bluebook citations.

  • Developed by: Semaj90
  • Model type: PEFT LoRA adapter (text generation)
  • Language: English
  • License: Apache 2.0
  • Fine-tuned from: google/gemma-4-E4B-it

How to Get Started

With Unsloth (recommended)

from unsloth import FastVisionModel
from peft import PeftModel

model, tokenizer = FastVisionModel.from_pretrained(
    model_name="unsloth/gemma-4-E4B-it-unsloth-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)
model = PeftModel.from_pretrained(model, "Semaj90/gemma4-e4b-legal-grpo")
FastVisionModel.for_inference(model)

messages = [{"role": "user", "content": [{"type": "text", "text": "Analyze 42 U.S.C. Section 1983"}]}]
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
output = model.generate(input_ids=inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))

With Ollama (GGUF)

# See: Semaj90/gemma4-e4b-legal-grpo-GGUF
ollama create gemma4-legal:latest -f Modelfile
ollama run gemma4-legal:latest "What are the elements of negligence?"
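For reference, a minimal Modelfile of the kind used above might look like the following sketch. The GGUF filename, parameter values, and system prompt are illustrative assumptions, not the actual contents of the GGUF repo:

```
FROM ./gemma4-e4b-legal-grpo.Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM You are a legal analysis assistant. Cite statutes and case law in Bluebook format.
```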

Uses

Direct Use

  • Legal document analysis and case law research
  • Evidence classification and chain of custody assessment
  • Statute interpretation (U.S.C., CFR, state codes)
  • Case law reasoning and precedent analysis
  • Contract review and liability assessment
  • Legal AI chatbot / RAG pipeline augmentation

Out-of-Scope Use

  • Not a substitute for professional legal advice
  • Not designed for jurisdictions outside U.S. law
  • Should not be used for automated legal decision-making without human review

Training Details

Training Data

| Dataset | Samples | Purpose |
|---|---|---|
| FineTome-100k | 10,000 | General instruction following |
| GSM8K | 5,000 | Mathematical reasoning |
| Pile of Law | 20,000 | Legal text corpus |
| LexGLUE CaseHold | 5,000 | Legal reasoning / holdings |
| LexGLUE SCOTUS | 5,000 | Supreme Court opinions |
| Custom codebase patterns | 250 | Legal AI system patterns |

45K+ examples distilled into 1,000 GRPO prompts (Phase 1 pilot).

Training Procedure

Method: GRPO (Group Relative Policy Optimization) — generates multiple completions per prompt, scores them with reward functions, and updates the policy to favor higher-reward outputs.
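The core of the update can be sketched as computing group-relative advantages: each completion's reward is normalized against the mean and standard deviation of its own group, so no learned value baseline is needed. This is a simplified illustration, not the exact trainer implementation:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each completion's reward against its group's statistics.

    Completions scoring above the group mean get positive advantage
    (reinforced); those below get negative advantage (penalized).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# With 2 generations per prompt (as in this run), a group where one
# completion cites correctly and the other does not:
advantages = group_relative_advantages([1.0, 0.0])  # ≈ [1.0, -1.0]
```

Because advantages are relative within each group, the absolute scale of the reward functions matters less than their ranking of completions.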

Training Hyperparameters

  • Training regime: BFloat16
  • Optimizer: AdamW 8-bit
  • Learning rate: 5e-6 (cosine schedule)
  • Batch size: 4 (gradient accumulation: 2, effective: 8)
  • Generations per prompt: 2
  • Max completion length: 256 tokens
  • Warmup steps: 30
  • Weight decay: 0.01
  • Max grad norm: 0.5
  • Epochs: 1

Reward Functions

A single combined reward built from five weighted signals, collapsed into one function to keep training throughput high:

| Signal | Weight | Description |
|---|---|---|
| Citation accuracy | 0.25 | Valid Bluebook, U.S.C., CFR citation formats |
| Reasoning logic | 0.25 | Logical connectors (therefore, because, pursuant to) |
| Legal formatting | 0.20 | Numbered lists, paragraph structure |
| Anti-hallucination | 0.15 | Penalize fabricated case names |
| Length efficiency | 0.15 | Target 100-250 words |
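A minimal sketch of how five weighted signals combine into the single reward. The weights come from the table above; the individual scorers here are toy stand-ins for the actual heuristics:

```python
# Weights from the reward table; each scorer returns a value in [0, 1].
WEIGHTS = {
    "citation": 0.25,
    "reasoning": 0.25,
    "formatting": 0.20,
    "anti_hallucination": 0.15,
    "length": 0.15,
}

def length_score(text, lo=100, hi=250):
    """1.0 inside the 100-250 word target band, 0.0 outside."""
    n = len(text.split())
    return 1.0 if lo <= n <= hi else 0.0

def reasoning_score(text):
    """Fraction of expected logical connectors present in the text."""
    connectors = ["therefore", "because", "pursuant to"]
    hits = sum(1 for c in connectors if c in text.lower())
    return hits / len(connectors)

def combined_reward(scores):
    """Weighted sum of per-signal scores, yielding a reward in [0, 1]."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
```

During training, each generated completion is scored once with the combined function and the resulting scalar feeds the GRPO update.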

Speeds, Sizes, Times

  • Total training time: ~1.3 hours (4,744 seconds)
  • Steps: 250
  • Throughput: 0.21 samples/sec
  • Hardware: NVIDIA RTX PRO 6000 Blackwell Edition (95 GB VRAM)
  • Cloud: Google Colab G4 GPU

Bias, Risks, and Limitations

  • May hallucinate case citations despite anti-hallucination reward signal
  • Trained primarily on U.S. federal law — limited state and international coverage
  • Text-only adapter — vision/audio capabilities of base Gemma 4 pass through unmodified
  • Phase 1 pilot training (1K prompts, 1 epoch) — production training with full dataset recommended

Adapter Surgery Note

The original training produced 884 tensors (588 language + 224 vision + 72 audio). Despite finetune_vision_layers=False, the generic target_modules list (q_proj, k_proj, etc.) matched the corresponding projections in the vision and audio sub-models as well. Those vision/audio tensors use Gemma4ClippableLinear, which PEFT cannot merge (upstream issue).

This adapter has been surgically cleaned to contain only the 588 language_model tensors (140 MB). Vision and audio capabilities pass through from the base model unmodified.
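The cleaning amounts to filtering the adapter's state dict down to language-model keys. A sketch of the idea, where the key prefixes are assumptions based on typical PEFT naming and the safetensors load/save calls are shown only in comments:

```python
def keep_language_tensors(state_dict):
    """Drop vision/audio LoRA tensors, keeping only language_model ones.

    PEFT names adapter tensors after the module path they attach to,
    so the multimodal sub-models can be separated by key substring.
    """
    return {
        name: tensor
        for name, tensor in state_dict.items()
        if ".language_model." in name
    }

# Applied to real adapter weights (safetensors API, for context):
#   from safetensors.torch import load_file, save_file
#   tensors = load_file("adapter_model.safetensors")
#   save_file(keep_language_tensors(tensors), "adapter_model.safetensors")

# Toy example with hypothetical key names standing in for real tensors:
example = {
    "base_model.model.language_model.layers.0.self_attn.q_proj.lora_A.weight": "L",
    "base_model.model.vision_tower.blocks.0.attn.q_proj.lora_A.weight": "V",
    "base_model.model.audio_tower.layers.0.attn.q_proj.lora_A.weight": "A",
}
kept = keep_language_tensors(example)
```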

Environmental Impact

  • Hardware Type: NVIDIA RTX PRO 6000 Blackwell
  • Hours used: ~1.3
  • Cloud Provider: Google Colab
  • Compute Region: US

Technical Specifications

Compute Infrastructure

  • Hardware: NVIDIA RTX PRO 6000 Blackwell Edition (95 GB, Compute 12.0)
  • Software: Unsloth 2026.4.2, TRL 0.24.0, PEFT 0.18.1, Transformers 5.5.0, PyTorch 2.10.0+cu128

Citation

@misc{gemma4-legal-grpo-2026,
  title={Gemma 4 E4B Legal GRPO Adapter},
  author={Semaj90},
  year={2026},
  url={https://huggingface.co/Semaj90/gemma4-e4b-legal-grpo}
}

Framework Versions

  • PEFT 0.18.1
  • Unsloth 2026.4.2
  • TRL 0.24.0
  • Transformers 5.5.0
  • PyTorch 2.10.0+cu128