USA Immigration Law: Llama 3.2 3B

Fine-tuned from meta-llama/Llama-3.2-3B-Instruct on the nshportun/usa-immigration-law-qa dataset: 17,058 source-grounded Q&A pairs covering all major U.S. immigration subdomains.

Training Details

| Setting | Value |
| --- | --- |
| Base model | Llama 3.2 3B Instruct |
| Method | LoRA (r=8, alpha=32, merged into base weights) |
| Training pairs | 16,065 |
| Eval pairs | 993 (stratified across 13 subdomains) |
| Epochs | 1 |
| Batch size | 1 per device (int8 quantization) |
| Learning rate | 1e-4 |
| Max input length | 512 tokens |
| Infrastructure | AWS SageMaker ml.g5.2xlarge (24GB VRAM) |
| Train loss | 0.894 |
| Eval loss | 0.903 |
| Eval perplexity | 2.47 |
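
The reported eval perplexity follows from the eval loss, assuming the loss is the mean per-token cross-entropy in nats (the usual convention for causal language model training, though the card does not state it). The sketch below recomputes it and also derives the LoRA scaling factor alpha/r implied by the listed hyperparameters. The function names are illustrative and not taken from the training code.

```python
import math


def perplexity_from_loss(mean_cross_entropy_nats):
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(mean_cross_entropy_nats)


def lora_scaling(alpha, r):
    """Standard LoRA scales the low-rank update by alpha / r."""
    return alpha / r


eval_ppl = perplexity_from_loss(0.903)   # eval loss from the table
train_ppl = perplexity_from_loss(0.894)  # train loss, for comparison
scale = lora_scaling(alpha=32, r=8)      # LoRA settings from the table

print(f"eval perplexity  ~ {eval_ppl:.2f}")
print(f"train perplexity ~ {train_ppl:.2f}")
print(f"LoRA scaling     = {scale}")
```

The eval value rounds to 2.47, which agrees with the table.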

Benchmark Results

Evaluated on a stratified random sample of 101 questions from the held-out eval set, spanning the 13 immigration subdomains plus a General category (14 groups in the per-subdomain tables). Answers scored 0–3 by an LLM judge (Claude Sonnet 4.6) against reference answers from official sources.

Scoring scale: 0 = wrong/hallucinated · 1 = partially correct · 2 = mostly correct · 3 = fully correct

Evaluation date: 2026-05-17
Judge model: us.anthropic.claude-sonnet-4-6 (Amazon Bedrock)
Eval set source: nshportun/usa-immigration-law-qa, split=eval, seed=42
Fine-tuned model inference: local CPU (transformers 5.8.1, bfloat16, device_map=cpu)
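
The card does not show the judge prompt, so the following is only a sketch of how a 0–3 rubric like the one above could be turned into a prompt and how a numeric score could be parsed from the judge's reply. The prompt wording, the `SCORE:` reply format, and both function names are assumptions, not the repository's actual implementation.

```python
import re

# Rubric labels copied from the scoring scale in this card.
RUBRIC = {
    0: "wrong/hallucinated",
    1: "partially correct",
    2: "mostly correct",
    3: "fully correct",
}


def build_judge_prompt(question, reference, candidate):
    """Assemble a grading prompt (hypothetical wording)."""
    scale = "\n".join(f"{score} = {label}" for score, label in RUBRIC.items())
    return (
        "Score the candidate answer against the reference answer.\n"
        f"Scale:\n{scale}\n\n"
        f"Question: {question}\n"
        f"Reference: {reference}\n"
        f"Candidate: {candidate}\n"
        "Reply with a single line of the form 'SCORE: <0-3>'."
    )


def parse_score(judge_reply):
    """Return the integer score, or None if the reply has no valid score."""
    match = re.search(r"SCORE:\s*([0-3])\b", judge_reply)
    return int(match.group(1)) if match else None
```

Returning `None` for unparseable replies lets the caller retry or skip the question instead of silently recording a wrong score.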

Overall Scores

| Model | Mean Score (0–3) | % Fully Correct (score=3) | N |
| --- | --- | --- | --- |
| Llama 3.2 3B fine-tuned (this model) | 0.68 | 7.9% | 101 |
| Claude Sonnet 4.6 zero-shot | 1.47 | 25.7% | 101 |
| Llama 3 8B zero-shot (base family) | 0.80 | 2.0% | 101 |

Why baselines matter: Claude Sonnet 4.6 is a frontier model far larger than this 3B model. Llama 3 8B zero-shot achieves only 2.0% fully-correct on these domain-specific questions, which indicates how difficult the task is. The fine-tuned 3B model achieves 7.9% fully-correct, outperforming the zero-shot 8B baseline on that metric despite being about 2.7x smaller, although its mean score (0.68) is lower than the 8B baseline's (0.80).
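
To make the headline metrics concrete, the sketch below shows how the mean score and the fully-correct rate could be computed from a list of per-question scores, and how the reported percentages map back to question counts at N=101. The helper names are illustrative, and the four-item score list is a toy example, not benchmark data.

```python
def summarize(scores):
    """Aggregate 0-3 judge scores into the two headline metrics."""
    n = len(scores)
    fully_correct = sum(1 for s in scores if s == 3)
    return {
        "n": n,
        "mean": round(sum(scores) / n, 2),
        "pct_fully_correct": round(100 * fully_correct / n, 1),
    }


def implied_count(pct, n):
    """Number of questions implied by a reported percentage."""
    return round(pct / 100 * n)


# Toy example (not real benchmark data): one fully correct answer out of four.
print(summarize([3, 1, 0, 0]))

# Counts implied by the Overall Scores table at N=101.
fine_tuned = implied_count(7.9, 101)
llama_8b = implied_count(2.0, 101)
claude = implied_count(25.7, 101)
print(fine_tuned, llama_8b, claude, fine_tuned / llama_8b)
```

At this sample size, the difference between the fine-tuned model and the 8B baseline on fully-correct answers corresponds to a handful of questions.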

By Subdomain: Llama 3.2 3B Fine-tuned (this model)

| Subdomain | Mean Score | % Fully Correct | N |
| --- | --- | --- | --- |
| Travel documents | 1.83 | 33.3% | 6 |
| Naturalization | 1.13 | 25.0% | 8 |
| Statistics | 1.13 | 12.5% | 8 |
| Appeals | 1.00 | 0.0% | 3 |
| Nonimmigrant visas | 0.88 | 12.5% | 8 |
| Adjustment of status | 0.75 | 0.0% | 8 |
| Employment authorization | 0.75 | 12.5% | 8 |
| Asylum | 0.50 | 12.5% | 8 |
| Admissibility | 0.38 | 0.0% | 8 |
| Family-based immigration | 0.38 | 0.0% | 8 |
| Humanitarian | 0.38 | 0.0% | 8 |
| Removal | 0.38 | 0.0% | 8 |
| General | 0.25 | 0.0% | 8 |
| Employment-based (EB) | 0.00 | 0.0% | 4 |
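
As a consistency check, the per-subdomain rows for the fine-tuned model can be combined into the overall mean. The sketch assumes each reported mean is an integer score total divided by N and then rounded to two decimals, so the totals are recovered by rounding mean × N. Variable and function names are illustrative.

```python
# (subdomain, mean score, N) copied from the fine-tuned model's table.
rows = [
    ("Travel documents", 1.83, 6),
    ("Naturalization", 1.13, 8),
    ("Statistics", 1.13, 8),
    ("Appeals", 1.00, 3),
    ("Nonimmigrant visas", 0.88, 8),
    ("Adjustment of status", 0.75, 8),
    ("Employment authorization", 0.75, 8),
    ("Asylum", 0.50, 8),
    ("Admissibility", 0.38, 8),
    ("Family-based immigration", 0.38, 8),
    ("Humanitarian", 0.38, 8),
    ("Removal", 0.38, 8),
    ("General", 0.25, 8),
    ("Employment-based (EB)", 0.00, 4),
]


def recover_totals(table_rows):
    """Recover integer score sums from rounded means."""
    return [(name, round(mean * n), n) for name, mean, n in table_rows]


totals = recover_totals(rows)
total_score = sum(score for _, score, _ in totals)
total_n = sum(n for _, _, n in totals)
overall_mean = total_score / total_n

print(total_score, total_n, round(overall_mean, 2))
```

Averaging the rounded per-subdomain means directly gives about 0.686, which would round to 0.69. Recovering the integer totals first reproduces the reported 0.68 over 101 questions.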

By Subdomain: Claude Sonnet 4.6 Zero-Shot Baseline

| Subdomain | Mean Score | % Fully Correct | N |
| --- | --- | --- | --- |
| Travel documents | 2.33 | 33.3% | 6 |
| Adjustment of status | 2.25 | 62.5% | 8 |
| Humanitarian | 2.13 | 50.0% | 8 |
| Asylum | 2.00 | 50.0% | 8 |
| Admissibility | 1.50 | 25.0% | 8 |
| Naturalization | 1.50 | 25.0% | 8 |
| Nonimmigrant visas | 1.50 | 25.0% | 8 |
| Removal | 1.25 | 12.5% | 8 |
| Statistics | 1.25 | 12.5% | 8 |
| Family-based immigration | 1.13 | 12.5% | 8 |
| Appeals | 1.00 | 0.0% | 3 |
| Employment authorization | 0.75 | 12.5% | 8 |
| Employment-based (EB) | 0.75 | 25.0% | 4 |
| General | 0.75 | 0.0% | 8 |

By Subdomain: Llama 3 8B Zero-Shot Baseline

| Subdomain | Mean Score | % Fully Correct | N |
| --- | --- | --- | --- |
| Adjustment of status | 1.25 | 0.0% | 8 |
| Travel documents | 1.17 | 0.0% | 6 |
| Asylum | 1.13 | 12.5% | 8 |
| Removal | 0.88 | 0.0% | 8 |
| Statistics | 0.88 | 0.0% | 8 |
| Humanitarian | 0.75 | 12.5% | 8 |
| Naturalization | 0.75 | 0.0% | 8 |
| Admissibility | 0.75 | 0.0% | 8 |
| Nonimmigrant visas | 0.75 | 0.0% | 8 |
| Employment authorization | 0.63 | 0.0% | 8 |
| General | 0.63 | 0.0% | 8 |
| Employment-based (EB) | 0.50 | 0.0% | 4 |
| Family-based immigration | 0.50 | 0.0% | 8 |
| Appeals | 0.33 | 0.0% | 3 |
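
The per-subdomain tables for the fine-tuned model and the Llama 3 8B baseline can be compared row by row. The sketch below copies the mean scores from those two tables and splits the subdomains by which model scores higher. Dictionary and variable names are illustrative.

```python
# Mean scores copied from the fine-tuned and Llama 3 8B zero-shot tables.
fine_tuned = {
    "Travel documents": 1.83, "Naturalization": 1.13, "Statistics": 1.13,
    "Appeals": 1.00, "Nonimmigrant visas": 0.88, "Adjustment of status": 0.75,
    "Employment authorization": 0.75, "Asylum": 0.50, "Admissibility": 0.38,
    "Family-based immigration": 0.38, "Humanitarian": 0.38, "Removal": 0.38,
    "General": 0.25, "Employment-based (EB)": 0.00,
}
llama_8b = {
    "Adjustment of status": 1.25, "Travel documents": 1.17, "Asylum": 1.13,
    "Removal": 0.88, "Statistics": 0.88, "Humanitarian": 0.75,
    "Naturalization": 0.75, "Admissibility": 0.75, "Nonimmigrant visas": 0.75,
    "Employment authorization": 0.63, "General": 0.63,
    "Employment-based (EB)": 0.50, "Family-based immigration": 0.50,
    "Appeals": 0.33,
}

# Difference in mean score per subdomain (positive = fine-tuned model higher).
deltas = {name: fine_tuned[name] - llama_8b[name] for name in fine_tuned}
wins = sorted(name for name, d in deltas.items() if d > 0)
losses = sorted(name for name, d in deltas.items() if d < 0)

for name, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:28s} {delta:+.2f}")
print(f"fine-tuned higher in {len(wins)} subdomains, lower in {len(losses)}")
```

With only 3 to 8 questions per subdomain, individual differences are sensitive to a single question, so the split is better read as a rough pattern than as a ranking.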

Key Observations

  • The task is genuinely hard: Even Claude Sonnet 4.6 (a frontier model) scores only 1.47/3.0 mean and 25.7% fully-correct. This reflects the highly specific, citation-level precision required by immigration procedural questions.
  • Fine-tuning boosts fully-correct rate: The 3B fine-tuned model achieves 7.9% fully-correct (8 of 101) vs. 2.0% (2 of 101) for the zero-shot Llama 3 8B baseline, roughly a 4x improvement on exact correctness despite being about 2.7x smaller and trained for only 1 epoch.
  • Strongest subdomains for fine-tuned model: travel documents (1.83), naturalization (1.13), statistics (1.13). Naturalization is well represented in the training data (~2,670 pairs), but travel documents (~109) and statistics (~141) are not, so training volume alone does not explain these scores.
  • Weakest subdomains: employment-based (0.00), general (0.25), and four subdomains tied at 0.38 (admissibility, family-based immigration, humanitarian, removal). These topics tend to require cross-referencing multiple USCIS form instructions or policy details.
  • Room for improvement: The fine-tuned model's mean (0.68) is below the zero-shot Llama 3 8B baseline (0.80), suggesting either that 1-epoch training is insufficient or that the model needs more specific instruction tuning rather than completion-style fine-tuning.

Reproducing the Benchmark

# Clone repo and install deps
git clone https://github.com/nshportun/usa-immigration
cd usa-immigration
pip install -r requirements.txt

# Set environment variables (AWS Bedrock for baseline models + judge)
export ACCOUNT2_AWS_ACCESS_KEY_ID=...
export ACCOUNT2_AWS_SECRET_ACCESS_KEY=...

# Run baseline benchmark (Claude Sonnet + Llama 3 8B via Bedrock)
python scripts/benchmark/run_benchmark.py

# Run fine-tuned model inference on CPU (requires model artifacts locally)
# Download from: https://huggingface.co/nshportun/usa-immigration-llama-3.2-3b
python scripts/benchmark/run_local_finetuned.py

# Results written to:
#   data_local/benchmark/results.jsonl  (per-question scores)
#   data_local/benchmark/summary.json   (aggregate table)

The benchmark script supports resume: it skips already-scored questions. random.seed(42) ensures the same 101-question sample is selected each run.
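
A minimal sketch of how seeded sampling and resume could work together is shown below. It is not the repository's code: the function names and the `id` field in each record are assumptions, and stratification by subdomain is omitted for brevity.

```python
import json
import random
from pathlib import Path


def load_scored_ids(results_path):
    """Collect ids of questions already present in a JSONL results file."""
    path = Path(results_path)
    if not path.exists():
        return set()
    scored = set()
    with path.open() as handle:
        for line in handle:
            line = line.strip()
            if line:
                scored.add(json.loads(line)["id"])  # "id" is an assumed field name
    return scored


def select_sample(questions, k, seed=42):
    """Draw the same k questions on every run by using a fixed seed."""
    rng = random.Random(seed)
    return rng.sample(questions, k)


def pending_questions(questions, k, results_path, seed=42):
    """Return sampled questions that have not been scored yet."""
    scored = load_scored_ids(results_path)
    return [q for q in select_sample(questions, k, seed) if q["id"] not in scored]
```

The sketch uses a local `random.Random(seed)` instance instead of the global `random.seed(42)` mentioned above, so other uses of the `random` module cannot shift the sample. Either approach yields a repeatable selection as long as the input order of `questions` is stable.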

Immigration Subdomains Covered

| Subdomain | QA Pairs |
| --- | --- |
| Family-based immigration | ~3,987 |
| Naturalization | ~2,670 |
| Asylum | ~2,094 |
| Adjustment of status | ~1,727 |
| Removal | ~1,277 |
| Humanitarian | ~894 |
| Employment authorization | ~832 |
| Admissibility | ~553 |
| Nonimmigrant visas | ~548 |
| Statistics | ~141 |
| Travel documents | ~109 |
| Employment-based (EB) | ~74 |
| Appeals | ~66 |
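
The approximate counts above can be summed and compared with the 17,058 total stated at the top of the card. The sketch below does that arithmetic. The counts are approximate in the source, so the remainder is approximate as well.

```python
# Approximate per-subdomain counts copied from the table above.
listed_counts = {
    "Family-based immigration": 3987,
    "Naturalization": 2670,
    "Asylum": 2094,
    "Adjustment of status": 1727,
    "Removal": 1277,
    "Humanitarian": 894,
    "Employment authorization": 832,
    "Admissibility": 553,
    "Nonimmigrant visas": 548,
    "Travel documents": 109,
    "Employment-based (EB)": 74,
    "Appeals": 66,
    "Statistics": 141,
}

TOTAL_PAIRS = 17_058  # dataset size stated in the card

listed_total = sum(listed_counts.values())
unlisted = TOTAL_PAIRS - listed_total

print(f"listed:   {listed_total}")
print(f"unlisted: {unlisted} ({100 * unlisted / TOTAL_PAIRS:.1f}% of the dataset)")
```

Roughly 2,086 pairs are not attributed to any listed subdomain. One plausible reading is that they belong to the General category that appears in the benchmark tables, but the card does not confirm this.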

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "nshportun/usa-immigration-llama-3.2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" places weights on a GPU when one is available (requires the accelerate package).
# Older transformers releases expect torch_dtype instead of dtype.
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "system", "content": "You are an expert on U.S. immigration law. Answer accurately based on USCIS, 8 CFR, and BIA sources."},
    {"role": "user", "content": "What is the filing fee for Form I-485?"},
]
# Render the chat template as text, ending with the assistant header so the model starts its answer.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# Greedy decoding (do_sample=False) keeps the output deterministic.
out = model.generate(**inputs, max_new_tokens=300, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
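
For retrieval-augmented use, the `messages` list in the snippet above could be built by a small helper that optionally prepends retrieved passages to the question. This is a hypothetical sketch: the `build_messages` name, the `Context:`/`Question:` layout, and the assumption that the model benefits from inline context are not taken from the card. Since training used a 512-token maximum input length, long contexts may fall outside what the model saw during fine-tuning.

```python
# System prompt copied from the usage snippet in this card.
SYSTEM_PROMPT = (
    "You are an expert on U.S. immigration law. "
    "Answer accurately based on USCIS, 8 CFR, and BIA sources."
)


def build_messages(question, passages=None):
    """Build a chat message list, optionally with numbered context passages."""
    user_content = question
    if passages:
        numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
        user_content = f"Context:\n{numbered}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]


# Without passages this reproduces the structure used in the snippet above.
messages = build_messages("What is the filing fee for Form I-485?")
```

The returned list can be passed to `tokenizer.apply_chat_template` in place of the hand-written `messages`.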

Data Sources

  • USCIS Policy Manual (primary_official)
  • USCIS Forms & Instructions (I-130, I-485, I-765, N-400, I-589...) (primary_official)
  • 8 CFR / INA statute text (primary_official)
  • BIA Precedent Decisions (primary_official)
  • harshitha008/US-immigration-laws (Apache 2.0) (secondary_reputable)
  • Law StackExchange immigration posts (community)

Intended Use

  • RAG-based immigration legal assistants
  • Domain-specific LLM benchmarking
  • Immigration law Q&A research

Disclaimer

This model is for research and educational purposes only. It does not constitute legal advice. Immigration law is complex and changes frequently; always consult a licensed immigration attorney.
