
ak3ra/gemma-4-e2b-mixed-sft-fft

Mixed-modality SFT of Gemma 4 E2B on Luganda speech (ASR) and English↔Luganda/Acholi translation. A single fully fine-tuned model transcribes speech and translates text with one set of weights.

Recipe

  • Base: jq/e2b-pretrain-eosmask (Gemma 4 E2B continued-pretraining checkpoint)

  • Method: full fine-tuning, 1 epoch

  • LR: 5e-5, effective batch size 16, max_length 8192

  • Optimizer: adamw_torch

  • Training data (deduped, 16,704 mixed rows):

    • 6 Luganda speech subsets: Sunbird/speech/lug_{commonvoice,fleurs,makbenchmark,makerereradio,salt,waxal} (audio clips >30 s filtered out; 4-step audio augmentation)
    • Sunbird/sunflower-posttrain-data sft_translations (eng/lug/ach)
    • Sunbird/sunflower-posttrain-data sft_instructions (eng/lug/ach)
  • System message (used at training; must be matched at inference):

    You are an assistant that transcribes speech and translates Ugandan languages.
    
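The >30 s audio filter in the data prep above amounts to a duration check over the raw sample array. A minimal sketch (assumes 16 kHz mono audio, as used throughout; `keep_clip` is an illustrative name, not the actual preprocessing code):

```python
import numpy as np

MAX_SECONDS = 30.0
SAMPLE_RATE = 16_000  # the speech subsets are cast to 16 kHz mono

def keep_clip(audio_array: np.ndarray, sr: int = SAMPLE_RATE) -> bool:
    """True if the clip is at most 30 s long (duration = samples / sample rate)."""
    return len(audio_array) / sr <= MAX_SECONDS

# With datasets: ds.filter(lambda ex: keep_clip(ex["audio"]["array"]))
```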

Eval results

Selected: last checkpoint (step 1043, epoch 1.0). Loss-based selection ("best by eval_text_loss") was unreliable for this run: text loss was flat across the last few evals while the model was still resolving repetition collapses on hard audio clips, which the loss could not see.
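Such repetition collapses are cheap to flag directly on decoded text, even when loss does not surface them. A rough sketch (the function name and thresholds are illustrative, not part of the eval scripts):

```python
def has_repetition_collapse(text: str, n: int = 4, max_repeats: int = 3) -> bool:
    """Flag outputs where some n-gram of words repeats back-to-back more than
    max_repeats times, a cheap proxy for degenerate decoding loops."""
    words = text.split()
    for i in range(max(len(words) - n, 0)):
        gram = words[i : i + n]
        k = 1  # count consecutive copies of this n-gram
        while words[i + k * n : i + (k + 1) * n] == gram:
            k += 1
        if k > max_repeats:
            return True
    return False
```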

Translation (Sunbird/sunflower-translation-eval test, 99 ex × 2 langs)

metric value
avg chrF 0.4561
avg BLEU 17.46
lug chrF 0.5104
ach chrF 0.4018

ASR (6 Luganda speech sets, dev[:10] each, 55 examples)

metric value
WER 0.4516
CER 0.1129

WER/CER are per-example averaged (matches scripts/eval_asr.py after PR #20).
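The averaging choice matters: per-example WER weights every clip equally, while pooled (corpus-level) WER weights long references more heavily. A self-contained sketch with toy strings (not drawn from the eval set):

```python
def edit_distance(a: list, b: list) -> int:
    """Word-level Levenshtein distance via the classic DP recurrence."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / max(len(r), 1)

refs = ["omukazi agenda mu katale", "nkwagala"]
hyps = ["omukazi agenda katale", "nkwagala nnyo"]

# Per-example average (what the table above reports): (0.25 + 1.0) / 2 = 0.625
per_example = sum(wer(r, h) for r, h in zip(refs, hyps)) / len(refs)
# Pooled corpus WER, total errors / total reference words: (1 + 1) / 5 = 0.4
pooled = (sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
          / sum(len(r.split()) for r in refs))
```

Short references (common in these dev subsets) pull the per-example average up, which is why the two conventions can diverge noticeably at n=10 per set.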

Quick start (Colab-ready)

Hardware: ~10 GB bf16 weights. Comfortable on A100 (40 GB) or L4 (24 GB). On a free Colab T4 (16 GB), pass load_in_4bit=True to FastModel.from_pretrained.
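The ~10 GB figure is just parameters × bytes per parameter. A back-of-envelope helper (ignores activations and KV cache, which add a few GB at long sequence lengths):

```python
def weight_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB, excluding activations and KV cache."""
    return n_params * bytes_per_param / 1e9

print(weight_gb(5e9, 2.0))   # bf16: 10.0 GB
print(weight_gb(5e9, 0.5))   # 4-bit: 2.5 GB, hence the T4 advice above
```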

Install:

pip install -U "transformers>=5.7" "unsloth>=2026.4" librosa soundfile datasets

Speech recognition (Luganda audio → transcript)

import os
os.environ.setdefault("UNSLOTH_COMPILE", "0")
os.environ.setdefault("TORCHDYNAMO_DISABLE", "1")

import torch
from datasets import load_dataset, Audio
from transformers import AutoProcessor
from unsloth import FastModel

REPO   = "ak3ra/gemma-4-e2b-mixed-sft-fft"
SYSTEM = "You are an assistant that transcribes speech and translates Ugandan languages."

model, _ = FastModel.from_pretrained(
    model_name=REPO,
    max_seq_length=8192,
    load_in_4bit=False,    # set True on a free Colab T4
    fast_inference=False,
)
# Unsloth's loader can return a tokenizer-only object for VLMs; reload the
# multimodal processor explicitly so audio inputs are encoded.
processor = AutoProcessor.from_pretrained(REPO)
model.eval()

# Pull a Luganda sample from SALT dev (any 16 kHz mono audio array works).
ds = load_dataset("Sunbird/speech", name="lug_salt", split="dev[:1]")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
audio = ds[0]["audio"]["array"]
print("Reference :", ds[0]["text"])

messages = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM}]},
    {"role": "user",   "content": [
        {"type": "audio", "audio": audio},
        {"type": "text",  "text": "Please transcribe this Luganda audio."},
    ]},
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
gen = out[0, inputs["input_ids"].shape[-1]:]
print("Prediction:", processor.tokenizer.decode(gen, skip_special_tokens=True).strip())

Translation (English → Luganda or Acholi)

# Reuses `model`, `processor`, `SYSTEM` from above.
text = "How is the weather today?"
messages = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM}]},
    {"role": "user",   "content": [{"type": "text", "text": f"Translate to Luganda: {text}"}]},
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
gen = out[0, inputs["input_ids"].shape[-1]:]
print(processor.tokenizer.decode(gen, skip_special_tokens=True).strip())

For Acholi, swap "Translate to Luganda" → "Translate to Acholi".
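If you switch targets often, a tiny helper keeps the prompts consistent (a sketch; `build_translation_messages` is an illustrative name, and its result is passed to `processor.apply_chat_template` exactly as above):

```python
SYSTEM = "You are an assistant that transcribes speech and translates Ugandan languages."

def build_translation_messages(text: str, target: str = "Luganda") -> list:
    """Chat messages for a translation request; target is 'Luganda' or 'Acholi'."""
    return [
        {"role": "system", "content": [{"type": "text", "text": SYSTEM}]},
        {"role": "user",
         "content": [{"type": "text", "text": f"Translate to {target}: {text}"}]},
    ]
```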

Plain transformers fallback

Works, but slightly different attention / KV-cache numerics can flip borderline audio clips into repetition loops, so per-example averaged WER on small (n=10) eval subsets may differ by ~0.05–0.10 from the Unsloth path above. Use Unsloth for benchmarks; this path is fine for casual use.

from transformers import AutoModelForCausalLM, AutoProcessor

model     = AutoModelForCausalLM.from_pretrained(REPO, dtype="bfloat16", device_map="auto")
processor = AutoProcessor.from_pretrained(REPO)
model.eval()
# ... build `messages` and `inputs` exactly as above, then model.generate(...)

Repository

Trained from https://github.com/SunbirdAI/sunflower branch sft-hp-sweeps.

Safetensors: 5B params, BF16