---
language:
- eng
- lug
license: apache-2.0
datasets:
- reuben256/tekjuice-eng-lug-target
metrics:
- bleu
base_model:
- facebook/nllb-200-distilled-600M
pipeline_tag: translation
library_name: transformers
tags:
- Luganda
- Low-Resource
- Seq2Seq
- Distilled
- Machine Translation
- NLLB
- AfricaNLP
- tekjuice
---

# 🧠 Model Card: `reuben256/nllb-distilled-600-lug`

## 🌍 Overview

`reuben256/nllb-distilled-600-lug` is a fine-tuned version of Meta AI's [NLLB-200 distilled 600M](https://huggingface.co/facebook/nllb-200-distilled-600M) model for **English ↔ Luganda** machine translation.

It was developed by **tekjuice AI** 🧪 to support translation in **low-resource African languages**, specifically Luganda 🇺🇬, a widely spoken Bantu language in Uganda.

---

## 🚀 Use Cases

This model is designed for:

- 📚 Translating educational and public health materials
- 📰 Localizing government or NGO communications
- 🔬 Supporting linguistic and NLP research
- 🧩 Enabling cross-lingual tasks via translation (e.g., summarization, QA)

---

## 📦 Training Data

Fine-tuned on the dataset [`reuben256/tekjuice-eng-lug-target`](https://huggingface.co/datasets/reuben256/tekjuice-eng-lug-target), which includes:

- 📖 Public-domain and open-source parallel corpora
- 🌐 Crowdsourced and community-translated sentences
- 🗞️ Aligned media and educational content

---

## 📊 Evaluation

The model was evaluated with the **BLEU** metric 📘, which measures n-gram precision against reference translations. Testing was done on a held-out set with domain characteristics similar to the training data.

> ⚠️ Note: Human evaluation is recommended for assessing fluency, nuance, and cultural accuracy.

---

## 🏗️ Base Model

Built on top of:

- 🧬 `facebook/nllb-200-distilled-600M`, a distilled multilingual model optimized for **speed** and **low-resource language performance**.
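To make the BLEU metric concrete, here is a minimal, illustrative sketch of how it works: the geometric mean of modified n-gram precisions multiplied by a brevity penalty. This is a toy implementation for intuition only; actual scores should come from a standard tool such as `sacrebleu`.

```python
from collections import Counter
import math

def bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Toy sentence-level BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    # Brevity penalty discourages overly short hypotheses
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(f"{bleu('the cat sat on the mat', 'the cat sat on the mat'):.3f}")  # 1.000
```

Identical sentences score 1.0, while sentences sharing no n-grams score near 0; real implementations add smoothing and corpus-level aggregation on top of this basic scheme.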
--- ## โš ๏ธ Limitations - โŒ May struggle with slang, idioms, and culturally specific phrases - ๐Ÿ“‰ Biases in training data may be reflected in outputs - ๐Ÿ’ก Performance may degrade on out-of-domain or highly technical content --- ## ๐Ÿ”ฎ Future Plans Coming improvements: - ๐Ÿ“ˆ Larger and more diverse datasets - ๐Ÿ” Reverse direction (Luganda โ†’ English) - ๐Ÿฅ Domain-specific fine-tuning (e.g., health, legal) - ๐Ÿง  Quality estimation and confidence scoring --- ## ๐Ÿš€ How to Use ```python from transformers import AutoModelForSeq2SeqLM, AutoTokenizer MODEL = "reuben256/nllb-distilled-600-lug" tokenizer = AutoTokenizer.from_pretrained(MODEL) model = AutoModelForSeq2SeqLM.from_pretrained(MODEL) tokenizer.src_lang = "eng_Latn" tokenizer.tgt_lang = "lug_Latn" text = "Farmers should plant more trees?" inputs = tokenizer(text, return_tensors="pt") translated_tokens = model.generate(**inputs) print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True))