Bertha: A Portuguese BERT-Small Pre-trained from Scratch
Bertha is a Transformer-based language model following the BERT architecture, pre-trained entirely from scratch on Brazilian Portuguese data. It is designed to provide a lightweight yet robust foundation for Portuguese NLP tasks, optimized for efficient CPU execution and low-latency production environments.
Model Details
- Developed by: Leonardo Lessa Aramaki
- Model type: BERT-Small (Encoder-only)
- Language: Portuguese (PT-BR)
- License: CC-BY-4.0 (Inherited from Corpus Carolina)
- Pre-trained from scratch: Yes
Training Hardware & Configuration
The model was pre-trained with a masked language modeling (MLM) objective.
- Training Corpus: Corpus Carolina (C4AI/USP), a curated and diverse open dataset of Brazilian Portuguese.
- Hardware: NVIDIA H100 GPU.
- Precision: BF16.
- Architecture:
- Hidden size: 512
- Layers: 6 (Small configuration)
- Attention heads: 8
- Final Loss: ~1.9
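For a sense of scale, the parameter count implied by this configuration can be estimated from the architecture alone. Below is a minimal sketch; the vocabulary size (~30k WordPiece tokens), intermediate FFN size (4x the hidden size), and maximum position count (512) are assumptions, as none of them are stated above:

```python
def bert_param_estimate(vocab_size=30_000, hidden=512, layers=6,
                        intermediate=2048, max_pos=512):
    """Rough parameter count for an encoder-only BERT of this shape."""
    # Embeddings: token + position + segment tables, plus one LayerNorm
    embeddings = (vocab_size + max_pos + 2) * hidden + 2 * hidden
    # Self-attention: Q, K, V and output projections (weights + biases)
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward: up- and down-projections (weights + biases)
    ffn = hidden * intermediate + intermediate + intermediate * hidden + hidden
    # Two LayerNorms per layer (gamma and beta each)
    norms = 2 * 2 * hidden
    return embeddings + layers * (attention + ffn + norms)

print(f"~{bert_param_estimate() / 1e6:.1f}M parameters")
```

Under these assumptions the encoder comes out to roughly 35M parameters, consistent with a "small" configuration.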
Intended Use
Bertha is intended to be used as a base model for downstream NLP tasks in Portuguese, such as:
- Named Entity Recognition (NER)
- Text Classification
- Sentiment Analysis
- Information Extraction
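As an illustration of the downstream setup, a classification head can be attached to this architecture with `transformers`. This is only a sketch: the config mirrors the architecture above, but `intermediate_size=2048` and `num_labels=2` are assumptions, and the weights here are randomly initialized rather than loaded from the checkpoint (pass the Hub model id to `from_pretrained` for actual fine-tuning):

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Mirror Bertha's architecture; weights are random here, not the
# pre-trained checkpoint.
config = BertConfig(
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,  # assumed: 4x hidden size
    num_labels=2,            # e.g. binary sentiment
)
model = BertForSequenceClassification(config)

# A dummy batch: one sequence of 16 token ids
input_ids = torch.randint(0, config.vocab_size, (1, 16))
logits = model(input_ids=input_ids).logits
print(logits.shape)  # (batch, num_labels)
```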
How to use
from transformers import pipeline

# Load the fill-mask pipeline backed by the Bertha checkpoint
mask_filler = pipeline("fill-mask", model="Le0ssa/bertha-portuguese-small")

# Predict the masked token; returns a list of candidates with scores
text = "O Brasil é um [MASK] maravilhoso."
results = mask_filler(text)
print(results)
Citation (LaTeX)
To cite this model in your research, please use the following BibTeX entry:
@misc{aramaki2026bertha,
  author       = {Aramaki, Leonardo Lessa},
  title        = {Bertha: A Portuguese BERT-Small Pre-trained from Scratch on Corpus Carolina},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/Le0ssa/bertha-portuguese-small}}
}