Bertha: A Portuguese BERT-Small Pre-trained from Scratch
Bertha is a Transformer-based language model following the BERT architecture, pre-trained entirely from scratch on Brazilian Portuguese data. It is designed to provide a lightweight yet robust foundation for Portuguese NLP tasks, optimized for efficient CPU execution and low-latency production environments.
Model Details
- Developed by: Leonardo Lessa Aramaki
- Model type: BERT-Small (Encoder-only)
- Language: Portuguese (PT-BR)
- License: CC-BY-4.0 (Inherited from Corpus Carolina)
- Pre-trained from scratch: Yes
Training Hardware & Configuration
The model was pre-trained with a masked language modeling (MLM) objective.
- Training Corpus: Corpus Carolina (C4AI/USP), a curated and diverse open dataset of Brazilian Portuguese.
- Hardware: NVIDIA H100 GPU.
- Precision: BF16.
- Architecture:
- Hidden size: 512
- Layers: 6 (Small configuration)
- Attention heads: 8
- Final Loss: ~1.9
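For a sense of scale, the parameter count implied by this configuration can be estimated from the architecture alone. Below is a minimal sketch; the vocabulary size (~30k WordPiece tokens), intermediate FFN size (4x the hidden size), and maximum position count (512) are assumptions, as none of them are stated above:

```python
def bert_param_estimate(vocab_size=30_000, hidden=512, layers=6,
                        intermediate=2048, max_pos=512):
    """Rough parameter count for an encoder-only BERT of this shape."""
    # Embeddings: token + position + segment tables, plus one LayerNorm
    embeddings = (vocab_size + max_pos + 2) * hidden + 2 * hidden
    # Self-attention: Q, K, V and output projections (weights + biases)
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward: up- and down-projections (weights + biases)
    ffn = hidden * intermediate + intermediate + intermediate * hidden + hidden
    # Two LayerNorms per layer (gamma and beta each)
    norms = 2 * 2 * hidden
    return embeddings + layers * (attention + ffn + norms)

print(f"~{bert_param_estimate() / 1e6:.1f}M parameters")
```

Under these assumptions the encoder comes out to roughly 35M parameters, consistent with a "small" configuration.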
Intended Use
Bertha is intended to be used as a base model for downstream NLP tasks in Portuguese, such as:
- Named Entity Recognition (NER)
- Text Classification
- Sentiment Analysis
- Information Extraction
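As an illustration of the downstream setup, a classification head can be attached to this architecture with `transformers`. This is only a sketch: the config mirrors the architecture above, but `intermediate_size=2048` and `num_labels=2` are assumptions, and the weights here are randomly initialized rather than loaded from the checkpoint (pass the Hub model id to `from_pretrained` for actual fine-tuning):

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Mirror Bertha's architecture; weights are random here, not the
# pre-trained checkpoint.
config = BertConfig(
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,  # assumed: 4x hidden size
    num_labels=2,            # e.g. binary sentiment
)
model = BertForSequenceClassification(config)

# A dummy batch: one sequence of 16 token ids
input_ids = torch.randint(0, config.vocab_size, (1, 16))
logits = model(input_ids=input_ids).logits
print(logits.shape)  # (batch, num_labels)
```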
How to use
from transformers import pipeline

# Load the fill-mask pipeline backed by the Bertha checkpoint
mask_filler = pipeline("fill-mask", model="Le0ssa/bertha-portuguese-small")

# Predict the masked token; returns a list of candidates with scores
text = "O Brasil é um [MASK] maravilhoso."
results = mask_filler(text)
print(results)
Citation (LaTeX)
To cite this model in your research, please use the following BibTeX entry:
@misc{aramaki2026bertha,
  author       = {Aramaki, Leonardo Lessa},
  title        = {Bertha: A Portuguese BERT-Small Pre-trained from Scratch on Corpus Carolina},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/Le0ssa/bertha-portuguese-small}}
}