Bertha: A Portuguese BERT-Small Pre-trained from Scratch

Bertha is a Transformer-based language model following the BERT architecture, trained entirely from scratch on Brazilian Portuguese data. It is designed as a lightweight yet robust foundation for Portuguese NLP tasks, suited to efficient CPU inference and low-latency production environments.

Model Details

  • Developed by: Leonardo Lessa Aramaki
  • Model type: BERT-Small (Encoder-only)
  • Language: Portuguese (PT-BR)
  • License: CC-BY-4.0 (Inherited from Corpus Carolina)
  • Pre-trained from scratch: Yes

Training Hardware & Configuration

The model was trained using a Masked Language Modeling (MLM) objective on high-performance infrastructure.

  • Training Corpus: Corpus Carolina (C4AI/USP), a curated and diverse open dataset of Brazilian Portuguese.
  • Hardware: NVIDIA H100 GPU.
  • Precision: BF16.
  • Architecture:
    • Hidden size: 512
    • Layers: 6 (Small configuration)
    • Attention heads: 8
    • Parameters: ~28.5M
  • Final Loss: ~1.9
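The hyperparameters above map directly onto a `transformers` BertConfig. The sketch below rebuilds an equivalent BERT-Small architecture from scratch; note that the intermediate (feed-forward) size and vocabulary size are not stated in this card, so the values used here are assumptions, not the checkpoint's actual settings.

```python
from transformers import BertConfig, BertForMaskedLM

# Rebuild a BERT-Small architecture matching the card above.
# intermediate_size is assumed (4x hidden, the standard BERT ratio);
# vocab_size is left at the library default and is also an assumption.
config = BertConfig(
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,  # assumed, not listed in the card
)
model = BertForMaskedLM(config)

# Count trainable parameters (will differ from the released checkpoint
# if the real vocabulary size differs from the default).
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```

The resulting count depends mostly on the vocabulary size, which is why it may not match the ~28.5M parameters of the published checkpoint exactly.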

Intended Use

Bertha is intended to be used as a base model for downstream NLP tasks in Portuguese, such as:

  • Named Entity Recognition (NER)
  • Text Classification
  • Sentiment Analysis
  • Information Extraction
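As a sketch of the downstream workflow, the snippet below loads Bertha with a freshly initialized classification head, e.g. for binary sentiment analysis. The label count here is illustrative: the head is untrained and must still be fine-tuned on labeled Portuguese data before its predictions mean anything.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "Le0ssa/bertha-portuguese-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Attach a randomly initialized 2-class head on top of the Bertha encoder.
# num_labels=2 is an assumption (e.g. positive/negative sentiment).
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# "I loved the movie!" — logits come from the untrained head.
inputs = tokenizer("Adorei o filme!", return_tensors="pt")
logits = model(**inputs).logits  # shape: (1, 2)
```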

How to use

from transformers import pipeline

# Load the fill-mask pipeline with the pre-trained Bertha checkpoint.
mask_filler = pipeline("fill-mask", model="Le0ssa/bertha-portuguese-small")

text = "O Brasil é um [MASK] maravilhoso."
results = mask_filler(text)

# Each result carries the predicted token, its score, and the filled sentence.
for result in results:
    print(f"{result['token_str']} ({result['score']:.3f})")
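Beyond fill-mask, the same checkpoint can serve as a sentence encoder for the downstream tasks listed above. A minimal sketch, assuming the 512-dimensional hidden size described in this card; mean pooling is one common (but not the only) way to collapse token states into a single sentence vector:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "Le0ssa/bertha-portuguese-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)  # encoder only, MLM head dropped

# "Good morning, how are you?"
inputs = tokenizer("Bom dia, tudo bem?", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token states into one 512-dim sentence embedding.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)
```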

Citation (BibTeX)

To cite this model in your research, please use the following BibTeX entry:

@misc{aramaki2026bertha,
  author = {Aramaki, Leonardo Lessa},
  title = {Bertha: A Portuguese BERT-Small Pre-trained from Scratch on Corpus Carolina},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/Le0ssa/bertha-portuguese-small}}
}