Tucano 2 Cool: Better Open Source LLMs for Portuguese
Abstract
Tucano 2 is an open-source suite of Portuguese language models with varied parameter counts, enhanced datasets, and comprehensive evaluation methods for improved language understanding and generation.
We present Tucano 2, a fully open suite of large language models (LLMs) with 0.5-3.7 billion parameters, designed to address gaps in open-source development for Portuguese LLMs. Building on our previous work, we extend our dataset, GigaVerbo-v2, to a new level of quality and scale, while also introducing a new synthetic dataset, GigaVerbo-v2 Synth, aimed at filling gaps in GigaVerbo-v2, and two post-training datasets, GigaVerbo-v2 SFT and GigaVerbo-v2 Preferences, that allow Portuguese LLMs to be trained in domains like retrieval-augmented generation, coding, tool use, chain-of-thought reasoning, and other domains of interest. Through extensive ablation studies, we design both pretraining and continual pretraining recipes for the Tucano 2 suite (Base, Instruct, and Think), which achieve state-of-the-art performance on several Portuguese language-modeling benchmarks. We also extend and refine the evaluation harness introduced in our earlier work, yielding a comprehensive evaluation suite that provides strong signals across different pretraining, continual pretraining, and post-training regimes. All artifacts associated with Tucano 2 are openly released, including training recipes, logs, and source code, ensuring that our work is reproducible, accessible, and extendable by the broader Portuguese NLP community.
Community
First, a quick history lesson
The term Free Software was coined by Richard Stallman in the 1980s. He founded the Free Software Foundation (FSF) and the GNU Project (https://www.gnu.org). For Stallman, "free" meant liberty, not price. Free software grants four essential freedoms:
- Run the program for any purpose
- Study and change it (requires source code)
- Redistribute copies
- Distribute modified versions
Later, in 1998, the term Open Source was coined by Christine Peterson and promoted by Eric Raymond and Bruce Perens (who founded the Open Source Initiative). The goal was to pitch the same software freedoms to businesses using a less “political” term.
Here’s the catch: “Open Source” is a brand. The OSI maintains the Open Source Definition, which is essentially the same as the Debian Free Software Guidelines—and it still requires no discrimination against fields of endeavor (i.e., commercial use allowed) and no restrictions on redistribution.
So whether you call it Free Software or Open Source, the licenses must meet those criteria.
What Tucano 2 does
The article says:
“Tucano 2 is a fully open suite of large language models”
But the training data they used (GigaVerbo) includes datasets under restrictive licenses, specifically:
- Corpus Carolina – CC BY-NC-SA (non-commercial)
- Roots Ted Talks – CC BY-NC-ND (non-commercial + no derivatives)
- Instruct-PTBR – LLAMA 2 Community License (restrictive commercial terms)
- Bactrian-X – CC BY-NC
- BrWaC – License unknown
If the model was trained on non-commercially licensed data, then you cannot freely use that model commercially without violating the original dataset licenses. That means the model itself is not open source—because one of the core freedoms (commercial use) is effectively blocked.
Why this is misleading
It took me just three seconds to see that this does not fall in the category of freedom as in free software, or so-called "open source".
You look at the data licenses, and within seconds you see:
- NC = Non-Commercial = not open source
- ND = No Derivatives = not open source
- LLAMA 2 license = not open source
- Unknown license = not open source
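The check described above can be sketched in a few lines. This is a minimal illustration, not an official tool: the dataset names come from the list in this comment, but the SPDX-style license identifiers and the `is_osd_compatible` helper are hypothetical, and a real audit would need the actual license texts.

```python
# Hypothetical sketch: flag dataset licenses that fail the Open Source
# Definition's requirements (commercial use and redistribution allowed).
# License identifiers below are assumed SPDX-style strings, not verified.

OSD_COMPATIBLE = {"CC0-1.0", "CC-BY-4.0", "Apache-2.0", "MIT"}

def is_osd_compatible(license_id):
    """Return True only for licenses known to permit commercial use and
    redistribution; unknown licenses fail closed."""
    if license_id is None:
        return False  # unknown license: cannot assume any freedom
    if "-NC" in license_id or "-ND" in license_id:
        return False  # non-commercial / no-derivatives clauses
    return license_id in OSD_COMPATIBLE

# Subset of GigaVerbo components discussed above (licenses as claimed here)
datasets = {
    "Corpus Carolina": "CC-BY-NC-SA-4.0",
    "Roots Ted Talks": "CC-BY-NC-ND-4.0",
    "Instruct-PTBR": "LLAMA-2-Community",
    "BrWaC": None,
}

flagged = [name for name, lic in datasets.items()
           if not is_osd_compatible(lic)]
print(flagged)  # every component in this subset gets flagged
```

The point of failing closed on unknown licenses is the same one made above: without an affirmative grant of the four freedoms, you have no basis to assume commercial use is permitted.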
Yet the authors confidently say “fully open.”
This is not just a terminology nitpick. It has real legal and ethical implications:
Legal risk: Someone who downloads the model thinking it’s “open source” might build a commercial product, then get sued by the original dataset rights holders (e.g., the creators of Corpus Carolina or Roots Ted Talks).
False advertising: If you’re an organization with compliance requirements, you rely on accurate licensing claims. “Fully open” is a material representation.
Dilutes the meaning: When researchers misuse “open source,” it makes it harder for everyone to trust the term. The OSI has been fighting this for years—see their “Open Source AI Definition” efforts.
What the authors should have said
A more honest framing would be:
“Tucano 2 is a suite of models trained on a mix of permissive, non-commercial, and custom-licensed data. While the models themselves are released under [whatever license they chose], downstream commercial use may be restricted due to the original dataset licenses. Use at your own risk.”
Or they could have curated a truly open dataset using only CC0, CC BY, Apache 2.0, MIT—like the free software community has been urging for years.
Bottom line
Calling something “open source” when it’s built on non-commercial or restrictive data is like baking a cake with stolen eggs and calling it “homemade.” The output might be great, but the ingredients tell the real story.
Stallman and the FSF started this movement so that freedom was the point, not a marketing buzzword. The OSI carries that torch today. If we don’t hold the line on what “open source” actually means, the term becomes meaningless—and users pay the price in legal risk and lost freedom.
So yeah. Three seconds. One license list. Not open source.