July 23, 2025
New dataset and models boost Portuguese language AI performance to match English
Gaby Clark, scientific editor
Andrew Zinin, lead editor

Large language models such as ChatGPT perform significantly worse in Portuguese than in English, even though both languages are spoken worldwide. This gap has now been closed with "GigaVerbo." A team led by Dr. Nicholas Kluge Corrêa from the Center for Science and Thought at the University of Bonn presents the project in the journal Patterns. The researchers were among the first to use the new "Marvin" supercomputer at the University of Bonn. Nicholas Kluge Corrêa and his colleague Aniket Sen are both members of the Transdisciplinary Research Area "Sustainable Futures" at the University of Bonn.
GigaVerbo is the name of the dataset developed by the researchers. The project "Tucano: Advancing Neural Text Generation for Portuguese" aims to bridge the resource gap in Portuguese natural language processing (NLP) by providing high-quality datasets and cutting-edge language models specifically designed for the Portuguese language.
By releasing the GigaVerbo corpus, which comprises 200 billion deduplicated tokens, together with the Tucano family of models, the researchers aim to foster progress in neural text generation in an open and reproducible manner and to promote equitable access.
The researchers collected several Portuguese corpora from different sources to ensure high linguistic diversity and quality. These corpora were then deduplicated and filtered to form the GigaVerbo dataset. Using this dataset, they trained several decoder models on the Marvin supercomputer, with rigorous cycles of evaluation and optimization.
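To make the deduplicate-and-filter step concrete, here is a minimal Python sketch. The article does not describe the team's actual pipeline, so everything in it is an assumption for illustration: the function names, the exact-match hashing approach, and the `min_words` length threshold are hypothetical and are not taken from the GigaVerbo project.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Normalize Unicode, lowercase, and collapse whitespace so trivial variants compare equal."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.lower().split())


def deduplicate_and_filter(documents, min_words=5):
    """Yield documents that pass a simple length filter and were not seen before.

    Both the exact-hash deduplication and the word-count filter are
    illustrative assumptions, not the method used for GigaVerbo.
    """
    seen = set()
    for doc in documents:
        canonical = normalize(doc)
        # Filter: drop very short fragments such as navigation text.
        if len(canonical.split()) < min_words:
            continue
        # Deduplicate: skip documents whose normalized text was already seen.
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield doc


corpus = [
    "O tucano é uma ave nativa das florestas tropicais.",
    "O  tucano é uma ave nativa das florestas tropicais.",  # duplicate with extra space
    "Clique aqui",  # too short, filtered out
    "A língua portuguesa é falada em vários continentes.",
]
cleaned = list(deduplicate_and_filter(corpus))
print(len(cleaned))
```

Hashing normalized text only catches exact duplicates. Corpora at the scale of hundreds of billions of tokens typically also need near-duplicate detection and stronger quality filters, which this sketch leaves out.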
The project addresses two major gaps. The first is the scarcity of comprehensive open-source resources for Portuguese, a language often overshadowed by resource-rich languages like English. The second is the lack of open-source LLM development, which impedes the scientific reproducibility of these models.
The researchers are now working to scale up their Portuguese work by improving the dataset and training larger models. They are also developing resources for other low-resource languages, such as Bengali and Hindi, all thanks to Marvin and the University of Bonn.
More information: Nicholas Kluge Corrêa et al, Tucano: Advancing neural text generation for Portuguese, Patterns (2025). DOI: 10.1016/j.patter.2025.101325
Journal information: Patterns
Provided by University of Bonn
Citation: New dataset and models boost Portuguese language AI performance to match English (2025, July 23) retrieved 23 July 2025 from https://techxplore.com/news/2025-07-dataset-boost-portuguese-language-ai.html