How open-source tools are helping European Portuguese overcome digital exclusion

17/06/2025 INESC Brussels HUB

In Portugal, researchers have developed open-source tools to bridge a significant gap in natural language processing (NLP) resources for European Portuguese, which lags far behind Brazilian Portuguese in digital presence.

When we step into the Mirror World, we find a reflection of our experience outside of it: just like offline, the internet mainly “speaks” in a handful of languages. English, Spanish, and Mandarin together account for about a third of the global population – and this seamlessly translates to the virtual realm: these languages have become “high-resource” languages online, supported by large speaker bases, extensive datasets, and significant research investment.

However, this imbalance disadvantages most of the world’s population. Languages are vehicles of culture, identity and history. Without natural language processing (NLP) tools that help computers “speak” the way we do, less dominant languages risk marginalization in digital environments. European Portuguese is a clear example. Although Portuguese, a Romance language, is spoken by over 250 million people worldwide, making it the fourth most spoken native language, one variant clearly dominates.

Brazil accounts for more than 200 million of the world’s Portuguese speakers. Without AI models capable of understanding and generating European Portuguese, this variant loses ground, as it has limited resources and a relatively small NLP community. For instance, general-purpose large language models (LLMs) such as GPT-4 (which powers ChatGPT) currently produce Portuguese text heavily influenced by Brazilian terms and cultural references.

This is precisely what the PTICOLA project aims to prevent. Developed by the Portuguese Research and Development (R&D) Institute for Systems and Computer Engineering, Technology and Science (INESC TEC), the project’s primary goal is to expand and improve NLP resources for the Portuguese language, focusing on European Portuguese.

PTICOLA has already delivered two open-source tools: a variety identifier that can distinguish between European Portuguese (PT-PT) and Brazilian Portuguese (PT-BR) and a translation model from English to European Portuguese.

“One of the goals is to aid in identifying European Portuguese texts to improve the training process of European Portuguese LLMs,” explained Nuno Guimarães, co-PI of the project and researcher at INESC TEC. “Here’s the challenge: language models are trained on vast datasets from the internet and open repositories, yet most texts lack clear distinctions between European and Brazilian Portuguese. Without this differentiation, models struggle to capture the unique nuances of each variant.”

Using Google Cloud Platform (GCP) products, the team translated multiple English and Portuguese datasets to boost low-resource NLP tasks, including Temporal Information Extraction, Semantic Role Labeling, and Relation Extraction.

PTICOLA, funded by the Portuguese government through the Foundation for Science and Technology, also provided decision-making support in different scenarios by developing domain-specific tools, including a clinical case retrieval and ranking system and an English-Portuguese biomedical translator.

The Portuguese paradox

Language models trained in Brazilian Portuguese often fail to meet the needs of speakers in Portugal and other Lusophone countries. The differences go beyond spelling; AI-generated text can feel “foreign” to European Portuguese speakers.

This Portuguese paradox highlighted the need for an English-to-European Portuguese translator – something that PTICOLA tackled. Together with the identifier, this is a pioneering achievement: these tools not only improve machine translation by analyzing specific terms and providing confidence scores to flag possible errors but also play a strategic role, distinguishing Portuguese variants to enable better AI training.

The “Portuguese ChatGPT” advances

“We have to avoid being too reliant on companies to invest in a variety like European Portuguese”, said Nuno. Just like any other language, Portuguese is increasingly mediated by technology. And if this mediation continues to be dominated by a few tech companies, it may pose risks and impose limitations on communication, digital citizenship, and cultural autonomy.

Those were also some of the reasons invoked by the national government to launch the so-called “Portuguese ChatGPT”, Amália, an effort to develop an LLM in European Portuguese. Months before the announcement of Amália, the scientific community specializing in AI and Portuguese language technology warned about “unprecedented challenges arising from a civilizational transformation driven by a technological shock of unparalleled scale.”

They called on national authorities to design and implement a technological readiness plan for the Portuguese language in the AI era, with informed support from the scientific community. “For democratizing this technology, such a plan must foster the development and open access to open-source solutions for Portuguese language technology, respecting necessary regulations”, it read.

Nuno added: “If we neglect this, in an increasingly digital world, when we already tend to use English in our day to day, we risk losing the cultural aspect of the language.”

PTICOLA builds on a long-standing line of research at INESC TEC, involving doctoral and master’s students focused on creating resources for European Portuguese. For example, the CitiLink project, coordinated by the Portuguese institute, is developing AI algorithms based on NLP to interpret and summarize minutes from municipal meetings. This technology enables the identification of key events discussed, organizes them by departments, and highlights the positions taken by each municipal councilor, thereby making public information more accessible and transparent.
Regions: Europe, Portugal
Keywords: Applied science, Computing, Artificial Intelligence, Humanities, Linguistics