Revolutionizing European AI: The Impact of High-Performance Language Models
The High-Performance Language Technologies (HPLT) project is developing very large-scale multilingual resources for large language models and machine translation.
Massive text collections for pre-training are the ‘crude oil’ of the large language model (LLM) era. ‘Refining’ high-quality datasets from web data at scale presupposes computational infrastructure and technological muscle more often found in corporate environments, as evidenced by some notable generally available pre-training datasets: C4,¹ FineWeb 1 & 2,²,³ MADLAD-400,⁴ or Nemotron-CC.⁵ With a few notable exceptions, this line of work tends to concentrate on the English language.
Here, we present the open-source results⁶,⁹,¹⁰ of the European R&D consortium HPLT – a project funded under the auspices of the Horizon Europe programme in 2022–2025. Together with a myriad of additional results, HPLT has produced massive pre-training datasets of high-quality texts in close to 200 distinct language–script combinations. Its 2025 monolingual data release, HPLT 3.0, comprises some 30 trillion sub-word tokens in total, of which close to half represent languages other than English. We make this resource publicly available under the most permissive terms of use possible. We further share a state-of-the-art and open-source data preparation pipeline, an innovative multilingual evaluation framework, as well as hundreds of language models pre-trained on HPLT data.
Furthermore, the project has produced novel bilingual datasets for more than 50 language pairs, hundreds of associated machine translation models, and open-source pipelines for data preparation, model training, and evaluation, and has synthesised additional pre-training data for underrepresented languages by machine-translating very high-quality English documents. In our view, it is the totality of these generally available, very large-scale resources and the documentation of the underlying processes that holds the promise of ‘democratising’ the current LLM and MT landscape.
Organisation
The HPLT consortium brought together partners from across Europe: five universities (Charles University in Prague and the Universities of Edinburgh, Helsinki, Oslo, and Turku), two national HPC centres (CESNET in the Czech Republic and Sigma2 in Norway), and a language engineering company (Prompsit). The project received about €4.1m from the Horizon Europe programme and £960,000 from UK Research and Innovation, and ran from September 2022 through December 2025. It was coordinated by Jan Hajič (Charles University), with technical coordination by Kenneth Heafield (Edinburgh) and Stephan Oepen (Oslo) in its first and second halves, respectively.
Data curation
HPLT has gathered and processed more than ten petabytes of raw web data. From this, the project has released more than 30 trillion tokens (word-like units) of high-quality textual data, accompanied by rich metadata, for close to 200 distinct languages. The process of extracting, cleaning, annotating, and filtering texts from raw web archives, composed of about a dozen modules, is schematically depicted in Fig. 1.
Raw web archives were drawn from three sources: the Internet Archive (IA), host of the iconic Wayback Machine; the non-profit Common Crawl Foundation (CC); and the ArchiveBot volunteer infrastructure for long-term web archiving. Sub-tasks such as the extraction of ‘running text’ from marked-up document formats, language identification at the document and paragraph levels, ‘fuzzy’ near-deduplication, annotation with a wealth of text quality and regulatory compliance signals, and final filtering based on all available information each directly impact the practical utility of the final datasets. Here, text quality and overall volume are separate and typically antithetical dimensions for optimisation, creating a rich space of design choices and trade-offs; this remains an active area of research. The open-source HPLT processing pipelines are highly flexible and parameterisable, with default values representing the current state of knowledge.
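To make the overall shape of this processing concrete, the minimal Python sketch below chains simplified stand-ins for the main stages (extraction, language identification, near-deduplication, filtering). All function names, thresholds, and the record format are hypothetical illustrations, not the actual HPLT pipeline code.

```python
# Illustrative sketch of a web-text curation pipeline. The helper
# functions, thresholds, and record format are hypothetical stand-ins,
# not the actual HPLT implementation.
import hashlib


def extract_text(html: str) -> str:
    """Stand-in for 'running text' extraction from marked-up documents."""
    return html  # a real pipeline would use a dedicated extractor


def identify_language(text: str) -> tuple:
    """Stand-in for document-level language identification."""
    return "en", 0.99  # placeholder label and confidence


def near_duplicate(text: str, seen: set) -> bool:
    """Crude stand-in for 'fuzzy' near-deduplication: hash a normalised
    prefix of the document and reject previously seen signatures."""
    signature = hashlib.sha1(" ".join(text.split()[:50]).encode()).hexdigest()
    if signature in seen:
        return True
    seen.add(signature)
    return False


def curate(raw_documents, min_confidence=0.8, min_tokens=50):
    """Yield cleaned, annotated documents that pass all filters."""
    seen = set()
    for html in raw_documents:
        text = extract_text(html)
        lang, confidence = identify_language(text)
        if confidence < min_confidence:
            continue  # uncertain language identification
        if len(text.split()) < min_tokens:
            continue  # too short to be useful pre-training text
        if near_duplicate(text, seen):
            continue  # near-duplicate of an earlier document
        yield {"text": text, "lang": lang, "lang_confidence": confidence}
```

In practice, each of these stages is itself a substantial component with many tunable parameters, which is exactly why the trade-off between text quality and overall volume leaves so much room for design choices.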
Monolingual statistics
To put the HPLT monolingual data into perspective, Table 1 (below) presents document and token counts (see note) for the English and multilingual (non-English) partitions of the data, as well as counts for a small sample of individual languages. For ease of comparison, these statistics are accompanied by average document lengths and per-language proportions, and contrasted with corresponding figures for three other publicly available multilingual datasets mentioned above.

As is evident from these numbers, HPLT 3.0 is by far the largest publicly available dataset of its kind, and its multilingual breadth compares favourably to other widely used resources. In Gemma-3 tokens, the multilingual HPLT 3.0 partition is about two and three times larger than FineWeb and the earlier HPLT 2.0, respectively, and five times larger than the older MADLAD-400 dataset. In terms of average document length, which is often correlated with text quality, HPLT 3.0 and 2.0 pattern alike, markedly ahead of FineWeb but well behind MADLAD-400. For a small selection of European languages, the table shows figures ranging from a ‘mere’ billion available tokens to hundreds of billions.
In-depth analytics
Training data quality is arguably the most important factor in model quality, but in-depth data inspection at scale is a challenging endeavour. HPLT has developed an open-source tool, HPLT Analytics, to compute a broad range of fine-grained statistics and enable interactive visualisation and exploration. The datasets are internally structured into documents, paragraph-like segments, and tokens. Descriptive frequency and length statistics, combined with basic correlation analysis against metadata such as internet domains or predicted text register labels, can reveal distributional trends or outliers. Annotations are predominantly available at the document level, but in some cases also for smaller units.
Contrasting the distributions of document versus segment language predictions allows for insights into degrees of in-document ‘code switching’ and uncertainty in language identification, particularly among closely related languages.
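As a concrete illustration, a minimal sketch of such an agreement analysis might look as follows; the record layout here is an assumption for illustration only, not the exact HPLT Analytics schema.

```python
# Sketch: quantifying in-document 'code switching' by comparing document-
# and segment-level language predictions. The record layout is an
# illustrative assumption, not the exact HPLT Analytics schema.
from collections import Counter


def code_switching_rate(records):
    """Return the fraction of segments whose predicted language disagrees
    with the document-level prediction, plus the most common mismatches."""
    total = agreeing = 0
    mismatches = Counter()
    for record in records:
        doc_lang = record["doc_lang"]
        for seg_lang in record["seg_langs"]:
            total += 1
            if seg_lang == doc_lang:
                agreeing += 1
            else:
                mismatches[(doc_lang, seg_lang)] += 1
    rate = 1 - agreeing / total if total else 0.0
    return rate, mismatches.most_common(5)


# Closely related languages (here Norwegian Bokmål, Nynorsk, and Danish)
# are where language identifiers disagree most often.
records = [
    {"doc_lang": "nno", "seg_langs": ["nno", "nno", "nob", "nno"]},
    {"doc_lang": "nob", "seg_langs": ["nob", "dan", "nob"]},
]
rate, top = code_switching_rate(records)
print(f"segment/document disagreement: {rate:.1%}; top mismatches: {top}")
```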
The project has developed a framework for automated large-scale multilingual evaluation, called HPLT-e, to gauge data quality and inform design choices in training data preparation. The framework includes 127 language understanding and generation tasks across nine European languages, selected for native-speaker availability in the project team and for diversity in language resources, families, and scripts. The tasks in HPLT-e are drawn from existing benchmark suites, with an emphasis on natively constructed tasks and human-written prompts to address prompt sensitivity challenges. Models trained on the different datasets show performance improvements over time, with models pre-trained on MADLAD-400 achieving the highest multilingual score.
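The short sketch below shows one way such per-task results could be macro-averaged into a single multilingual score; the aggregation scheme, task names, and numbers are illustrative assumptions, not HPLT-e's published methodology.

```python
# Sketch: macro-averaging task accuracies into an overall multilingual
# score. The aggregation scheme is an illustrative assumption, not
# necessarily HPLT-e's exact scoring.
from statistics import mean


def multilingual_score(results):
    """results maps language -> {task_name: accuracy}. Averaging within
    each language first, then across languages, keeps languages with
    many tasks from dominating the overall score."""
    per_language = {lang: mean(tasks.values()) for lang, tasks in results.items()}
    return mean(per_language.values())


results = {
    "fi": {"qa": 0.61, "summarisation": 0.48},
    "nb": {"qa": 0.58, "summarisation": 0.52, "nli": 0.66},
}
print(f"multilingual score: {multilingual_score(results):.3f}")
```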
In addition to training data creation, the project has developed a variety of language models supporting different languages and language groups, including monolingual encoder-only, encoder–decoder, and decoder-only models, as well as large generative models for Finnish and Norwegian. The project has also mined bilingual text for machine translation, creating parallel text corpora for 57 language pairs and providing 2.7 million sentence alignments in total.
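The sketch below illustrates margin-based bitext mining over sentence embeddings, in the spirit of widely used mining approaches; the scoring rule, threshold, and function names are illustrative assumptions rather than HPLT's actual alignment pipeline.

```python
# Sketch of embedding-based bitext mining with margin scoring. The
# scoring rule and threshold are illustrative assumptions, not HPLT's
# exact alignment method.
import numpy as np


def mine_pairs(src_emb, tgt_emb, k=4, threshold=1.05):
    """Align source and target sentences whose cosine similarity stands
    out against the average similarity of their k nearest neighbours."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T  # cosine similarity matrix

    # Average similarity to the k nearest neighbours on each side.
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)

    pairs = []
    for i in range(sim.shape[0]):
        j = int(sim[i].argmax())
        # Ratio margin: reward pairs that are much more similar to each
        # other than to their respective neighbourhoods.
        margin = sim[i, j] / ((knn_src[i] + knn_tgt[j]) / 2)
        if margin >= threshold:
            pairs.append((i, j, float(margin)))
    return pairs
```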
The project has also focused on developing and evaluating new translation models for 100 language pairs, with an emphasis on compact models that can run locally on edge devices. These models, trained on HPLT data, show competitive performance, especially for lesser-resourced languages. To further reduce computational expense, a pipeline was developed for systematic multilingual knowledge distillation, easing the transition from costly teacher models to compact student models that can be as small as 20 megabytes.
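At its core, one common form of knowledge distillation trains the student to match the teacher's softened output distribution. The PyTorch sketch below shows this standard loss; the temperature, mixing weight, and overall setup are illustrative assumptions, not the project's actual distillation recipe.

```python
# Minimal sketch of response-based knowledge distillation; hyperparameters
# and setup are illustrative, not HPLT's actual pipeline.
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=2.0, alpha=0.5):
    """Blend soft-label KL divergence against the teacher with ordinary
    cross-entropy against the gold targets."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The KL term is scaled by T^2 so its gradients stay comparable in
    # magnitude to the hard cross-entropy term.
    kl = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kl + (1 - alpha) * ce
```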
The work in HPLT has been highly intensive in terms of both computation and storage. This was made possible through a combination of resources provided by the project grant and substantial additional allocations to consortium members from national quotas in the Czech Republic, Finland, and Norway, as well as through the EuroHPC system. A total of close to 21 petabytes of ‘bulk’ storage for large-scale web data was distributed across facilities in the Czech Republic (CESNET), Norway (Sigma2), and Finland (LUMI). Exclusive access to dedicated compute nodes integrated with the storage systems enabled a preliminary stage of lightweight document and metadata extraction, reducing the data volume for subsequent processing by approximately a factor of three.
The EuroHPC LUMI system served as the primary computational resource for HPLT, with the consortium utilising combined allocations of about 60 million CPU hours and 11.5 million GPU hours over the 40-month project duration. This is equivalent to keeping more than 2,000 CPUs active at all times, on average.
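A quick back-of-the-envelope check confirms that figure, assuming an average month of roughly 730 wall-clock hours:

```python
# Sanity check of the 'more than 2,000 CPUs at all times' figure.
cpu_hours = 60_000_000               # total CPU-hour allocation
wall_clock_hours = 40 * 730          # 40 months at ~730 hours per month
print(cpu_hours / wall_clock_hours)  # ~2,055 CPUs busy around the clock
```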
References:
1. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
2. Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin A. Raffel, Leandro Von Werra, Thomas Wolf, et al. 2024. The FineWeb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37:30811–30849.
3. Colin Raffel, Martin Jaggi, Leandro Von Werra, and Thomas Wolf. 2025. FineWeb2: One pipeline to scale them all – adapting pre-training data processing to every language.