Large Language Model with European Perspective

Multilingual and Open Source: OpenGPT-X research project releases large language model

  • November 27, 2024

The large language model of the OpenGPT-X research project is now available for download on Hugging Face: "Teuken-7B" has been trained from scratch in all 24 official languages of the European Union (EU) and contains seven billion parameters. Researchers and companies can leverage this commercially usable open-source model for their own artificial intelligence (AI) applications. 

Funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK), the OpenGPT-X consortium – led by the Fraunhofer Institutes for Intelligent Analysis and Information Systems IAIS and for Integrated Circuits IIS – has developed a large language model that is open source and has a distinctly European perspective.

“In the OpenGPT-X project, we've spent the last two years researching the underlying technologies for large AI foundation models and training models with leading industry and research partners. We are delighted to be able to make our 'Teuken-7B' model freely available, providing a public, research-based alternative for use in academia and industry,” says Prof. Stefan Wrobel, Director of Fraunhofer IAIS. “Our model has demonstrated its capabilities across a wide range of languages, and we hope that as many people as possible will adapt and develop the model for their own work and applications. In this way, we want to contribute, both within the scientific community and together with companies from different industries, to the growing demand for transparent and customizable generative AI solutions.”

Multilingual from scratch

Teuken-7B is currently one of the few large language models developed multilingually from the ground up. It contains approximately 50 percent non-English pre-training data and has been trained in all 24 official languages of the EU. Its performance has proven stable and reliable across multiple languages. This provides added value, particularly for international companies and organizations with multilingual communication requirements, products and services. The open-source model allows companies and organizations to run their own customized models in real-world applications. Sensitive corporate data can remain within the company.

In addition to model training, the OpenGPT-X team also addressed a number of research questions, such as how to train and operate multilingual AI language models in a more energy- and cost-efficient way. To this end, the project developed a multilingual “tokenizer”. The task of a tokenizer is to break down words into individual word components – the fewer tokens a text requires, the more (energy-)efficiently and quickly a language model can generate the answer. The developed tokenizer reduces training costs compared to other multilingual tokenizers such as those of Llama 3 or Mistral. This is particularly valuable for European languages with longer word structures such as German, Finnish or Hungarian.
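The link between token count and cost can be illustrated with a toy measurement of tokens per word (often called "fertility"). The following is a minimal sketch, not the OpenGPT-X tokenizer: `chunk_tokenize` is a hypothetical stand-in that simply cuts words into fixed-size pieces, and the sample word is chosen only because it is a long German compound.

```python
def chunk_tokenize(text: str, piece_len: int) -> list[str]:
    """Hypothetical toy tokenizer: cut each whitespace-separated word
    into pieces of at most `piece_len` characters."""
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + piece_len] for i in range(0, len(word), piece_len))
    return tokens


def fertility(text: str, piece_len: int) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return len(chunk_tokenize(text, piece_len)) / len(words)


# A long German compound: a tokenizer with longer pieces needs fewer tokens.
sample = "Donaudampfschifffahrtsgesellschaft"
coarse = fertility(sample, piece_len=8)  # longer pieces, fewer tokens
fine = fertility(sample, piece_len=3)    # shorter pieces, more tokens
print(f"tokens per word: coarse={coarse:.1f}, fine={fine:.1f}")
```

Under the assumption that processing cost grows with the number of tokens, the tokenizer with the lower tokens-per-word value is the cheaper one for the same text.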

The OpenGPT-X project was funded by the BMWK program "Innovative and practical applications and data spaces in the Gaia-X digital ecosystem". Teuken-7B is accessible via the Gaia-X infrastructure. Actors in the Gaia-X ecosystem can thus develop innovative language applications and transfer them into concrete application scenarios in their respective domains. Unlike existing cloud solutions, Gaia-X is a federated ecosystem that allows service providers and data owners to connect. Data remains securely with its owners and is only shared under defined conditions.

“I am excited to witness today’s publication of Teuken-7B, a large language model based on Gaia-X, and would like to congratulate the OpenGPT-X project on having reached this important milestone. A special feature of Teuken-7B is that it enables the secure use of sensitive corporate data, as the Gaia-X standards guarantee data storage and processing in accordance with the strictest European data protection and security regulations. This new model and innovations like this strengthen the digital sovereignty, competitiveness and resilience of Germany and of Europe. This is why the Federal Ministry for Economic Affairs and Climate Action is funding the project with approximately 14 million euros in total,” says Dr. Franziska Brantner, Parliamentary State Secretary at BMWK.

Prof. Bernhard Grill, Director of Fraunhofer IIS, emphasizes the model’s potential for safety-critical applications: “With this independently developed language model, the project partners demonstrate their ability to generate their own large models. Access to a large language model enables applications that offer much greater control over this technology without the need for opaque third-party components – for example, in safety-critical fields such as automotive, robotics, medicine and finance. By training on data relevant to a specific application and using application-specific architectures, companies can create customized AI solutions that do not require ‘black box’ components.”

Generative AI by a strong consortium – with a European perspective

Important research results from the OpenGPT-X project have been incorporated into the model development, such as tools and technologies for processing large amounts of data, leveraging powerful European HPC infrastructure and performing efficient model training. Teuken-7B was trained on the JUWELS supercomputer at Forschungszentrum Jülich. In addition to the two Fraunhofer Institutes and Forschungszentrum Jülich, the consortium’s partners include TU Dresden, the German Research Center for Artificial Intelligence (DFKI), IONOS, Aleph Alpha, ControlExpert, Westdeutscher Rundfunk (WDR) and the German AI Association (KI Bundesverband). The technology developed in OpenGPT-X will also provide the partners with a basis for training their own models in the future.

“OpenGPT-X is an example of how the resources of a publicly funded project and the collaborative efforts of a broad consortium can deliver valuable foundational technology – from underlying infrastructure to model training to productive applications. In the interest of technology and data sovereignty, it is important to build on this foundation: Our hope is that OpenGPT-X will lay the groundwork for many subsequent activities,” emphasizes Daniel Abbou, Managing Director of the German AI Association and President of the European AI Forum.

The research project, which was launched at the beginning of 2022, is now nearing completion. It will run until 31 March 2025 so that further optimizations and evaluations of the models can take place.

The path to using Teuken-7B

Interested developers from academia or industry can download Teuken-7B free of charge from Hugging Face and work with it in their own development environment. The model has already been optimized for chat through “instruction tuning”. Instruction tuning is used to adapt large language models so that the model correctly understands instructions from users, which is important when using the models in practice – for example in a chat application.

Teuken-7B is freely available in two versions: one for research-only purposes and an “Apache 2.0” licensed version that can be used by companies for both research and commercial purposes and integrated into their own AI applications. The performance of the two models is roughly comparable, but some of the datasets used for instruction tuning preclude commercial use and were therefore not used in the Apache 2.0 version.

Download options and model cards can be found at the following link: https://huggingface.co/openGPT-X
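As a rough illustration of working with the downloaded model in a local environment, the sketch below uses the Hugging Face `transformers` library. Several details are assumptions that should be checked against the model card: the repository ID (derived from the model name and the organization page above), the role name "User", the chat template name "EN", and the need for `trust_remote_code=True`. Running it downloads a seven-billion-parameter model and requires `torch` and `transformers` to be installed.

```python
def build_messages(user_text: str) -> list[dict]:
    """Wrap a single user turn in the message list expected by chat templates."""
    # Assumption: the role name "User" follows the model card; verify it there.
    return [{"role": "User", "content": user_text}]


def generate_reply(
    user_text: str,
    repo_id: str = "openGPT-X/Teuken-7B-instruct-research-v0.4",  # assumed repo ID
    template_name: str = "EN",  # assumed name of the language-specific chat template
    max_new_tokens: int = 128,
) -> str:
    """Load the instruction-tuned model and generate one chat reply."""
    # Imported lazily so build_messages() works without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        repo_id, use_fast=False, trust_remote_code=True
    )
    model = AutoModelForCausalLM.from_pretrained(
        repo_id, trust_remote_code=True, torch_dtype=torch.bfloat16
    )
    model.eval()

    # Turn the chat messages into model input IDs via the chat template.
    inputs = tokenizer.apply_chat_template(
        build_messages(user_text),
        chat_template=template_name,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    )
    input_ids = inputs["input_ids"]
    with torch.no_grad():
        output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)

    # Decode only the newly generated tokens, not the prompt.
    new_tokens = output_ids[0][input_ids.shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

A call such as `generate_reply("Which languages do you support?")` would then return the model's answer as a string. For commercial use, the Apache 2.0 licensed variant listed on the organization page would be substituted for `repo_id`.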

The OpenGPT-X Discord Server is available to the specialist community for technical feedback, questions and specialist discussions: https://discord.gg/RvdHpGMvB3

Companies also have the opportunity to take part in free demo sessions in which Fraunhofer scientists explain which applications can be realized with Teuken-7B. Registration for demo appointments is possible via www.iais.fraunhofer.de/opengpt-x-en

Detailed technical background information and benchmarks as well as an overview of all research results from the OpenGPT-X project can be found on the project website at https://opengpt-x.de/en/models/teuken-7b
 

Description Pic1: Shown here is the language distribution of Teuken-7B-v0.4. Apart from code, Teuken-7B-v0.4 contains approximately 50% non-English text from 23 European countries and only around 40% English pretraining data (for comparison, Meta-Llama-3.1-8B was trained on 8% non-English data). Teuken-7B-v0.4 thus differs from most multilingual models available to date, which were only extended with multilingual data during continued pretraining or fine-tuning.

Description Pic2: The bar chart shows the performance of Teuken-7B-instruct-research-v0.4 on the multilingual benchmarks ARC, HellaSwag and TruthfulQA in comparison to similarly sized open-source models. The bars indicate the respective task performance averaged over 21 languages, as well as the model performance averaged across ARC, HellaSwag and TruthfulQA. On the selected benchmarks, Teuken-7B-instruct-research-v0.4 is ahead of all other models on average. On the individual benchmarks ARC and HellaSwag, Teuken is in second place behind Salamandra-7b-instruct, and on TruthfulQA it is in second place behind Mistral-7B-instruct-v0.3.
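The two-level averaging described for the bar chart (first over languages within each benchmark, then across benchmarks) can be sketched as follows. The numbers are placeholders for illustration only and are not results from the project; the function and variable names are hypothetical.

```python
from statistics import mean


def benchmark_averages(scores: dict[str, dict[str, float]]) -> tuple[dict[str, float], float]:
    """Average each benchmark over its languages, then average the benchmarks."""
    per_task = {task: mean(by_lang.values()) for task, by_lang in scores.items()}
    overall = mean(per_task.values())
    return per_task, overall


# Placeholder scores (two languages instead of 21), NOT measured results.
example = {
    "ARC": {"de": 0.50, "fr": 0.60},
    "HellaSwag": {"de": 0.70, "fr": 0.50},
    "TruthfulQA": {"de": 0.40, "fr": 0.60},
}
per_task, overall = benchmark_averages(example)
print(per_task, overall)
```

Averaging per benchmark first gives each benchmark equal weight in the overall score, which is how a model can lead on average without being first on any single benchmark.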

Description Pic3: The diagram shows the additional computing power required to process non-English text with the tokenizer belonging to a given language model (in % compared to Llama 3). Teuken models require the least additional computing power and thus generate the lowest costs for these multilingual tasks.
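One plausible way to read the percentage in the third diagram is as the relative difference in token counts for the same text, under the assumption that computing effort scales with the number of tokens. The sketch below follows that assumption; the exact method used for the diagram is not stated here, and the token counts in the usage line are made up.

```python
def extra_compute_percent(tokens_model: int, tokens_baseline: int) -> float:
    """Additional tokens (in %) a tokenizer needs relative to a baseline tokenizer.

    Assumes compute cost is proportional to token count. A negative value
    means the tokenizer needs fewer tokens than the baseline.
    """
    if tokens_baseline <= 0:
        raise ValueError("tokens_baseline must be positive")
    return (tokens_model - tokens_baseline) / tokens_baseline * 100.0


# Hypothetical counts for the same non-English text under two tokenizers.
print(extra_compute_percent(tokens_model=120, tokens_baseline=100))
```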