Teuken Model Card and multilingual benchmarks

Benchmarks are an important indicator of a model's general performance. However, the best way to find out whether Teuken is suitable for a specific use case in your company is to try it in the respective application. You are welcome to contact us, and we will support you.


»Teuken 7B« is available in two versions

»Teuken 7B-instruct-research-v0.4« can be used for research purposes, while »Teuken 7B-instruct-commercial-v0.4« is available to companies for commercial purposes under the Apache 2.0 license.

»Teuken 7B-instruct-v0.4« performs consistently well across a wide range of EU languages. To achieve this, the underlying base model was not trained primarily in English, but multilingually from scratch in all 24 official EU languages.

»Teuken 7B-instruct-commercial-v0.4« is roughly comparable to the research version, although the research version achieves slightly better results, in the range of one to two percent across benchmarks. The reason is that some of the datasets used for instruction tuning exclude commercial use and were therefore not used in the Apache 2.0 version.

Compare models: Our European LLM Leaderboard

For the first time, our European LLM Leaderboard makes it possible to compare the performance of LLMs across almost all EU languages, instead of relying only on English evaluation datasets as was previously the case. For this purpose, benchmark datasets such as HellaSwag, ARC and TruthfulQA were translated into a total of 21 languages using high-quality machine translation.

What benchmarks have we used to compare our models? And what do they mean?

The HellaSwag dataset poses multiple-choice sentence-completion questions with deliberately confusable answer options in order to assess a model's everyday knowledge and narrative coherence.

The ARC dataset poses multiple-choice questions to assess an AI model's capabilities across different types of knowledge and reasoning.

The TruthfulQA dataset measures the truthfulness of answers generated by language models and their ability to distinguish between true and false information.

GSM8K is a benchmark of 8,000 math word problems designed to evaluate a language model’s ability to handle elementary-level mathematical reasoning and problem-solving.

MMLU is a broad benchmark testing models across 50+ subjects, assessing their knowledge in areas from humanities to sciences at various academic levels.
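
To make the format of these benchmark items concrete, the following minimal sketch loads the public English HellaSwag validation split with the Hugging Face `datasets` library and prints one multiple-choice item. The dataset ID and field names are those of the English original, not of the translated 21-language variants used for the leaderboard.

```python
# Minimal sketch: inspect one HellaSwag multiple-choice item.
# Uses the public English dataset, not the translated leaderboard variants.
from datasets import load_dataset

hellaswag = load_dataset("hellaswag", split="validation")
example = hellaswag[0]

print("Context:", example["ctx"])               # sentence to be completed
for i, ending in enumerate(example["endings"]):
    print(f"  Option {i}: {ending}")            # four candidate completions
print("Correct option:", example["label"])      # index of the correct completion
```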

In the following, we present »Teuken 7B-instruct-research-v0.4« in detail.

Model Card

Technical data regarding the model and its utilization.

Leaderboard

All evaluation results based on our European LLM Leaderboard.


Download

Developers can download »Teuken 7B« free of charge under the »Apache 2.0 license« (or under a research license) on Hugging Face.
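
As a quick start, the following is a minimal sketch of loading the research version with the `transformers` library and generating a short completion. The repository ID `openGPT-X/Teuken-7B-instruct-research-v0.4`, the `trust_remote_code=True` flag and the plain-text prompt are assumptions; check the model card on Hugging Face for the exact chat template and tokenizer settings.

```python
# Minimal sketch: load Teuken 7B from Hugging Face and generate text.
# Repository ID, trust_remote_code and tokenizer settings are assumptions
# to be verified against the current model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openGPT-X/Teuken-7B-instruct-research-v0.4"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    use_fast=False,               # the custom tokenizer may require the slow implementation
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # reduces the memory footprint on recent GPUs
    trust_remote_code=True,
)
model.eval()

prompt = "How does a tokenizer work?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```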

In addition to code, the Teuken pretraining data contains approximately 50 % non-English data covering 23 European languages and only around 40 % English data. Teuken thus differs from most multilingual models available to date, which were extended with multilingual data only during continued pretraining or fine-tuning. Meta-Llama-3.1-8B, for example, was trained on only eight percent non-English data, a language composition markedly different from that of Teuken's pretraining data.

The chart compares »Teuken 7B-instruct-research-v0.4« with instruction-tuned open-source LLMs of seven to eight billion parameters on selected benchmarks, reporting their performance (accuracy). The selection included:

  • Mistral-7B-Instruct-v0.3 (trained on 8 T tokens)
  • Meta-Llama-3.1-8B-Instruct (trained on 15 T tokens)
  • Salamandra-7b-instruct (trained on 7.8 T tokens)
  • Occiglot-7b-eu5-Instruct (based on Mistral-7B-v0.1, which was trained on 8 T tokens, with continued training on 293 B tokens of additional multilingual and code data)
  • Pharia-1-LLM-7B-control-aligned (trained on 7.7 T tokens)

The per-language evaluation results of the models (without fine-tuning) are shown here as an average over 21 languages; Maltese, Croatian and Irish were omitted because they could not be machine-translated with sufficient quality.

The bar chart shows the performance of »Teuken 7B-instruct-research-v0.4« on the multilingual benchmarks ARC, HellaSwag and TruthfulQA in comparison to similarly sized open-source models. Each bar indicates the respective task performance averaged over 21 languages, alongside the averaged model performance across ARC, HellaSwag and TruthfulQA. On the selected benchmarks, »Teuken 7B-instruct-research-v0.4« is ahead of all other models on average. On the individual benchmarks ARC and HellaSwag, Teuken is in second place behind Salamandra-7b-instruct, and on TruthfulQA in second place behind Mistral-7B-Instruct-v0.3. It is noteworthy that these models were trained on 7.8 T and 8 T tokens, respectively.

Although our model performs well in the linguistic and knowledge tasks, there is still potential for development in the GSM8K and MMLU benchmarks. This will be addressed in future training runs.

Further improving the model's benchmark results requires not only sufficiently high-quality training data but also sufficient computing capacity on dedicated high-performance computers for preprocessing the training data, running scientific experiments and training models.

As part of OpenGPT-X, around two million hours of computing time were made available by German HPC centers. Thanks to successful applications for computing time on MareNostrum, training can be continued on EuroHPC systems in the future.

Despite the instruction tuning, large language models in general may generate content that is inappropriate, offensive, or harmful. Our evaluation of bias and toxicity shows that »Teuken 7B-instruct-research-v0.4« is in the middle of the field compared to other models, meaning that there is still potential for development in the bias and toxicity benchmarks.

The chart shows, for each benchmark, the standard deviation of the per-language results across the 21 languages. The lower the standard deviation, the more consistent the model's performance on the task across all the languages examined. A low value therefore means that the model performs more stably and reliably across languages.
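
For illustration, the following is a small sketch of how such a cross-language mean and standard deviation can be computed; the accuracy values are placeholders, not actual leaderboard results.

```python
# Minimal sketch: average accuracy and cross-language standard deviation for one task.
# The per-language accuracies below are placeholders, not actual leaderboard results.
import statistics

per_language_accuracy = {
    "de": 0.58, "fr": 0.57, "es": 0.59, "it": 0.56, "pl": 0.54,
    # ... in the real evaluation there are 21 languages ...
}

scores = list(per_language_accuracy.values())
mean_accuracy = statistics.mean(scores)
std_across_languages = statistics.stdev(scores)  # lower = more consistent across languages

print(f"mean accuracy:        {mean_accuracy:.3f}")
print(f"std across languages: {std_across_languages:.3f}")
```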

The results show that »Teuken 7B-instruct-research-v0.4« is second only to Salamandra-7b-instruct on HellaSwag and ARC. On TruthfulQA, all models achieve a comparatively low standard deviation; the lowest is achieved by Meta-Llama-3.1-8B, closely followed by »Teuken 7B-instruct-research-v0.4«.

On average across all three tasks, Salamandra-7b-instruct has the lowest standard deviation, with »Teuken 7B-instruct-research-v0.4« exhibiting the second-lowest by only a small margin. These results suggest that models trained on a multilingual dataset may perform more reliably across languages, a hypothesis further corroborated by other benchmarks.

Conversely, other models show outliers for individual languages, both above and below their average performance.

Further research on multilingual models is needed to investigate and verify the effect observed here.

From the outset, the OpenGPT-X project placed particular emphasis on the efficient use of available computing time during model development. The project therefore developed a dedicated multilingual tokenizer, a central element of large language models, that takes the distribution of languages into account. The task of a tokenizer is to break words down into individual components (tokens); the fewer tokens needed, the more (energy-)efficiently and quickly a language model generates its answer.

The results show that multilingual language models trained from scratch with a multilingual tokenizer can be trained and operated more energy- and cost-efficiently in multilingual AI applications.

The percentages of additional cost shown here are based on a fertility study. Fertility measures how many tokens a word is broken down into on average. A low tokenizer fertility reduces the computing power a language model requires (in gigaFLOPS) and thus reduces usage costs. The fertility of different tokenizers is shown as a ratio relative to the English Llama 3 tokenization: documents with the same content in different languages were processed by each tokenizer, and the average fertility was calculated.
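
The following is a minimal sketch of how such a fertility comparison can be computed with Hugging Face tokenizers. The tokenizer repository IDs, the whitespace-based word count and the short example sentences are simplifying assumptions, not the exact methodology of the study.

```python
# Minimal sketch: tokenizer fertility (tokens per word) and additional cost relative
# to a reference tokenization. Repository IDs, the whitespace word split and the
# example sentences are simplifying assumptions, not the study's exact methodology.
from transformers import AutoTokenizer

def fertility(tokenizer, text: str) -> float:
    """Average number of tokens per whitespace-separated word."""
    return len(tokenizer.tokenize(text)) / len(text.split())

# parallel documents: the same content in English and German (placeholder snippets)
english_text = "The weather today is sunny and warm."
german_text = "Das Wetter ist heute sonnig und warm."

reference = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # assumed reference; gated repo
teuken = AutoTokenizer.from_pretrained(
    "openGPT-X/Teuken-7B-instruct-research-v0.4",
    use_fast=False,              # the custom tokenizer may require the slow implementation
    trust_remote_code=True,
)

baseline = fertility(reference, english_text)   # English text, reference tokenizer
german = fertility(teuken, german_text)         # German text, multilingual tokenizer

extra_cost_pct = (german / baseline - 1) * 100  # e.g. +22 % means 22 % more tokens than the reference
print(f"additional cost vs. English reference: {extra_cost_pct:.0f} %")
```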

The studies have shown that, in almost all non-English EU languages, large language models require a disproportionately high amount of computing power due to tokenization. This directly affects the usage costs of a language model, as these are calculated from the number of input and output tokens. Experiments on the tokenization of German texts within OpenGPT-X clearly show that language models with a multilingually trained tokenizer require significantly less computing power.

The diagram shows the additional computing power required to process a non-English text with a language model's tokenizer (in % compared to Llama 3). Teuken models require the least additional computing power and thus incur the lowest costs for these multilingual tasks.

Other language models incur a multiple of these additional costs for non-English texts, for example 148 % for Hungarian or 449 % (not shown) for Greek with the Mistral-7B-v0.3 language model. If a German text is tokenized with the Teuken tokenizer, additional costs of only 22 % are incurred (compared to its English counterpart tokenized with Llama 3). The efficiency gain is particularly noticeable for European languages with long words such as German, Finnish or Hungarian. Costs can thus be saved and energy consumption reduced while maintaining the same model size and performance. In addition, the language model can handle longer queries that would otherwise exceed its limited context length.