Language distribution in Llama 2's pretraining data, by percentage

The distribution of languages in Llama 2's pretraining corpus, limited to languages found in more than 0.005% of the documents.

Most of the data is in English, so Llama 2 can be expected to perform best for English-language use cases. The large "unknown" category is partly made up of programming code.
Chart: Slator. Source: Meta AI.