Language distribution in Llama 2's pretraining data, by percentage

The distribution of languages in Llama 2's pretraining corpus, restricted to languages found in more than 0.005% of the documents.
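
The cutoff described above can be sketched as a simple filter over per-document language labels. This is a minimal illustration, not the pipeline used to build the chart: it assumes each document carries exactly one language label, and the function name `language_distribution` and the sample labels are made up for the example.

```python
from collections import Counter

THRESHOLD_PERCENT = 0.005  # cutoff stated in the caption


def language_distribution(doc_languages, threshold_percent=THRESHOLD_PERCENT):
    """Return {language: percent of documents}, largest share first,
    keeping only languages strictly above the cutoff.

    Assumes one language label per document (an assumption of this sketch).
    """
    counts = Counter(doc_languages)
    total = sum(counts.values())
    shares = {lang: 100.0 * n / total for lang, n in counts.items()}
    ranked = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)
    return {lang: pct for lang, pct in ranked if pct > threshold_percent}


# Hypothetical labels for 100,000 documents (not real Llama 2 figures).
labels = ["en"] * 99_990 + ["de"] * 9 + ["xx"] * 1
dist = language_distribution(labels)
for lang, pct in dist.items():
    print(f"{lang}: {pct:.3f}%")
```

In this made-up sample, the label `xx` covers 0.001% of documents, so it falls below the cutoff and is left out of the result.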
