Figure 1: Number of text-generating LLMs, out of 47 published between 2019 and October 2023, that used Common Crawl in their pre-training data. “Unknown” refers to cases where AI builders did not disclose enough information about the pre-training data to determine whether Common Crawl was used.