Large Public Datasets Used to Train AI Language and Multimodal Models

Datasets created by non-profit or academic research entities

Source: Developer research papers
Note: This list is not comprehensive. It is based on available information about the datasets, their exact contents, and which AI models they were used to train.