A Transparent Approach to AI Training Data

A group of researchers has created an eight-terabyte dataset using only openly licensed or public domain text for training AI models. This approach challenges the prevalent practice of scraping copyrighted materials. Their findings could influence ongoing policy debates surrounding AI and copyright law.


The AI Maker

5/28/2026
2 min read

AI researchers create an eight-terabyte dataset using only openly licensed text

The ongoing debate over AI and copyright is heating up, with major implications for how AI systems are trained. While artificial intelligence companies continue to argue that scraping copyrighted materials is necessary to build powerful models, a new paper sheds light on a different, more transparent approach. A group of more than two dozen researchers has successfully created an eight-terabyte dataset using only text that is openly licensed or in the public domain.

This dataset was used to train a 7-billion-parameter language model that performed comparably to industry standards such as Meta's Llama 2-7B (https://ai.meta.com/llama/). However, this endeavor was neither quick nor easy. The researchers faced numerous challenges, including the need for manual data curation and the complexity of determining the appropriate licenses for various online content.

Their paper, published recently, highlights how this method stands in stark contrast to the more common practice of using copyrighted materials without permission. While the paper does not explicitly take a stance on the legality of scraping, it emphasizes the importance of ethical data sourcing in the ongoing policy discussions surrounding AI.

Recent events, such as Reddit's lawsuit against Anthropic (https://www.reddit.com/r/reddit.com/comments/12i3k0i/reddit_sues_anthropic_for_scraping_data_without/) for alleged unauthorized data access, and the U.K. House of Commons' proposed changes to copyright law, underscore the urgency of these discussions. The policy landscape is shifting, especially following President Donald Trump's recent changes in leadership at the U.S. Copyright Office, which have reignited debates over fair use and AI.

Despite the heavy lifting involved in creating their dataset, the researchers believe their approach could pave the way for a more transparent AI future. Aviya Skowron, one of the paper's co-authors and head of policy at the nonprofit EleutherAI (https://www.eleuther.ai/), encourages others in the field to grapple with the complexities of ethical data sourcing.

In an industry often dominated by large corporations, this initiative represents a significant effort to demonstrate that ethical AI development is possible, even if it requires a bit more elbow grease. The new dataset, dubbed Common Pile v0.1, and the associated model, Comma v0.1, illustrate a commitment to finding more openly licensed content for future AI training.

Ultimately, while the authors may not expect industry giants like OpenAI (https://openai.com/) or Anthropic to adopt this labor-intensive method, they hope it will spark a renewed conversation about transparency in AI development. Even partial transparency, they argue, can yield significant social and scientific benefits.

Cited: https://www.washingtonpost.com/politics/2025/06/05/tech-brief-ai-copyright-report/