
When will it all end: how much longer will we have data for training AI?

Modern artificial intelligence models, particularly the widely adopted large language models (LLMs), rely on vast amounts of information, and their developers strive to use every quality source in existence for training. Historically, computational power was the key bottleneck for AI development, but in recent years the pace of technological progress has begun to outstrip the rate at which new data is created for training datasets. With the advent of powerful chips, many researchers have grown concerned that a shortage of quality training data is not far off.

June 24, 2024

The Limit Is Not Far Off

How pressing is the problem of quality data for AI training? The question is far from trivial, judging by recent statements from executives and founders of major AI projects. For instance, Jack Clark, co-founder of the AI startup Anthropic, notes that their models were trained on a significant share of all the data that has ever existed on the internet. Meanwhile, in an interview with the WSJ, OpenAI's CTO Mira Murati gave no clear answer when asked whether the developers had used social media data to train the Sora model (a neural network for video generation). These statements are indirect evidence that the leading companies in the field have already run up against a shortage of available training data and may be resorting to unauthorized sources.

Against this backdrop, there are growing signs that developers are running short of quality data that is publicly accessible and, most importantly, legal to use. OpenAI, the creator of ChatGPT, is regularly sued for copyright violations, which has prompted the company to expand its legal team.

In late December 2023, The New York Times filed a lawsuit accusing OpenAI and Microsoft of illegally using millions of its articles for AI development. A number of American writers have also sued the owner of ChatGPT; one of the first cases was a joint lawsuit by comedian Sarah Silverman and two other authors against Meta (recognized as an extremist organization and banned in Russia) and OpenAI. All of the plaintiffs claimed that copyrighted material was used to train the AI.

Does it constitute a copyright violation when artificial intelligence draws on a published media article or a scene from a cartoon? That question is for lawyers and lawmakers to answer, and they do appear to be taking it up. If the lawsuits filed by rights holders are widely upheld, developers of modern neural networks will find themselves in a difficult position, facing significant obstacles to scaling their models.

We Need More Content

This year, the Institute for Human-Centered Artificial Intelligence (HAI) at Stanford University released a report on the state of artificial intelligence. Its first chapter specifically notes that experts expect public textual data to be exhausted between 2026 and 2032. Earlier estimates by the same group of researchers predicted a shortage of quality datasets for language models as early as 2024, but they have since revised their forecasts in a more optimistic direction. The outlook for visual data (images and video) is more favorable: researchers expect machine learning models to face a shortage there around 2038-2046.

The main reason for the shortage is that demand is growing faster than the supply of machine-readable text created by humans rather than generated by artificial intelligence. It is still unclear how effective and reliable it is to train AI on data that AI itself has produced.

It is important to remember that a significant portion of the information used to train neural networks belongs to large companies and social networks. This is one reason businesses develop AI projects of their own, for example through the popular RAG (retrieval-augmented generation) technique, a layer on top of a base (foundation) language model. RAG improves the responses of large language models by expanding their context with additional external data needed to answer domain-specific queries. The technique thus helps adapt AI to specialized tasks with minimal effort and reduces so-called "hallucinations", or false statements.
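To make the idea concrete, here is a minimal sketch of the RAG pattern in Python: documents are indexed as TF-IDF vectors, the passages most relevant to a query are retrieved, and the retrieved text is prepended to the prompt that would be sent to a language model. The toy corpus, the threshold values, and the final LLM call are illustrative assumptions, not part of any specific product mentioned above; production systems typically use embedding models and vector databases instead of TF-IDF.

```python
# Minimal retrieval-augmented generation (RAG) sketch.
# Assumptions: a toy in-memory corpus and a placeholder prompt instead of a real LLM call.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical domain documents that the base model has never seen.
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Premium support is available around the clock for enterprise plans.",
    "The 2024 roadmap prioritizes on-premise deployment options.",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Return the top_k documents most similar to the query."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix)[0]
    best = scores.argsort()[::-1][:top_k]
    return [documents[i] for i in best]

def build_prompt(query: str) -> str:
    """Expand the model's context with the retrieved passages."""
    context = "\n".join(retrieve(query))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

if __name__ == "__main__":
    # In a real pipeline this prompt would be sent to an LLM API;
    # here we only print it to show how the extra context is assembled.
    print(build_prompt("How long do customers have to return a product?"))
```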

Market Monopolization and Information Sales

More and more large corporations are entering the artificial intelligence race at full scale. They are building their own foundation models, and their access to vast stores of data that are closed to general use becomes a significant competitive advantage under conditions of severe scarcity. Examples include Elon Musk's Grok model, trained on data from X (formerly Twitter); Mark Zuckerberg's Llama 3, which set the quality standard for open-source AI in the previous generation; and Google's entire Gemini family of models. This state of affairs is likely to lead to the monopolization of the artificial intelligence market.

The shortage of data is already pushing developers to purchase it from private owners and from companies with no serious ambitions in AI development. In mid-May 2024, news emerged that Reddit is partnering with OpenAI to integrate ChatGPT, and the platform is considering the sale of content for AI training as a source of income. There have also been reports of an agreement between Reddit and Alphabet (Google's parent company) that allows Google's AI models to use Reddit data.

Future Prospects

It seems likely that the shortage of data will become an obstacle to creating so-called artificial general intelligence (AGI), capable of performing intellectual work at a human level, in the coming decades. Researchers will now shift their focus to improving the quality of datasets and to using internal corporate information in order to keep developing AI in a highly competitive environment. Big data will become a valuable commodity, which puts information corporations such as Meta, with its access to the posts and communications of billions of people, in a privileged position.

However, when it comes to machine learning, it is not just the quantity but also the quality of information that matters. In particular, the recent FineWeb-Edu work showed that large language models train significantly more effectively when the data is carefully selected and low-quality material is filtered out, even if the resulting dataset is an order of magnitude smaller than the original. Educational content turned out to be especially valuable for training, and the filtering itself can be delegated to an AI algorithm.
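The sketch below illustrates this kind of AI-driven quality filtering. It assumes the publicly released HuggingFaceFW/fineweb-edu-classifier checkpoint, which scores a document's educational value on roughly a 0-5 scale; the threshold of 3 and the sample texts are illustrative choices, not prescriptions from the original work.

```python
# Quality filtering in the spirit of FineWeb-Edu: an AI classifier scores each
# document's educational value, and only documents above a threshold are kept.
# Assumptions: the HuggingFaceFW/fineweb-edu-classifier checkpoint (single
# regression-style output, roughly 0-5) and an illustrative threshold of 3.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "HuggingFaceFW/fineweb-edu-classifier"  # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

def educational_score(text: str) -> float:
    """Predict an educational-quality score for a single document."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze(-1).item()

def filter_corpus(documents: list[str], threshold: float = 3.0) -> list[str]:
    """Keep only documents whose predicted quality meets the threshold."""
    return [doc for doc in documents if educational_score(doc) >= threshold]

if __name__ == "__main__":
    # Toy examples: a lecture-style explanation versus low-value filler text.
    corpus = [
        "Photosynthesis converts light energy into chemical energy in plants...",
        "click here to win a free prize!!! limited offer, act now",
    ]
    print(filter_corpus(corpus))
```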

Thus, the problem of data scarcity for training artificial intelligence models will, on the one hand, create a market for private datasets and, on the other, force developers to adopt more meticulous data collection and invent new architectural solutions. Despite the bleak forecasts regarding the depletion of available text resources, the market still has enough tools to continue improving and scaling current algorithms in the next decade.
