Is LLM Training Data Running Out?

Data scarcity is one of the major problems tech companies will face a few years down the line. AI companies are estimated to exhaust publicly available data for LLM training at the turn of the decade. It is unclear how they plan to keep up with the rising demand for data, the building blocks of an LLM’s consciousness. Spiceworks News & Insights looks at some of the options they will have.

June 19, 2024

is AI training data running out?
(Credits: Beautrium/Shutterstock.com)

  • AI model training is a rapidly growing and highly capital-, compute-, power-, and data-intensive process.
  • Funds can be procured, computing is advancing at full speed, power generation can be sourced through cleaner methods, and AI talent is rapidly emerging.
  • However, sourcing publicly available data will be a major problem for AI companies at the turn of the decade.
  • How can companies ensure AI development doesn’t stall?

According to Stanford University’s AI Index Report 2024, the United States produced 61 noteworthy machine learning models in 2023, followed by China’s 15, France’s eight, Germany’s five, and Canada’s four. In the same year, 149 foundation models were released globally.

The rapid pace of development has spilled into 2024, with the industry setting unprecedented expectations and delivering innovation, growth, and higher integration.

Financially, training a single large language model (LLM) costs tens of millions of dollars. For instance, OpenAI CEO Sam Altman said the company spent over $100 million to train GPT-4. Google spent an estimated $191 million on computing to train Gemini Ultra. Having deep pockets and a propensity to splurge helps.

The demand for AI computing power is doubling every 100 days, according to Intelligent Computing: The Latest Advances, Challenges, and Future researchOpens a new window . It is projected to increase over a million times over the next five years. “With the slowing down of Moore’s law, it becomes challenging to keep up with such a rapid increase in computational capacity requirements,” the authors noted. NVIDIA and other chipmakers are on it.

Further, energy-guzzling LLM training may have alternative solutions. One example is Microsoft planning to build a small-scale reactor to replace fossil fuels for its data center and computing needs, hiring a director of nuclear tech to oversee its plans, and signing an agreement to source power from Helion Energy and its nuclear fusion tech.

Meanwhile, the demand for AI talent in the U.S. almost doubled from 8,611 in May 2023 to over 16,000 in May 2024, according to UMD-LinkUp AI MapsOpens a new window . Fortunately, that’s taking care of itself, as LinkedIn’s Global Talent Trends survey discovered.

The survey noted 17% higher application growth in the past two years in job posts that mention artificial intelligence or generative AI than in job posts with no such mentions. Additionally, 57% of professionals responded positively toward learning more about AI.

What is unclear is how AI companies plan to keep up with the rising demand for data, the building blocks of an LLM’s consciousness, so to speak.

See More: Will Prompt Engineering Change What It Means to Code?

Data Requirements for LLM Training

Let’s look at some numbers:

LLM

Tokens Release Date
GPT 2 ~10 billion

June 2018

GPT 3

~300 billion Feb 2019
Claude ~400 billion

Dec 2021

Gopher

~300 billion Dec 2021
LaMDA 168 billion

Jan 2022

PaLM

~780 billion April 2022
Llama 1.4 trillion

Feb 2023

GPT-4

~13 trillion Mar 2023
PaLM 2 3.6 trillion

May 2023

Llama 2

2 trillion Jul 2023
Claude 2 NA

Jul 2023

Grok-1

NA Nov 2023
Gemini 1.0 NA

Dec 2023

Claude 3

NA Mar 2024
Llama 3 NA

April 2024

Gemini 1.5 Pro

NA May 2024
GPT-4o NA

May 2024

Since companies have started withholding the training data size, consider the first ten models on the above list. Here’s how steep the rise is when it comes to using training data:

Tokens Used for LLM Training Over The Years

Tokens Used for LLM Training Over The Years

So far, an exponentially higher data ingestion rate for LLM training has been the primary vector of progress. Experts predict this cannot be sustainable over the long term.

See More: AI Benchmarks: Why GenAI Scoreboards Need an Overhaul

Are Companies Running Out of Training Data?

Research suggests that data scarcity is indeed a possibility for LLM training shortly. According to trends analyzed by Epoch AI, tech companies will exhaust publicly available data for LLM training between 2026 and 2032.

“The exact point in time at which this data would be fully utilized depends on how models are scaled. If models are trained compute-optimally, there is enough data to train a model with 5e28 floating-point operations (FLOP), a level we expect to be reached in 2028. But recent models, like Llama 3, are often ‘overtrained’ with fewer parameters and more data to make them more compute-efficient during inference,” Epoch AI researchers noted.

Projections of the stock of public data

Source: Will we run out of data? Limits of LLM scaling based on human-generated data studyOpens a new window by Epoch AI 

The crucial thing to note here is that in approximately a decade of generative AI’s existence, companies will have depleted freely available human-generated information in the form of articles, blogs, social media discussions, papers, etc. This rate of data ingestion in LLM training is thus considerably higher than the rate at which humans are producing data.

Moreover, tokens from multiple modalities (text, image, audio, video) for LLM training are also not enough as current video and image stocks are not large enough to prevent a data bottleneck, per the research. Here’s how many tokens each data modality corresponds to:

  • Common Crawl: 130 trillion
  • Indexed web: 510 trillion
  • The whole web: 3100 trillion
  • Images: 300 trillion
  • Video: 1350 trillion

Publicly available data scraped from the web forms the bedrock of LLM training, but is the situation dire (for AI companies)?

See More: GenAI in Legal Industry: Why Intelligent Document Processing Matters?

How Can AI Companies Scale LLMs Without Public Data?

Experts have advised overcoming the data scarcity problem through multiple methods involving offline information, synthetic data, and LLM efficiency improvements.

1. Cut deals with publishers for non-public data

Paywalled and non-indexed data can be leveraged to train LLMs, provided they are appropriately compensated for respective copyrights.

Content licensing is already a multi-million dollar reality since Google joined hands with Reddit for its data for $60 million. Similarly, OpenAI has signed deals with the Associated Press, Axel Springer, Le Monde, Prisa Medi, and Financial Times.

Offline information, such as books, manuscripts, magazines, etc., can be digitized and licensed for the correct fee.

Moreover, research data, including genomics, financial, scientific databases, etc., can be high-quality data in the right context.

Finally, non-indexed deep web data from social media (Facebook, Instagram, Twitter) and other sources and instant messengers remain untapped. Unfortunately, the former can be lower quality than web data, while the latter violates user privacy.

2. LLM advancements

Refining LLM architecture to consume less data to produce the same result can help contain unchecked data ingestion. Techniques such as reinforcement learning have helped attain sample efficiency gains.

Additionally, data enriching and high-quality sample filtering optimize Pareto efficiency, leading to higher LLM performance and training efficiency improvements, according to findings in the How to Train Data-Efficient LLMsOpens a new window study. Beyond quality, data coverage and diversity also play an important role in making LLMs efficient.

“It’s important to note that the relationship between data quantity and language model performance is not always linear. In some cases, doubling the training data may yield diminishing returns in metrics like perplexity or downstream task accuracy,” noted Sunil Ramlochan, enterprise AI strategist and founder of PromptEngineering.org.

“Determining how much data is needed to train a language model is an empirical question best answered through systematic experimentation at different orders of magnitude. By measuring model performance across varying data scales and considering factors like model architecture, task complexity, and data quality, NLP practitioners can make informed decisions about resource allocation and continuously optimize their language models over time.”

Transfer learning, i.e., when a model is initially pre-trained on a data-rich task before being finetuned on a downstream task, can be considered viable for AI training. One of the conclusions researchers derived from Exploring the Limits of Transfer Learning with a Unified Text-to-Text TransformerOpens a new window study is as follows:

“One beneficial use of transfer learning is the possibility of attaining good performance on low-resource tasks. Low-resource tasks often occur (by definition) in settings where one lacks the assets to label more data.”

3. Synthetic data

When obtaining real-world data becomes problematic (such as websites banning web crawlers), expensive, or downright impossible after the well runs dry, synthetic data will come to the rescue. Gartner predicted that by 2026, 75% of businesses will use generative AI to create synthetic customer data.

Synthetic data has the benefit of having similar mathematical patterns without the information of the original data from which it was derived. It is algorithmically generated via computer simulations filled with new scenarios that an LLM can gorge upon.

Synthetic data can prove highly beneficial in limiting organizations’ dependence on internet data for LLM training. It bears the same correlations and statistical properties as real-world data but can go beyond artificially creating and introducing situations that can enrich the training experience.

Colossal amounts of synthetic data can be created relatively quickly with the help of existing LLMs, enabling faster development and project turnaround times. To that end, methods like DeepMind’s Reinforced Self-Training (ReST) for Language ModelingOpens a new window can help.

However, synthetic data has limitations, including biased responses, inaccuracies, hallucinations, and security and privacy risks. It can also be quite simplistic and thus fail to capture the nuances of real-world scenarios.

Machine-generated synthetic data may also cause LLMs to become an echo chamber of poor outputs, leading to what researchers call Mad Autophagy Disorder (MAD). In Self-Consuming Generative Models Go MADOpens a new window , researchers concluded, “Without enough fresh real data in each generation of an autophagous loop, future generative models are doomed to have their quality (precision) or diversity (recall) progressively decrease.”

MORE ON ARTIFICIAL Intelligence

Sumeet Wadhwani
Sumeet Wadhwani

Asst. Editor, Spiceworks Ziff Davis

An earnest copywriter at heart, Sumeet is what you'd call a jack of all trades, rather techs. A self-proclaimed 'half-engineer', he dropped out of Computer Engineering to answer his creative calling pertaining to all things digital. He now writes what techies engineer. As a technology editor and writer for News and Feature articles on Spiceworks (formerly Toolbox), Sumeet covers a broad range of topics from cybersecurity, cloud, AI, emerging tech innovation, hardware, semiconductors, et al. Sumeet compounds his geopolitical interests with cartophilia and antiquarianism, not to mention the economics of current world affairs. He bleeds Blue for Chelsea and Team India! To share quotes or your inputs for stories, please get in touch on [email protected]
Take me to Community
Do you still have questions? Head over to the Spiceworks Community to find answers.