AI training data comes with a price tag that only Big Tech can afford

Data is at the heart of today's advanced AI systems, but it is increasingly expensive, making it out of reach for all but the wealthiest tech companies.

Last year, James Betker, a researcher at OpenAI, wrote a post on his personal blog about the nature of generative AI models and the datasets on which they are trained. In it, Betker claimed that training data – not a model's design, architecture, or any other feature – was the key to increasingly sophisticated, capable AI systems.

“If you train on the same data set long enough, virtually every model converges to the same point,” Betker wrote.

Is Betker right? Is training data the biggest determinant of what a model can do, whether it's answering a question, drawing human hands, or generating a realistic cityscape?

It's certainly plausible.

Statistical machines

Generative AI systems are essentially probabilistic models: a huge pile of statistics. Trained on vast numbers of examples, they predict which data is most “logical” to place where (for example, the word “go” before “to the market” in the sentence “I'm going to the market”). So it seems intuitive that the more examples a model sees, the better the performance of models trained on those examples.
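That next-word intuition can be made concrete with a toy bigram counter. This is a deliberate oversimplification of what transformer models actually do, and the three-sentence corpus is invented for illustration, but it shows the core idea: predictions are just statistics over observed examples, and more data means better statistics.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the web-scale data real models train on
corpus = [
    "i am going to the market",
    "i am going to the store",
    "i am going home",
]

# Count word-pair frequencies: a crude bigram "model"
counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1

def most_likely_next(word):
    """Return the statistically most frequent follower of `word`."""
    return counts[word].most_common(1)[0][0]

print(most_likely_next("going"))  # "to" follows "going" in 2 of 3 sentences
```

Adding more sentences to `corpus` sharpens the counts, which is the crude analogue of Betker's claim that data, not architecture, drives capability.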

“It seems like the performance gains come from data,” Kyle Lo, a senior applied researcher at the Allen Institute for AI (AI2), a nonprofit AI research organization, told JS, “at least once you have a stable training setup.”

Lo gave the example of Meta's Llama 3, a text-generating model released earlier this year that outperforms AI2's own OLMo model despite being architecturally very similar. Llama 3 was trained on significantly more data than OLMo, which Lo says explains its superiority on many popular AI benchmarks.

(I'd note here that the benchmarks commonly used in the AI industry today aren't necessarily the best measure of a model's performance, but outside of qualitative tests like ours, they're one of the few measures we have to go on.)

That's not to say that training on exponentially larger data sets is a sure path to exponentially better models. Models operate on the “garbage in, garbage out” paradigm, Lo notes, which is why data curation and quality are of great importance, perhaps even more important than just quantity.
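The “garbage in, garbage out” point comes down to curation: filtering a raw corpus before training. As a minimal sketch (the documents and thresholds here are made up, and real pipelines like those behind curated web datasets apply far more filters), two of the simplest passes are exact deduplication and dropping very short fragments:

```python
import hashlib

raw_docs = [
    "The quick brown fox jumps over the lazy dog near the riverbank.",
    "The quick brown fox jumps over the lazy dog near the riverbank.",  # exact duplicate
    "click here",                                                       # too short to be useful
    "A well-formed paragraph with enough substance to be worth keeping around.",
]

def curate(docs, min_words=8):
    """Drop exact duplicates and very short fragments, two of the
    simplest filters data-curation pipelines apply at scale."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode()).hexdigest()
        if digest in seen or len(doc.split()) < min_words:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

print(len(curate(raw_docs)))  # 2 documents survive the filters
```

A smaller but cleaner corpus can beat a larger noisy one, which is the mechanism behind Lo's Falcon-versus-Llama example below.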

“It is possible that a small model with carefully designed data will perform better than a large model,” he added. “For example, the Falcon 180B, a large model, ranks 63rd in the LMSYS benchmark, while Llama 2 13B, a much smaller model, ranks 56th.”

In an interview with JS last October, OpenAI researcher Gabriel Goh said that higher quality annotations have greatly contributed to the improved image quality in DALL-E 3, OpenAI's text-to-image model, compared to its predecessor DALL-E 2. “I think this is the main source of the improvements,” he said. “The text annotations are a lot better than they were [with DALL-E 2] – it's not even comparable.”

Many AI models, including DALL-E 3 and DALL-E 2, are trained by having human annotators label data so that a model can learn to associate those labels with other observed features of that data. For example, a model that is given many cat photos with annotations for each breed will eventually “learn” to associate terms like “short tail” and “short hair” with their distinctive visual features.
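The label-to-feature association can be sketched with a tiny nearest-neighbor classifier. This is not how DALL-E is trained; it's a toy stand-in, and the features and measurements are invented for illustration, but it shows why annotation quality matters: the model's only notion of “short tail” is whatever the human labels attach it to.

```python
# Tiny hand-labeled dataset: (tail_length_cm, hair_length_cm) -> annotator's label.
# Values are made up for illustration.
labeled_data = [
    ((3.0, 4.0), "short tail"),
    ((2.5, 5.0), "short tail"),
    ((25.0, 1.5), "short hair"),
    ((28.0, 2.0), "short hair"),
]

def classify(features):
    """Label a new example with the label of its nearest annotated neighbor."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(labeled_data, key=lambda item: dist(item[0], features))
    return label

print(classify((4.0, 4.5)))  # lands near the annotated "short tail" examples
```

Swap in sloppy or inconsistent labels and the classifier degrades accordingly, which is the point Goh makes about DALL-E 3's improved text annotations.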

Bad behavior

Experts like Lo worry that the growing emphasis on large, high-quality training datasets will centralize AI development among the few players with billion-dollar budgets who can afford to purchase these sets. Major innovation in synthetic data or fundamental architecture could disrupt the status quo, but neither seems to be on the near horizon.

“In general, entities that manage content potentially useful for AI development are incentivized to shelve their material,” Lo says. “And as access to data closes off, we are essentially blessing a few early movers in data acquisition and pulling up the ladder so that no one else can access data to catch up.”

Where the race to collect more training data hasn't led to unethical (and perhaps even illegal) behavior like secretly collecting copyrighted content, it has rewarded tech giants with deep pockets to spend on data licensing.

Generative AI models like OpenAI's are trained primarily on images, text, audio, videos, and other data (some of it copyrighted) sourced from public web pages (including problematic AI-generated content). The OpenAIs of the world claim that fair use protects them from legal reprisal. Many rights holders disagree, but for now there is not much they can do to prevent the practice.

There are many examples of generative AI vendors questionably acquiring massive data sets to train their models. OpenAI reportedly transcribed over a million hours of YouTube videos without YouTube's blessing (or the creators') to feed its flagship model GPT-4. Google recently broadened its terms of service in part to allow it to use public Google Docs, restaurant reviews on Google Maps, and other online material for its AI products. And Meta reportedly considered risking lawsuits to train its models on IP-protected content.

Meanwhile, companies large and small rely on workers in developing countries who are paid only a few dollars an hour to create annotations for training sets. Some of these annotators, employed by giant startups like Scale AI, work days on end to complete tasks that expose them to graphic depictions of violence and bloodshed, without any benefits or guarantees of future work.

Rising costs

In other words, even the fairer data deals aren't exactly fostering an open and equitable generative AI ecosystem.

OpenAI has spent hundreds of millions of dollars licensing content from news publishers, stock media libraries, and more to train its AI models — a budget that far exceeds that of most academic research groups, nonprofits, and startups. Meta has even gone so far as to consider acquiring publisher Simon & Schuster for the rights to ebook excerpts (Simon & Schuster was ultimately sold to private equity firm KKR for $1.62 billion in 2023).

With the AI training data market expected to grow from roughly $2.5 billion today to nearly $30 billion within a decade, data brokers and platforms are rushing to charge top dollar – in some cases over the objections of their users.

Stock media library Shutterstock has inked deals with AI vendors worth between $25 million and $50 million, while Reddit says it has made hundreds of millions from licensing data to organizations like Google and OpenAI. Few platforms with an abundance of data collected organically over the years haven't signed agreements with generative AI developers, it seems – from Photobucket to Tumblr to the question-and-answer site Stack Overflow.

It's the platforms' data to sell – at least depending on which legal arguments you believe. But in most cases, users don't see a cent of the profits. And it's hurting the broader AI research community.

“Smaller players will not be able to afford these data licenses and therefore will not be able to develop or study AI models,” Lo said. “I fear this could lead to a lack of independent oversight of AI development practices.”

Independent efforts

If there is a ray of sunshine shining through the darkness, it is the few independent, non-profit efforts to create massive data sets that anyone can use to train a generative AI model.

EleutherAI, a non-profit research group that started as a loose Discord collective in 2020, is teaming up with the University of Toronto, AI2 and independent researchers to create The Pile v2, a series of billions of text passages sourced primarily from the public domain.

In April, AI startup Hugging Face released FineWeb, a filtered version of the Common Crawl – the eponymous dataset maintained by the nonprofit Common Crawl, made up of billions upon billions of web pages – which Hugging Face claims improves model performance on many benchmarks.

A number of attempts to release open training datasets, such as the LAION group's image sets, have run into copyright, data privacy, and other equally serious ethical and legal challenges. But some of the more dedicated data curators have vowed to do better. The Pile v2, for example, removes problematic copyrighted material found in its predecessor dataset, The Pile.

The question is whether any of these open efforts can keep pace with Big Tech. As long as data collection and management remains a matter of resources, the answer is likely no – at least not until a breakthrough in research levels the playing field.
