AI Industries Battle for Training Data


A recent story in The Verge highlighted how major AI companies are struggling to train their models and are competing for high-quality training data.

Recent reports by The Wall Street Journal and The New York Times indicate that AI companies are having a hard time getting high-quality training data, leading some of them to employ questionable methods that fall into a legal gray area of copyright law.

For instance, OpenAI, a prominent player in the field, needed data for its advanced language model, GPT-4. To address the issue, the company reportedly developed its Whisper audio transcription model and used it to transcribe over a million hours of YouTube videos, under the personal supervision of Greg Brockman, OpenAI’s president. According to the Times, OpenAI acknowledged the legal ambiguity of this approach but believed that it fell within the purview of fair use.

Google, another influential entity in AI, has also drawn on YouTube, training its models “on some YouTube content, in accordance with our agreements with YouTube creators,” Matt Bryant, a Google spokesperson, told The Verge. Discussing OpenAI’s activities, Bryant added that “both our robots.txt files and Terms of Service prohibit unauthorized scraping or downloading of YouTube content.”

The Times writes that Google’s legal department asked the company’s privacy team to tweak its policy language to expand what it could do with consumer data, such as its office tools like Google Docs. The new policy was reportedly intentionally released on July 1st to take advantage of the distraction of the Independence Day holiday weekend.

-The Verge

Meta (formerly Facebook) reportedly encountered its own set of challenges in procuring suitable training data. The Times reported that Meta’s AI team deliberated internally about using copyrighted works without authorization to catch up with OpenAI, and even considered acquiring a major publishing company. Meta apparently also had a hard time finding training data amid increasingly stringent privacy regulations, particularly in the aftermath of the Cambridge Analytica scandal.

“Google, OpenAI, and the broader AI training world are wrestling with quickly-evaporating training data for their models, which get better the more data they absorb. The Journal wrote this week that companies may outpace new content by 2028,” The Verge reports.

Proposed solutions include training models on “synthetic” data generated by their own systems, or employing “curriculum learning,” in which a model is fed training data in a deliberately ordered sequence. However, neither approach is proven, and both carry limitations and uncertainties.

In light of these developments, it is clear that AI companies are having a hard time obtaining the data they need while adhering to legal and ethical standards. How they navigate this situation will shape the future of AI innovation and regulation.

Acknowledgements: ChatGPT was used to summarize some of the content and prepare the first draft.