OpenAI and Google trained their AI models on text transcribed from YouTube videos, potentially violating creators’ copyrights, according to The New York Times.
Note – the New York Times is itself embroiled in copyright lawsuits over AI, in which it plainly fails to grasp that an AI reading content is the same as a person reading content; that content offered up for free with no paywall is free for everyone; and that feeding content in and then asking for it back does not mean copyright has been infringed.
[…]
It comes just days after YouTube CEO Neal Mohan said in an interview with Bloomberg Originals that OpenAI’s alleged use of YouTube videos to train its new text-to-video generator, Sora, would go against the platform’s policies.
According to the NYT, OpenAI used its Whisper speech recognition tool to transcribe more than one million hours of YouTube videos, which were then used to train GPT-4. The Information previously reported that OpenAI had used YouTube videos and podcasts to train the two AI systems. OpenAI president Greg Brockman was reportedly among the people on this team. Per Google’s rules, “unauthorized scraping or downloading of YouTube content” is not allowed.
[…]
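For readers curious what that transcription step looks like in practice, here is a minimal sketch using the open-source openai-whisper package. The file name "talk.mp3" is a placeholder of mine, and this illustrates the general kind of speech-to-text pass the NYT describes, not a claim about OpenAI's actual internal pipeline.

```python
# Minimal sketch: speech-to-text with the open-source openai-whisper package.
# Requires `pip install openai-whisper` and ffmpeg on the system path.
# "talk.mp3" is a hypothetical local audio file, not a real dataset.
import whisper

model = whisper.load_model("base")     # small general-purpose Whisper model
result = model.transcribe("talk.mp3")  # run speech recognition over the audio
print(result["text"])                  # plain-text transcript, usable as training text
```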
The way data is represented inside an ML model means it is not stored as scraped or downloaded copies – unless you consider every view to be a download or a scrape.
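To illustrate the point being made here: a trained model is a fixed-size set of numeric weights, and that size does not grow with the amount of text shown to it during training. A toy sketch (mine, in PyTorch, with made-up vocabulary and embedding sizes):

```python
# A minimal sketch (not any vendor's actual training code) illustrating that a
# model's learned state is a fixed-size set of numeric weights, not a store of
# the documents it was trained on.
import torch.nn as nn

# Tiny language-model-shaped network: 50k-token vocabulary, 128-dim embeddings.
model = nn.Sequential(
    nn.Embedding(50_000, 128),
    nn.Linear(128, 50_000),
)

n_params = sum(p.numel() for p in model.parameters())
print(f"Parameter count: {n_params:,}")

# Whether training covers one transcript or a million hours' worth, the number
# of parameters above does not change; training only adjusts their values via
# gradient updates.
```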
What this shows is a determination to ride the AI hype and find a way to monetise content that has already been released freely to the public, with no extra effort beyond hiring a bunch of lawyers. The players are big and the payoff is potentially huge in cash terms, but in terms of setting back progress, throwing everything under the copyright bus is a staggering disaster.
Source: OpenAI and Google reportedly used transcriptions of YouTube videos to train their AI models
Robin Edgar
Organisational Structures | Technology and Science | Military, IT and Lifestyle consultancy | Social, Broadcast & Cross Media | Flying aircraft