By Avi Asher-Schapiro | U.S. Tech Correspondent
A new report from the tech investigation website Proof News found that some of the biggest companies in the world – including Apple, Amazon, and Salesforce – have been training their AI on YouTube videos without permission from the content creators. Proof News created a tool to search for videos in the YouTube AI training dataset, which also includes content from The Wall Street Journal, NPR, and the BBC, as well as “The Late Show With Stephen Colbert,” “Last Week Tonight With John Oliver,” and “Jimmy Kimmel Live.” The training data comes from a dataset put together by the AI company EleutherAI, pulled from the subtitles of thousands of videos.
A picture illustration shows YouTube on a cell phone in June 2014. REUTERS/Dado Ruvic
Creators interviewed by Proof News felt cheated. “No one came to me and said, ‘We would like to use this,’” David Pakman, the host of “The David Pakman Show,” a left-leaning politics channel, told Proof News. The dataset is part of a compilation called the Pile, which also includes material from the European Parliament, English Wikipedia, and a trove of Enron Corporation employees’ emails that was released as part of a federal investigation. YouTube Subtitles and other types of speech-to-text data are potentially a “gold mine,” said AI policy researcher Jai Vipra, because they can help train models to replicate how people talk and converse. “YouTube’s terms cover direct use of its platform, which is distinct from use of The Pile dataset. On the point about potential violations of YouTube’s terms of service, we’d have to refer you to The Pile authors,” Anthropic said in a statement. Representatives at EleutherAI, the creators of the dataset, did not respond to requests for comment.