Tech Giants Use YouTube Subtitles for AI Training Without Permission

Cryptopolitan · 2024-07-16T22:53:05.000Z

Apple, Nvidia, and Anthropic have been found to have used YouTube subtitles to train AI models, in violation of YouTube's policies. A report by Proof News and Wired showed that these firms used a dataset of transcripts from thousands of YouTube videos without obtaining a license to do so.

The dataset in question, YouTube Subtitles, consists of transcripts from 173,536 YouTube videos across 48,000 channels. The sources include educational channels such as Khan Academy and MIT, news outlets such as The Wall Street Journal, and top creators such as MrBeast and Marques Brownlee.

Popular YouTubers react to data exploitation

Marques Brownlee, a popular YouTuber, commented on the issue on X. He said, “Apple has gathered data for AI from other firms. One of them collected a lot of data/transcripts from YouTube videos, including mine.” While Apple may not have scraped the data directly, Brownlee pointed out that this problem will persist.

The “YouTube Subtitles” dataset was developed by EleutherAI and published in 2020. It contains 5.7GB of data, including subtitles from YouTube videos that have since been removed from the platform.

According to YouTube’s terms and conditions, accessing videos by “automated means” is prohibited. The presence of subtitles from removed videos compounds the issue, raising questions about privacy and copyright infringement.

Salesforce, which was also implicated in the investigation, has acknowledged using the dataset.

“The Pile dataset referred to in the research paper was trained in 2021 for academic and research purposes. The dataset was publicly available and released under a permissive license.”
Salesforce spokesperson

Even so, the use of YouTube content without permission remains controversial. In April, YouTube CEO Neal Mohan said that using YouTube videos, transcripts, or clips for AI training is a “clear violation” of the platform's policies. Nevertheless, according to The New York Times, OpenAI used a million hours of YouTube videos to train its GPT-4 model.

Legal battles erupt over AI companies’ use of internet content

Disputes over AI companies using internet content without authorization have intensified since the launch of ChatGPT. Content creators are suing Stability AI and Midjourney for allegedly scraping copyrighted works without permission. YouTube’s owner, Google, has faced class-action lawsuits over similar claims and has stated that legal actions of this kind threaten the basis of generative AI.

In an interview with The Wall Street Journal, OpenAI’s CTO Mira Murati did not elaborate on whether the company used videos from social media platforms to train its new model. Microsoft AI CEO Mustafa Suleyman stated that content on the open web has been considered fair use since the 1990s, based on what he called the “social contract.”

