NVIDIA Faces Scrutiny Over Alleged Unlicensed Data Scraping for AI Models

Cryptopolitan · 2024-08-05T21:13:01.000Z

Leaked documents obtained by 404 Media suggest that NVIDIA scraped data without a license, using movie and game footage from across the internet to train its artificial intelligence products.

According to the documents, the company sought to download full movies from various sources, including Netflix, though its primary interest was YouTube videos. Emails obtained by 404 Media show that project managers planned to use between 20 and 30 virtual machines on Amazon Web Services to download 80 years' worth of video in a day.

NVIDIA defends its actions and invokes fair use provisions

Data scraping, as discussed here, is the practice of extracting video, text, and audio content from the internet, without the permission of the content owners, to train AI models. Because social media platforms host copyrighted material, scraping them can mean using that material.

NVIDIA has said that its data scraping did not break any copyright laws. The company also stated that its activities fall under the fair use doctrine because it uses the copyrighted material to train AI.

Internal communications obtained by 404 Media indicate that some NVIDIA employees raised concerns about the scraping. Project managers allegedly downplayed them, saying that legal issues, such as violations of YouTube's Terms of Service, would be dealt with later.

One employee pointed out that NVIDIA's AI engineers tried to collect as many game clips as possible to enrich the training corpus. This entailed streaming games through NVIDIA's GeForce NOW cloud service and recording the gameplay in high definition. In internal messages, Jim Fan, senior research analyst, also stressed the importance of such footage as training input for the AI model.

Company takes steps to manage public perception of data practices

The documents also detail NVIDIA's attempts at damage control. According to leaked emails, Research VP Ming-Yu Liu recommended that the company avoid publishing any papers related to its data scraping techniques to prevent public backlash. The company also built its own set of YouTube scraping tools and created API accounts to support the data gathering.

The law governing data scraping for AI remains unsettled. According to MIT's Robert Mahari, establishing that data scraping has occurred can be quite complicated. Organizations may benefit from not disclosing the sources of their training data, because misuse is hard to prove without tangible evidence.

Suno, an AI music generation platform, also recently came under the spotlight after admitting that it used data scraping to train its AI models. As previously reported by Cryptopolitan, Reddit CEO Steve Huffman stated that the company will continue to block Microsoft and other AI firms from scraping its data until they pay and the platform gains control over how the data is used. He said Reddit would not permit scraping to train AI models without a proper license.

