Citing copyright infringement, the Netherlands-based organization BREIN has succeeded in taking down a large Dutch-language dataset that was being used to train AI models.

In a statement released on Tuesday, BREIN explained that the dataset comprised 10,000 books, news articles, and Dutch-language subtitles for movies and TV series, all obtained without permission.

EU’s AI Act aims to regulate training data sources

According to BREIN director Bastiaan van Ramshorst, it was not immediately clear how widely AI firms had used the dataset. “It’s very difficult to know, but we are trying to be on time” to avoid future lawsuits, he said.

The European Union’s recently proposed AI Act will also require AI companies to disclose the datasets and data sources used to train their models. Related legal battles are still being fought in the United States. For example, Microsoft-backed OpenAI has been involved in several legal disputes, including a recent one with the New York Times.

Microsoft is alleged to have copied the plaintiff’s registered journalism works in addition to other copyrighted journalism works. On the issue of potential infringement, the company’s CEO has been quoted as saying that the company has this data.

The allegations suggest that Microsoft used these copyrighted materials in AI products, including ChatGPT and Copilot, without obtaining licenses. The complaint specifically accuses Microsoft of removing significant information from these works, such as the author’s name, the title of the work, the ‘copyright’ watermark, and other restrictions.

In Denmark, anti-piracy measures have also produced substantial results in the fight against copyright infringement. Last year, the Danish Rights Alliance, a copyright protection group based in Denmark, successfully demanded that the “Books3” dataset be taken offline.

Dataset provider complies with court order, removes content

The person who provided the Dutch dataset complied with the court order obtained by BREIN. As a result, the dataset was removed from the website that had previously offered it for download. BREIN declined to disclose the identity of the person involved in the case, citing Dutch privacy laws.

The removal of this dataset shows that copyright enforcement groups continue to fight for the protection of intellectual property rights in the digital world. To address the mass scraping of copyrighted materials, BREIN recommends that rights holders use the rights reservation provided under the Copyright Act (Article 15o.1).