Web Scraping and AI Datasets: if the purpose is in the public interest there is no copyright infringement

“The creation of a dataset … which can form the basis for the training of artificial intelligence systems, can certainly be considered scientific research … Although the creation of the dataset as such may not be immediately associated with an increase in knowledge, it constitutes an essential step for the goal of using it to subsequently acquire the knowledge in question” – by Andrea Monti – originally published in Strategikon – Italian Tech-La Repubblica

With this statement, ruling 310 O 227/23*, issued on 27 September 2024 by the Hamburg Regional Court, sets a fundamental precedent for the development of artificial intelligence in the EU because it also applies Article 60d of the German Copyright and Related Rights Act (a sort of ‘fair-use exemption’) to datasets.

The creation of a free and publicly available dataset is ‘scientific research’.

In detail, this article of the German Copyright Act authorises the extraction of data from reproductions of protected works even without the consent of the rights holder, provided the goal is scientific research carried out by public or private entities that do not pursue profit and that make the results of their work publicly available without restriction.

In summary, the basis of the reasoning is that non-profit research carried out in the public interest implements a solidaristic principle according to which proprietary rights can be limited if the results of the activity are freely shared with the community.

Extending this legal principle to datasets brings an element of certainty to the controversy over the use of data extracted from third-party content to create AI models.

The future Italian law on AI is based on the same principle, but for personal data

A further point of interest in the German court’s approach is its striking similarity to the one adopted by the bill on artificial intelligence currently under discussion in the Italian Senate. Article 8 of the draft bill establishes an analogous principle, albeit in relation to personal data rather than data that can be extracted from works protected by copyright.

Therefore, if the Italian regulation is approved, medical-scientific research carried out by public bodies and private non-profit entities (e.g. patients’ associations) will be able to take advantage of the operational simplification already allowed by the data protection regulation. Companies that process the same data for private gain, by contrast, will remain subject to the obligations laid down by law for activities carried out in their own interest.

In other words, and in compliance with EU law, a two-speed system is being designed: a faster lane for those who work in the common interest and a slower one for those who, albeit legitimately, pursue only their own profit.

Companies can also do non-profit research but must share the results

As is by now clear, the difference between non-profit and for-profit is the pivot around which the entire German decision revolves. On this point the decision establishes a further interesting principle: non-profit activity can also be carried out by a commercial entity, as long as the results of the research are made available in a non-discriminatory manner to the entire community. The court writes, verbatim: ‘The question whether or not the research has non-commercial purposes depends solely on the specific nature of the scientific activity, while the organisation and funding of the institution in which the research is carried out are irrelevant (recital 42 in the preamble to Directive 29/2001). … The fact that the dataset – as the plaintiff claims – is also used by commercial companies for the training or further development of their artificial intelligence systems is irrelevant because the research of commercial companies is still research – even if not as such within the meaning of the German Federal Data Protection Act.’

Web scraping is not, per se, prohibited

A third very interesting part of the German ruling concerns the lawfulness of web scraping even of content for which one does not hold a licence.

The court holds that, again within the limits of the non-profit research purpose, intellectual property rights are not infringed by the mere compilation of a dataset because it is not automatically certain that something workable will be built on that dataset, nor can one know what content will actually be generated.

Again, given the sensitivity of the issue, it is appropriate to quote the relevant passage of the decision verbatim: ‘it is further argued that AI web scraping concerns the intellectual content of works used for training purposes and, ultimately, the creation of identical or similar competing products (to those of the rights holder, author’s note) …, according to the Chamber, this argument does not distinguish rigorously enough between: the creation of a dataset … the subsequent training of the neural network with this dataset and … the subsequent use of the trained AI for the purpose of creating new image content. The latter functionality may already be the goal when the dataset is created. However, at the time of its construction, it is not possible to predict how successful the second step (the training of the model) will be, nor what specific content can be generated by the trained AI in the third step (in the application of the AI).’

Conclusions

The European future of AI-based technologies is compromised by a stalemate between the (many) actors who own the data and the (few) private actors who have the technologies to turn it into value.

The former do not want to give away data and information for free, while the latter claim the right to take for free what they need for their own purposes.

It is as if the inhabitants of a village individually possess raw materials that only acquire value if they are pooled, but which they are unable to exploit; while on the other hand there are those who possess the tools to profit from those raw materials, but who do not want to share the results of their transformation with those who produce them.

If the German approach were to be consolidated and spread, the stalemate could perhaps be overcome by giving greater weight to the common interest, in a geo-economic and geopolitical context where this issue has not been at the centre of agendas for some time.

*Translation from German is not official
