Cropped photograph of a drawing of a lion by Rembrandt van Rijn.

LAION vs Kneschke: Building public datasets is covered by the TDM exception

This article was first published on the Open Future blog on October 10, 2024.

Two weeks ago, the Landgericht Hamburg decided the case LAION vs Kneschke. The latter – a photographer – had sued LAION for including one of his pictures in the LAION-5B training data set, claiming that this amounted to copyright infringement. The court has now ruled that LAION did not infringe Kneschke's copyright because the use of his photo was covered by the German implementation of the text and data mining (TDM) exception for purposes of scientific research introduced in Article 3 of the CDSM Directive. This first court test of the EU legal framework for AI training is good news for LAION and for anyone interested in training data transparency in general.

The ruling confirms that the TDM exceptions introduced in the 2019 Copyright Directive do, in fact, cover the use of copyrighted works in the context of training (generative) AI models. More importantly, it is good news because it recognizes the important role of nonprofit providers of public training datasets and suggests a way to ensure greater training data transparency. For more details on the background of the case, see my earlier post here.

The issue at stake

The facts of the case are rather convoluted. Kneschke brought the case for unauthorised use of his photo against LAION, a German non-profit organisation that maintains and makes available online the LAION 5B training dataset. The LAION 5B dataset does not make the images themselves available; it only contains image descriptions and hyperlinks to where the images could be found when the dataset was created. In other words, Kneschke could not sue LAION for making his images available, because the dataset contains no image files (a fact that some commentators try to ignore by relying on rather expansive interpretations of the ever-expanding CJEU jurisprudence on hyperlinks). Instead, Kneschke argued that LAION made reproductions of his images in a process where an AI model is used to verify that the image descriptions published alongside the source images that LAION had obtained from the Internet actually describe the contents of the associated image file.
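To make this concrete: the paragraph above describes two technical facts – that a LAION-5B record consists only of a caption and a URL, and that a model was used to check whether captions actually match the linked images. The sketch below illustrates what such a verification step might look like in Python, using CLIP via the Hugging Face transformers library. The record fields, model checkpoint, and similarity threshold are illustrative assumptions, not a reconstruction of LAION's actual pipeline.

```python
# A minimal sketch (not LAION's actual code) of a caption-image
# verification step: download the image temporarily, embed both the
# image and its caption with CLIP, and keep the pair only if the two
# embeddings are similar enough. The threshold below is an assumption.

from io import BytesIO

import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# A LAION-5B-style record: metadata only, never the image file itself.
record = {
    "url": "https://example.com/photo.jpg",  # hypothetical source URL
    "caption": "a lion drawn in red chalk",  # alt text scraped alongside it
}

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def caption_matches_image(record: dict, threshold: float = 0.3) -> bool:
    """Return True if the caption plausibly describes the linked image."""
    response = requests.get(record["url"], timeout=10)
    image = Image.open(BytesIO(response.content)).convert("RGB")
    inputs = processor(
        text=[record["caption"]], images=image,
        return_tensors="pt", padding=True,
    )
    with torch.no_grad():
        outputs = model(**inputs)
    # image_embeds and text_embeds are L2-normalised, so their dot
    # product is the cosine similarity between image and caption.
    similarity = (outputs.image_embeds @ outputs.text_embeds.T).item()
    return similarity >= threshold
```

Note that, in a process of this kind, the only reproduction of the photograph is the temporary download inside the verification step; what ends up in the published dataset is the record (caption and URL), not the image. This is exactly the reproduction that Kneschke's claim targeted.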

In this situation, the court correctly found that the reproductions in the context of building the dataset were acts of text and data mining and thus fell within one of the TDM exceptions. The court further held that LAION, as a non-profit organisation with the objective of advancing the scientific understanding of artificial intelligence, falls within the scope of the TDM exception for scientific research in Article 3 CDSM (implemented in Germany by § 60d UrhG). It rejected the argument that LAION is not sufficiently independent from commercial entities (in this case Stability.ai) that train AI models based on the LAION 5B dataset and make them available for commercial use.

It is important to note that the court granted LAION legal protection under the Article 3 TDM exception because LAION contributes to scientific research. Given that LAION is not a typical scientific research organisation (a university or the like), but rather a non-profit organisation with a mission to advance scientific (and thus public) understanding of training datasets – a key element of the mostly opaque AI model training process – this is very welcome. LAION and similar organisations like Common Crawl play a crucial role both in the AI training pipelines of (commercial) model developers and in the public understanding of what these models are trained on and how they work.

Here, the positive impact of LAION and Common Crawl cannot be overestimated. The fact that the LAION training dataset is publicly available has allowed creators to see the extent to which their works are being used to train AI models (and subsequently sue some of the model providers); it has allowed startups to develop systems that let creators and rights holders register opt-outs; it has allowed researchers to understand the prevalence of Child Sexual Abuse Material (CSAM) and other highly problematic content in training datasets (which in turn forced LAION to remove thousands of instances of CSAM from the dataset and re-release it); and it has allowed numerous researchers to study the biases and other problematic patterns in the dataset.

All of this is only possible because LAION makes this dataset publicly available, which is a tremendous service to public understanding of this rapidly evolving technology. It is worth remembering that without LAION 5B, AI companies would simply have built their own proprietary datasets, raising all the same questions without exposing any of them to outside scrutiny. In fact, it is probably safe to assume that many of the larger AI companies already have similar proprietary datasets that are shielded from any public or scientific scrutiny. Ironically, Kneschke would have had no way of knowing whether any of his images were being used for AI training had LAION not published the dataset.

A layered understanding of AI training

Against this background, it is very welcome to see that the court has arrived at a layered understanding of the process of AI training in its ruling. As Andres Guadamuz points out in his analysis of the decision, the court distinguishes between the following stages in the training of an AI model:

  • “the creation of a data set (which is the sole subject of the dispute here) that can also be used for AI training,
  • on the other hand, the subsequent training of the artificial neural network with this data set and
  • thirdly, the subsequent use of the trained AI for the purpose of creating new image content.”

As Andres further points out, only the first two relate to the actual training process: the third occurs after the model has been trained and deployed. However, by distinguishing between the creation of data sets and the subsequent training of models, the court provides an analytically useful framework for increasing public understanding of AI models.

For the reasons outlined above, it is in the public interest to allow non-profit scientific research organisations (in whatever form) to build public training datasets, even if those datasets can subsequently be used by for-profit entities. The layered perspective helps in understanding how copyright law should regulate this.

Ideally, the creation of public datasets would be fully covered by the Article 3 exception, in order to encourage a much wider practice of making training datasets publicly available. Whether this is really possible under the current version of the exception is somewhat questionable, as it only covers reproductions and does not allow for making available the works used in the context of the exception. As explained above, this was not an issue in the LAION case, but it would become an issue for other types of training datasets (such as Common Crawl and many of its derivatives) that contain portions of copyrighted works in the dataset itself. In this context, it is interesting to note that a small number of EU Member States – including Slovenia and Bulgaria – have implemented Article 3 in a way that also covers the making available of the results of TDM.

But what about for-profit model training?

Based on the court's layered understanding, it also seems clear that Kneschke sued the wrong entity. Instead of directing his suit at LAION, he should have directed it at Stability.ai (or any other for-profit entity that uses the dataset to train its models). Based on the court's analysis – and in particular the fact that the court considers the rights reservation in the terms of service of the platform where the photo in question was originally published to be machine-readable within the meaning of Article 4(3) of the CDSM Directive – it seems clear that Kneschke would have been more successful in such a scenario. Bringing a case against Stability would probably be a logical next step for Kneschke, although it would raise a different set of questions about applicable law, given that Stability.ai is based outside of Europe.
