With Christmas fast approaching, on December 8, the European Parliament wrapped up one of its biggest presents of the mandate: the AI Act, a landmark piece of legislation with the goal of regulating Artificial Intelligence while encouraging development and innovation. In sticking with the holiday theme, the last weeks of the negotiations have included everything from near-breakdowns of the discussions, not too dissimilar to the explosive dynamics of festive family gatherings, to 20+ hour trilogue meetings, akin to last-minute Christmas shopping. But alas, it is done.
One of the key priorities for COMMUNIA was the issue of transparency of training data. In April, we issued a policy paper calling on the EU to enact a reasonable and proportional transparency requirement for developers of generative AI models. We have followed up on that work with several blog posts and a podcast, outlining ways to make the requirement work in practice without placing a disproportionate burden on ML developers.
From our perspective, the introduction of some form of transparency requirement was essential to uphold the legal framework that the EU has for ML training, while ensuring that creators can make an informed choice about whether to reserve their rights or not. Going by leaked versions of the final agreement, it appears that the co-legislators have come to similar conclusions. The deal introduces two specific obligations on providers of general-purpose AI models, which serve that objective: an obligation to implement a copyright compliance policy and an obligation to release a summary of the AI training content.
The copyright compliance obligation
In a leaked version, the obligation to adopt and enforce a copyright compliance policy reads as follows:
[Providers of general-purpose AI models shall] put in place a policy to respect Union copyright law in particular to identify and respect, including through state of the art technologies where applicable, the reservations of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790
Back in November, we suggested that instead of focussing on getting a summary of the copyrighted content used to train the AI model, the EU lawmaker should focus on the copyright compliance policies followed during the scraping and training stages, and should require developers of generative AI systems to release a list of the rights reservation protocols complied with during the data gathering process. We were therefore pleased to see the introduction of such an obligation, with a specific focus on the opt-outs from the general-purpose text and data mining exception.
Interestingly, the leaked version contains a recital in which the co-legislators declare their intent to apply this obligation to “any provider placing a general-purpose AI model on the EU market (…) regardless of the jurisdiction in which the copyright-relevant acts underpinning the training of these foundation models take place”. While one can understand why the EU lawmakers would want to ensure that all AI models released in the EU market respect these EU product requirements, the fact that these are also copyright compliance obligations, which apply prior to the release of the model in the EU market, raises some legal concerns. It is not clear how the EU lawmakers intend to apply EU copyright law when the scraping and training take place outside the EU borders without an appropriate international legal instrument.
The general transparency obligation
The text goes on to require that developers of general-purpose AI models make publicly available a sufficiently detailed summary about the AI training content:
[Providers of general-purpose AI models shall] draw up and make publicly available a sufficiently detailed summary about the content used for training of the general-purpose AI model, according to a template provided by the AI Office
While we have previously criticized the formulation “sufficiently detailed summary” due to the legal uncertainty it could cause, having an independent and accountable entity draw up a template for the summary (as we have previously argued) could alleviate some of the vagueness and potential confusion.
We were also pleased to see that the co-legislators listened to our calls to extend this obligation to all training data. As we have said before, introducing a specific requirement only for copyrighted data would add unnecessary legal complexity, since ML developers would first need to know which of their training materials are copyrightable. At the same time, knowing more about the data that is feeding models that can generate content is essential for a variety of purposes, not all related to copyright.
We should also highlight that the co-legislators appear to share our understanding of how compliance with the transparency requirement could be achieved when AI developers use publicly available datasets. In the leaked version there is a clarifying recital stating that “(t)his summary should be comprehensive in its scope instead of technically detailed, for example by listing the main data collections or sets that went into training the model, such as large private or public databases or data archives, and by providing a narrative explanation about other data sources used.” When the training dataset is not publicly accessible, we maintain that there should be a way to ensure conditional access to the dataset, namely through a data trust, to confirm legal compliance.
Taking these amendments into account, the compromise found by the co-legislators manages to strike a good balance between what is technically feasible and what is legally necessary.