Before the trilogue, COMMUNIA issued a statement calling for a comprehensive approach to the transparency of training data in the Artificial Intelligence (AI) Act. COMMUNIA and the co-signatories of that statement support more transparency around AI training data, going beyond data that is protected by copyright. It is still unclear whether the co-legislators will be able to pass the regulation before the end of the current term. If they do, proportionate transparency obligations are key to realising the balanced approach enshrined in the text and data mining (TDM) exception of the Copyright Directive.
How can transparency work in practice?
As discussed in our Policy Paper #15, transparency is key to ensuring a fair balance between the interests of creators on the one hand and those of commercial AI developers on the other. A transparency obligation would empower creators, allowing them to assess whether the copyrighted materials used as AI training data have been scraped from lawful sources, as well as whether their decision to opt-out from AI training has been respected. At the same time, such an obligation needs to be fit-for-purpose, proportionate and workable for different kinds of AI developers, including smaller players.
While the European Parliament’s text has taken an important step towards improving transparency, it has been criticised for falling short in two key aspects. First, the proposed text focuses exclusively on training data protected under copyright law, a limitation that arbitrarily narrows the scope of the obligation and may not even be technically feasible, since separating protected from unprotected works at scale is impractical. Second, the Parliament’s text remains very vague, calling only for a “sufficiently detailed summary” of the training data, which could lead to legal uncertainty for all actors involved, given how opaque the copyright ecosystem itself is.
As such, we are encouraged to see the recent work of the Spanish presidency on the topic of transparency, improving upon the Parliament’s proposed text. The presidency recognises that there is a need for targeted provisions that facilitate the enforcement of copyright rules in the context of foundation models and proposes that providers of foundation models should demonstrate that they have taken adequate measures to ensure compliance with the opt-out mechanism under the Copyright Directive. The Spanish presidency has also proposed that providers of foundation models should make information about their policies to manage copyright-related aspects public.
This proposal marks an important step in the right direction by expanding the scope of transparency beyond copyrighted material. Furthermore, requiring providers to share information about their policies to manage copyright-related aspects could provide important clarity as to which opt-out methods are being honoured, giving creators certainty that their choices to protect works from TDM are being respected.
In search of a middle ground
Unfortunately, while the Spanish presidency has addressed one of our key concerns by removing the limitation to copyrighted material, ambiguity remains. Calling for a “sufficiently detailed summary” of the content of training data leaves a lot of room for interpretation and may lead to significant legal uncertainty going forward. Having said that, strict and rigid transparency requirements which force developers to list every individual entry inside a training dataset would not be a workable solution either, due to the unfathomable quantity of data used for training. Furthermore, such a level of detail would provide no additional benefits when it comes to assessing compliance with the opt-out mechanism and the lawful access requirement. So what options do we have left?
First and foremost, the reference to a “sufficiently detailed summary” must be replaced with a more concrete requirement. Instead of focusing on the content of training datasets, this obligation should focus on the copyright compliance policies followed during the scraping and training stages. Developers of generative AI systems should be required to provide a detailed explanation of their compliance policy, including a list of websites and other sources from which the training data has been reproduced and extracted, and a list of the machine-readable rights reservation protocols and techniques that they have complied with during the data gathering process.

In addition, the AI Act should allocate the responsibility to further develop transparency requirements to the to-be-established Artificial Intelligence Board (Council) or Artificial Intelligence Office (Parliament). This new agency, which will be set up as part of the AI Act, must serve as an independent and accountable actor, ensuring consistent implementation of the legislation and providing guidance for its application. On the subject of transparency requirements, an independent AI Board/Office would be able to lay down best practices for AI developers and define the granularity of information that needs to be provided to meet the transparency requirements set out in the Act.
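To make the idea of machine-readable rights reservation concrete, here is a minimal sketch of how a developer might check two common opt-out signals before mining a source: a site’s robots.txt rules and the TDM Reservation Protocol (TDMRep) `tdm-reservation` HTTP response header. This is an illustration, not a definitive compliance implementation: the function names and the `ExampleBot` user agent are invented for the example, and a real pipeline would also need to check HTML meta tags, site-wide TDMRep policy files and other reservation techniques.

```python
# Illustrative sketch only: checks two machine-readable opt-out signals
# (robots.txt and the TDMRep "tdm-reservation" HTTP header) before mining.
# Function names and the "ExampleBot" user agent are hypothetical.
from urllib import robotparser


def robots_allows(robots_lines, user_agent, url):
    """Parse a robots.txt body and check whether crawling the URL is allowed."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch(user_agent, url)


def tdm_reserved(headers):
    """Under the TDM Reservation Protocol, a 'tdm-reservation: 1' response
    header signals that TDM rights in the content are reserved."""
    return headers.get("tdm-reservation") == "1"


def may_mine(robots_lines, headers, user_agent, url):
    """A source may be mined only if robots.txt permits crawling AND
    no TDMRep reservation has been declared for the content."""
    return robots_allows(robots_lines, user_agent, url) and not tdm_reserved(headers)


# Example: a site that blocks /private/ and, in one case, reserves TDM rights
robots = ["User-agent: *", "Disallow: /private/"]
print(may_mine(robots, {"tdm-reservation": "1"}, "ExampleBot", "https://example.org/page"))  # rights reserved
print(may_mine(robots, {}, "ExampleBot", "https://example.org/page"))  # crawling allowed, no reservation
print(may_mine(robots, {}, "ExampleBot", "https://example.org/private/x"))  # disallowed by robots.txt
```

A transparency obligation of the kind proposed above would simply require developers to disclose which of these signals their pipeline checks and honours, rather than to enumerate every item in the training set.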
We understand that the deadline to find an agreement on the AI Act ahead of the next parliamentary term is very tight. However, this should not be an excuse for the co-legislators to rush the process by taking shortcuts through ambiguous language purely to find swift compromises, creating significant legal uncertainty in the long run. In order to achieve its goal of protecting Europeans from harmful and dangerous applications of AI while still allowing for development and encouraging innovation in the sector, and to potentially serve as model legislation for the rest of the world, the AI Act must be robust and legally sound. Anything less would be a wasted opportunity.