COMMUNIA submission to the multi-stakeholder consultation to future-proof the AI Act

September 17, 2024

by Teresa Nobre

The European AI Office’s public consultation Future-Proof AI Act: Trustworthy General-Purpose AI wraps up tomorrow. This multi-stakeholder consultation is aimed at informing the drafting process of the first Code of Practice, which will detail the AI Act rules for providers of general-purpose AI models and general-purpose AI models with systemic risks.

Earlier this month COMMUNIA expressed its interest in participating in the drawing-up of the first General-Purpose AI Code of Practice. On September 16th, we submitted our response (download as a PDF file) to the consultation. Our answers focus on the questions related to the transparency and copyright-related rules applicable to general-purpose AI models, specifically the questions related to the copyright compliance policy and the template for the summary about the AI model training data. Below we highlight some of our responses to the consultation.

Copyright compliance policy

The AI Act requires providers of general-purpose AI models to put in place a policy to comply with Union law on copyright and related rights, and in particular to identify and comply with text and data mining rights reservations expressed pursuant to Article 4(3) of the DSM Directive.

Legal basis

When looking into the main elements of the copyright compliance policy, the first aspect that needs to be taken into account is that AI training data includes different categories of content. AI model training data can include in-copyright content used under a copyright exception, content subject to a licence agreement, or content made available under an open licence (such as a CC licence). Training data can also include Public Domain content, comprising both subject matter excluded from copyright protection and content that is no longer protected by copyright.

The policy will need to consider those differences when determining the legal basis for scraping and using content for AI training purposes. Needless to say, several compliance challenges are expected at this stage, not least because there is no public repository of Public Domain works and openly licensed works.

Opt-out identifiers

When assessing compliance with opt-outs made by right holders under Article 4(3) of the DSM Directive, model providers are expected to adopt measures to ensure that they recognize the different machine-readable identifiers used to opt out and that they are able to identify whether the opt-out refers to all general-purpose text and data mining or only to certain mining activities (e.g. for purposes of training generative AI models).

Right holders may use location-based identifiers, which apply to a whole domain (e.g. robots.txt or ai.txt), or unit-based identifiers, which apply to individual works or files (e.g. Coalition for Content Provenance and Authenticity (C2PA) or International Standard Content Code (ISCC)). They can also resort to tools specifically designed to register and aggregate right holder opt-outs (e.g. haveibeentrained.com). The choice between identifiers will depend on the right holder’s distribution strategies. Location-based identifiers can only be set by entities that have control over the domains in question, which may not be the actual right holders. Unit-based identifiers allow right holders to reserve rights in a more granular way and regardless of where the files are hosted, which makes them better suited for works that circulate as independent media files.
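To illustrate how a domain-level identifier can be read by a crawler, here is a minimal sketch in Python using the standard library's robots.txt parser. The crawler name ExampleAIBot, the sample robots.txt content and the helper may_crawl are hypothetical and only serve the illustration. The sketch also says nothing about whether a given robots.txt entry amounts to a valid rights reservation under Article 4(3).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the site blocks one AI training crawler
# ("ExampleAIBot" is an invented name) and allows everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network
    # request is needed for this example.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.org/works/photo.html"
    print(may_crawl(ROBOTS_TXT, "ExampleAIBot", url))
    print(may_crawl(ROBOTS_TXT, "SomeOtherBot", url))
```

The example also shows the limitation described above: the instruction covers the whole domain and can only be set by whoever controls it, so it cannot express a reservation for an individual work hosted elsewhere.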

It goes without saying that the lack of convergence of identifiers creates legal uncertainty and increases costs. In order to overcome these challenges, the opt-out processes should be streamlined by agreeing on a small number of standardised identifiers and on a granular vocabulary for opting out, and by creating a public registry for recording opt-outs.

Opt-out effect

The effect of the opt-out is another element that needs to be taken into consideration in the copyright compliance policy. An opt-out from AI training shall force the AI model provider to remove the opted-out work from its training data sets and stop using it to train new AI models, but it shall not require the AI model that has already been trained to unlearn the work. Assuming that the model does not store any copies of the works and that all copyright-relevant acts (i.e. reproduction and extraction of copyrighted content) have already occurred, the opt-out shall not affect the use of that specific AI model.

This means that the compliance policy should record the starting date of the training, after which new opt-outs will no longer have to be complied with. This opt-out cut-off date shall be publicly communicated when the model is released.
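The cut-off logic can be sketched in a few lines of Python. The OptOut record, its field names and the function split_by_cutoff are hypothetical. Treating an opt-out recorded on the training start date itself as still binding is our assumption for the sake of the example, not something settled in the text above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class OptOut:
    """Hypothetical record of a rights reservation for one work."""
    work_id: str
    recorded_on: date


def split_by_cutoff(opt_outs, training_start):
    """Split opt-outs into those binding for the model whose training
    began on training_start and those that only bind future models.

    Assumption: an opt-out recorded on the start date is still binding.
    """
    binding = [o.work_id for o in opt_outs if o.recorded_on <= training_start]
    future_only = [o.work_id for o in opt_outs if o.recorded_on > training_start]
    return binding, future_only


if __name__ == "__main__":
    cutoff = date(2024, 6, 1)  # hypothetical training start date
    records = [
        OptOut("work-a", date(2024, 5, 20)),
        OptOut("work-b", date(2024, 6, 1)),
        OptOut("work-c", date(2024, 7, 15)),
    ]
    binding, future_only = split_by_cutoff(records, cutoff)
    print("Must be excluded from this model's training data:", binding)
    print("Must be excluded from future training only:", future_only)
```

Works in the second list stay in the already trained model, but they must be removed from the training data sets before any new model is trained.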

User rights safeguards

Finally, the compliance policy needs to include strong user rights safeguards to mitigate the risks to freedom of expression and the right to information of the users of AI systems. This requirement should apply when the model provider is also a system provider, and the provider deploys measures to prevent the generation, in the outputs of the model, of copyright infringing content (e.g. automatic content recognition and filtering tools), regardless of whether these measures are based on a legal requirement, contractual obligation or as part of a voluntary commitment.

As we learned during the discussions surrounding the implementation of Article 17 of the DSM Directive, existing tools are efficient at identifying content, but incapable of understanding the context in which content is used and, thus, often fail to recognise perfectly legitimate uses, such as quotations and parodies. Automated measures to prevent infringements must thus comply with strong user rights safeguards, including ex-ante and ex-post safeguards, following the blueprint offered by the best national implementations of Article 17.

Summary about AI training data

The AI Act requires providers to draw up and make publicly available a sufficiently detailed summary about the content used for training of the general-purpose AI model, according to a template provided by the AI Office.

As we stated before, we support a proportionate, realistic, and practical approach to the categories of information that should be presented in the summary. In our view, the summary should include a listing of the primary data collections or sets used and a description of all other data sources used in all stages of model training.

The description of the data sources should contain sufficient technical detail to provide meaningful and comprehensive information for all relevant stakeholders (e.g. creators, users, researchers, data subjects), taking into account that the categories of rights that justify access to this information include not only copyright, but also freedom of expression and information, research rights, privacy and data protection rights, etc.

For each data source, there should be an indication of the type and nature of data, time of collection of the data, and opt-out cut-off dates. Further detail will depend on the information source used. For instance, for scraped online data, the description should include information about the crawling methodology used to obtain the data and a weighted list of the top domains. For licensed data, it would be important to include information about the licensor and also whether the licence is exclusive.
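As an illustration of what a weighted list of top domains could look like in practice, the following Python sketch ranks domains by their share of crawled documents. The function name top_domains and the sample URLs are hypothetical, and weighting by document count is only one possible choice; a provider could equally weight by bytes or tokens.

```python
from collections import Counter
from urllib.parse import urlparse


def top_domains(urls, n=3):
    """Return the n most frequent domains with their share of all URLs.

    Assumption: each URL counts as one document of equal weight.
    """
    # Lower-case the host so that "A.example" and "a.example" are merged.
    counts = Counter(urlparse(u).netloc.lower() for u in urls)
    total = sum(counts.values())
    return [(domain, count / total) for domain, count in counts.most_common(n)]


if __name__ == "__main__":
    crawled = [
        "https://a.example/page1",
        "https://a.example/page2",
        "https://A.example/page3",
        "https://b.example/page1",
    ]
    for domain, share in top_domains(crawled):
        print(f"{domain}: {share:.0%}")
```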

The summary must also indicate the legal measures taken to address risks to parties with legitimate interests, including measures to identify and respect opt-outs and user rights safeguards.

Finally, business confidentiality and trade secrets of providers should not be used to prevent the disclosure of any of this basic information.