Artificial intelligence (AI) has taken the world by storm, and people’s feelings towards the technology range from fascination about its capabilities to grave concerns about its implications. Meanwhile, legislators across the globe are trying to wrap their heads around how to regulate AI. The EU has proposed the so-called AI Act, which aims to protect European citizens from potentially harmful applications of AI while still encouraging innovation in the sector. The file, which was originally proposed by the European Commission in April 2021, has just entered into trilogues and will be hotly debated over the coming months by the European Parliament and Council.
One of the key issues for the discussions will most likely be how to deal with the rather recent phenomenon of generative AI systems (also referred to as foundational models), which are capable of producing various content ranging from complex text to images, sound, computer code and much more with very limited human input.
The rise of generative AI
Within less than a year, generative AI technology went from having a select few, rather niche applications to becoming a global phenomenon. Perhaps no application represents this development like ChatGPT. Originally released in November 2022, ChatGPT broke all records by reaching one million users within just five days of its release. The closest competitors for this title, namely Instagram, Spotify, Dropbox and Facebook, took several months to reach the same stage. Fast forward to today, approximately half a year later, and ChatGPT reportedly counts more than 100 million users.
One of the reasons for this “boom” of generative AI systems is that they are more than just a novelty. Some systems have established themselves as considerable competitors for human creators for certain types of creative expressions, being able to write background music or produce stock images that would take humans many more hours to create. In fact, the quality of the output of some systems is already so high while the cost of production is so low that they pose an existential risk to specific categories of creators, as well as the industries behind them.
But how do generative AI systems achieve this and what is the secret behind their ability to produce works that can comfortably compete with works of human creativity? Providing an answer to this question, even at surface level, is extremely difficult since AI systems are notoriously opaque, making it nearly impossible to fully understand their inner workings. Furthermore, developers of these systems have an obvious interest in keeping the code of their algorithm as well as the training data used secret. This being said, one thing is for certain: generative AI systems need data, and lots of it.
The pursuit of data
Creating an AI system is incredibly data intensive. Data is needed to train and test the algorithm throughout its entire lifecycle. Going back to the example of ChatGPT, the system was trained on numerous datasets throughout its iterations containing hundreds of gigabytes of data equating to hundreds of billions of words.
With so much data needed for training alone, this raises the question of how developers get their hands on this amount of information. As the sheer numbers make fairly obvious, training data for AI systems is usually not collected manually. Instead, developers often rely on two sources for their data: curated databases which contain vast amounts of data and so-called web crawlers which “harvest” the near boundless information and data resources available on the open internet.
The copyright conundrum
Some of the data available in online databases or collected by web scraping tools will inevitably be copyrighted material, which raises some questions with regard to the application of copyright in the context of training AI systems. Communia has extensively discussed the interaction between copyright and text and data mining (TDM) in our policy paper #15, but here is a short refresher on the clear framework established in the 2019 Copyright Directive:
Under Article 3, research organizations and cultural heritage institutions may scrape anything that they have legal access to, including content that is freely available online, for the purposes of scientific research. Under Article 4, this right is extended to anyone for any purpose, but rights holders may reserve their rights and opt out of text and data mining, most often through machine-readable means.
While this framework, in principle, provides appropriate and sufficient legal clarity on the use of copyrighted materials in AI training, the execution still suffers from the previously mentioned opacity of AI systems and the secrecy around training data: there is no real way for a rights holder to check whether their attempt to opt out of commercial TDM has actually worked. In addition, there’s still a lot of uncertainty about the best technical way to effectively opt out.
Bringing light into the dark
Going back to the EU’s AI Act reveals that the European Parliament recognises this issue as well. The Parliament’s position foresees that providers of generative AI models should document and share a “sufficiently detailed” summary of the use of training data protected under copyright law (Article 28b). This is an encouraging sign and a step in the right direction. The proof is in the pudding, however. More clarity is needed with regard to what “sufficiently detailed” means and how this provision would look in practice.
Policy makers should not forget that the copyright ecosystem itself suffers from a lack of transparency. This means that AI developers will not be able, and therefore should not be required, to detail the author, the owner or even the title of the copyrighted materials that they have used as training data in their AI systems. This information simply does not exist out there for the vast majority of protected works and, unless rights holders and those who represent them start releasing adequate information and attaching it to their works, it is impossible for AI developers to provide such detailed information.
AI developers also should not be expected to know which of their training materials are copyrightable. Introducing a specific requirement for this category of data adds legal complexity that is neither needed nor advisable. For that and other reasons, we recommend in our policy paper that AI developers be required to be transparent about all of their training data, and not only about the data that is subject to copyright.
The fact that AI developers know so little about each of the materials that are being used to train their models should not, however, be a reason to abandon the transparency requirement.
In our view, those that are using publicly available datasets will probably comply with the transparency requirement simply by referring to the dataset, even if the dataset is lacking detailed information on each work. Those that are willing to deposit their training data with a data trust that would ensure the accessibility of the repository for purposes of assessing compliance with the law would probably also ensure a reasonable level of transparency.
The main problem is with those that are not disclosing any information about their training data, such as OpenAI. These need to be forced to provide some sort of public documentation and disclosure, and at the very least need to be able to show that they have not used copyrighted works that have an opt-out attached to them. And that raises the question: how can creators and other rights holders effectively reserve their training rights and opt out of the commercial TDM exception?
Operationalizing the opt-out mechanism
In our recommendations for the national implementation of the TDM exceptions we suggested that the proper technical way to facilitate web mining was the use of a protocol like robots.txt, which creates a binary “mine”/“don’t mine” rule. However, this technical protocol has some significant limitations when it comes to its application in the context of data mining for AI training data.
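To make the binary nature of such a rule concrete, here is a minimal sketch in Python of how a crawler could check a site's robots.txt before mining a page. It uses the standard library's `urllib.robotparser`. The crawler names `ExampleAIBot` and `SomeSearchBot`, the `example.org` URL and the helper `may_mine` are hypothetical and chosen purely for illustration; this is not a description of how any particular AI developer actually collects data.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt published by a rights holder: it blocks one
# (made-up) AI training crawler and allows every other crawler.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_mine(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    # Parse the rules from memory instead of downloading them.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


page = "https://example.org/artwork.html"
print(may_mine(ROBOTS_TXT, "ExampleAIBot", page))   # False: opted out
print(may_mine(ROBOTS_TXT, "SomeSearchBot", page))  # True: allowed
```

The sketch also hints at the limitations: the rule is keyed to a crawler's name and a URL path, so it cannot say what the collected data may be used for, it only works if the rights holder knows which crawler names to list, and it depends entirely on the crawler choosing to honour it.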
Therefore, one of the recommendations in our policy paper is for the Commission to lead these technical discussions and provide guidance on how the opt-out is supposed to work in practice to end some of the uncertainty that exists among creators and other rights holders.
In order to encourage a fair and balanced approach to both the opt-out and the transparency issues, the Commission could convene a stakeholder dialogue and include all affected parties, namely AI developers, creators and rights holders as well as representatives of civil society and academia. The outcome of this dialogue should be a way to operationalise the opt-out system itself and the transparency requirements that will uphold such a system without placing a disproportionate burden on AI developers.
Getting this right would provide a middle ground that allows creators and other rights holders to protect their commercial AI training rights over their works while encouraging innovation and the development of generative AI models in the EU.