This op-ed originally appeared in The Hill on April 22, 2023.
On May 11, the European Parliament voted in favor of the EU Artificial Intelligence Act. With the sudden introduction and popularity of generative AI tools such as ChatGPT, DALL-E, Google Bard and Stable Diffusion, it will come as no surprise that EU lawmakers made some tweaks to the document in recent weeks to address these technologies.
Most notably, the amendments included new transparency and disclosure requirements for large language models, a type of AI model classified under “general-purpose AI systems” that uses large datasets and machine learning to understand and generate content.
Generative AI technologies such as ChatGPT and DALL-E are trained on datasets consisting of publicly available data scraped from the internet. As such, artists and creators have for some time been raising concerns that generative AI systems are “stealing” original work, and that the companies developing those systems offer no compensation in return.
The aim of this regulation is to protect creators and copyright holders. But it may inadvertently impair American AI firms operating in the European market. American AI companies looking to offer services to EU citizens should therefore take note and explore alternative methods of intentional, careful data collection.
Article 28b 4(c) of the EU AI Act states that providers of generative AI systems shall “without prejudice to national or Union legislation on copyright, document and make publicly available a sufficiently detailed summary of the use of training data protected under copyright law.” The problem is that identifying each individual segment of training data or generated content, down to each single image, is effectively impossible. GPT-3, for example, was trained on 45 terabytes of text data; datasets that large and diverse make tracing specific data segments infeasible.
In a September 2022 Forbes interview, MidJourney founder David Holz stated that “there isn’t really a way to get a hundred million images and know where they’re coming from.” He went on, “There’s not a registry.” If this law passes, AI companies in the U.S. and other countries around the world could be in trouble. Non-compliance could lead to fines of up to 30 million euros or 6 percent of annual turnover, whichever is greater. Consequently, it strongly behooves AI developers to devise ways to document their training data.
A quick glance at this section of the act suggests that copyright owners and original creators could perhaps be fairly compensated for their original works. But the broad language of the obligations makes it difficult to discern how detailed companies will have to be in their summaries. It is thus unclear how creators will know for sure whether their work is being used in a training dataset. This could result in unnecessary, baseless lawsuits, particularly if compliant companies are over-inclusive in their “detailed summary of the use of training data.” As Riede, Pratt, and Hofer point out, “it is hard to know what constitutes a ‘sufficiently detailed summary of the use of training data’ and how often that summary needs to be updated.” Additional specifics should be added to this section of the act to avoid such outcomes.
Furthermore, the discrepancies between EU and U.S. copyright laws would likely result in confusion and inconsistencies among companies trying to comply. The EU does not have a copyright registry, and anyone who “create[s] literary, scientific and artistic work…automatically ha[s] copyright protection, which starts from the moment [they] create [their] work.” That means companies building large language models must be wary of any content they pull from EU creators. The U.S., on the other hand, has a formal copyright registration process, and not every work is eligible for protection. In particular, on March 16, 2023, the U.S. Copyright Office released a statement that AI-generated works were not eligible for copyright. This statement underscores the considerable difficulty of determining which types of content are truly protected under copyright law.
Litigation has already begun: OpenAI, Microsoft, and GitHub were sued last November for copyright violations. AI art tools such as Stable Diffusion and MidJourney have recently been the target of copyright lawsuits as well. If the EU AI Act comes into effect, these companies will likely face even more frequent lawsuits with far heftier penalties.
As the EU AI Act becomes reality, American AI companies must begin to carefully study its implications for their own operations. It remains to be seen whether the act’s transparency requirements will spur new innovation in efforts to comply. A more flexible approach to regulating generative AI models would be preferable for U.S. businesses, as well as for the Western market overall. The divergences between EU and U.S. copyright laws do not create a favorable landscape for firms trying to operate in both regions, and the latest amendments to the AI Act worsen this dynamic.
AI firms will have to steel themselves, as these looming EU regulations threaten to reshape the very foundation of their existence.