This second part of our four-part series on using synthetic data to train AI models explores how the use of synthetic data training sets may mitigate copyright infringement risks under EU law.
The EU copyright framework applied to training AI models
Using synthetic data to train AI models has the potential to help overcome several legal hurdles faced by AI developers. This is mainly because, as the law stands today, synthetic data would likely not itself be eligible for copyright protection in the EU (although the law is still evolving on this point and the position may eventually vary under national law). Under EU law, a work must be the “author’s own intellectual creation” for copyright to subsist in it. AI-generated works such as synthetic data may be found not to meet this standard since they arguably do not have a human author.[1]
However, to produce synthetic data, an AI model must itself first be trained on real-world data samples: the model learns the relevant patterns, correlations and statistics of the real-world data so that it can generate synthetic data resembling a hypothetical real-world dataset as closely as possible.[2] Once training of the synthetic data-generating model is complete, the model should be able to produce statistically identical synthetic data (i.e., even though the underlying data points differ, the distributions of the two data sets are indistinguishable by statistical inference, with no significant differences in outcome when subjected to statistical analysis). This data set can then be further fine-tuned as needed through human oversight.
Another way to generate synthetic data is to take a short-cut: the AI model strips, augments or scrambles real-world data (with human oversight), which creates an even closer link between the initial real-world sources and the ultimate synthetic data output.[3]
Considering the above, even though training AI models on synthetic data may avoid certain issues associated with training on “real-world” data, the very act of creating that synthetic data may itself require training on real-world sources, triggering those same issues. In other words, synthetic data may not be a complete solution to the potential IP infringement risks identified in Part 1 of this series, but rather a way to push those issues further up the data input line.
Residual copyright infringement risks
Synthetic data itself may also not be entirely shielded from IP infringement claims to the extent it retains, in substantial part, the initial real-world data on which the generating AI system was trained.
Some of these risks may remain largely theoretical in practice, although they could soon be exacerbated by the fact that AI models can be the target of cyber-attacks, including increasingly sophisticated attacks using AI-powered tools developed specifically to extract source data from AI models.[4] So even if it is impossible for a human to “uncover” the potentially copyrighted materials on which an AI model (or the AI model that supplied the synthetic training data) was trained, it may become easier to uncover those source materials by piercing the initial “synthetic data shield” further up the input line.
Finally, an AI model trained on synthetic data may still generate outputs that infringe copyright-protected works, depending on how similar those outputs are to the original works. Whether this specific risk would fall primarily on developers or on users remains to be litigated or regulated, but it cannot be entirely eliminated simply by using synthetic data for training.
Coming up in this Series
Stay tuned until after the New Year’s break: Part 3 of this series will cover the interplay between synthetic data training sets, the EU Copyright Directive and the forthcoming EU AI Act. Finally, Part 4 will explore other key legal topics to be considered when using synthetic data to train an AI model.
You can read Part 1 here.
[1] CJEU case law has confirmed that copyright protection requires a minimum level of creative human input and that the work must reflect the author’s personality. See Case C-145/10 – Painer, para. 92: “[b]y making those various choices, the author of a portrait photograph can stamp the work created with his ‘personal touch’”; and Case C-683/17 – Cofemel, para. 30: “if a subject matter is to be capable of being regarded as original, it is both necessary and sufficient that the subject matter reflects the personality of its author, as an expression of his free and creative choices”. In the United States, the US Copyright Office (“USCO”) has thus far refused to register works generated using AI tools, and at least one U.S. court has held that a work “autonomously generated by an AI system” without human involvement is not protectible by copyright. Thaler v. Perlmutter, No. 22-1564 (D.D.C. Aug. 18, 2023). That decision is currently on appeal, and no U.S. court to date has weighed in on how much human creativity and involvement is necessary to secure copyright protection for a work created by a human aided by GenAI.
[2] MIT Sloan, “The Real Deal About Synthetic Data” (October 20, 2021).
[3] Depending on the processes used and similarity of the data created, this method may be vulnerable to a claim that the resulting data infringes on the exclusive right to make derivative works of any copyrighted works included in the “scrambled” real-world dataset.