This is the fourth and final part of our series on using synthetic data to train AI models. See here for Parts 1, 2 and 3.
What other key legal topics should I consider when using synthetic data to train an AI model?
In addition to IP infringement risks, plenty of other interesting legal questions may of course arise in relation to the use of synthetic data to train AI models. These topics are not covered in further detail in this series, but it is at least worth noting that the use of synthetic data can also mitigate certain specific risks under applicable data protection and privacy laws. On the other hand, using synthetic data to train an AI model could give rise to increased product liability risk, including under the proposed AI Liability Directive.[1]
AI models trained on flawed synthetic data (be it because the data was generated by a flawed upstream AI model and used to train a cluster of downstream AI models, because the real-world data was a flawed sample, or because the data set itself contained errors) will inevitably compound the risk of liability for defective products. Depending on the use case, that risk may be more or less significant.
But either way, this places a premium on the authentication of reliable and trustworthy data (real-world or synthetic) that has been appropriately audited (including to confirm that no latent bias is embedded in the data set) as suitable training material for the relevant AI model and the outputs envisaged.[2]
Conclusion: Is an AI model trained on synthetic data bullet-proof from an IP infringement perspective?
No. Synthetic data has the potential to allow developers to train GenAI models effectively while mitigating, or shifting further up the data input line, certain legal risks associated with the use of real-world data to train such models, at least to an extent under the evolving EU legal framework, including the Copyright Directive and the proposed EU AI Act.
But, unsurprisingly, synthetic data cannot fully shield developers and other players active in this field from IP infringement risks linked to the “real-world” application of existing and forthcoming laws and regulations, including with respect to the training of the synthetic data-generating AI systems themselves and the generation of potentially infringing outputs.
In addition, using synthetic data to train an AI model may not be appropriate or effective in all circumstances. Due to issues such as systemic errors embedded in the synthetic data sets or the risk of “dog-fooding” (whereby the circular process of AI models training on and regurgitating synthetic data sets may lead to irreparable defects in the technology over time), using synthetic training data could also give rise to other ethical, technological (including potential “model collapse”[3]) and legal (including product liability) risks and challenges for AI developers. Such risks and challenges should be kept under review as more and more online content is generated by (or with the assistance of) GenAI, increasing the likelihood of training on synthetic data, even if inadvertently.
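For readers curious about the mechanics behind “model collapse”, the toy Python sketch below (our own illustration, not drawn from the paper cited in footnote [3]) shows the recursion in miniature: a simple statistical model is repeatedly re-fitted only to samples produced by the previous generation of the model, so sampling error accumulates and the fitted distribution gradually loses touch with the original real-world data, particularly in its tails.

```python
# Toy sketch of recursive training on synthetic data ("dog-fooding").
# Each generation fits a Gaussian to samples drawn from the previous
# generation's fit. Because each fit is estimated from a finite sample,
# estimation error compounds across generations and the fitted spread
# tends to drift away from the original distribution, losing the tails.
import random
import statistics

random.seed(0)

# Generation 0: "real-world" data from a standard normal distribution.
data = [random.gauss(0.0, 1.0) for _ in range(100)]

for generation in range(21):
    mu = statistics.fmean(data)      # fitted mean of this generation
    sigma = statistics.stdev(data)   # fitted spread of this generation
    if generation % 5 == 0:
        print(f"generation {generation:2d}: mean={mu:+.3f}, stdev={sigma:.3f}")
    # The next generation trains only on synthetic output of the current fit.
    data = [random.gauss(mu, sigma) for _ in range(100)]
```

Real GenAI models are vastly more complex, but the underlying dynamic is the same: once the training pipeline feeds on its own outputs, defects and distributional drift can become self-reinforcing rather than self-correcting.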
[1] European Commission, Proposal for a Directive of the European Parliament and of the Council on adapting non-contractual civil liability rules to artificial intelligence (AI Liability Directive) of September 28, 2022, COM/2022/496 final.
[2] For a more detailed analysis of the proposed AI Liability Directive, see our earlier blog post here.
[3] See, e.g., I. Shumailov, Z. Shumaylov et al., “The Curse of Recursion: Training on Generated Data Makes Models Forget” (available at https://arxiv.org/abs/2305.17493).