On 2 July, the French data protection supervisory authority – the Commission Nationale de l’Informatique et des Libertés (CNIL) – launched a new public consultation on the development of AI systems. The consultation covers (i) a new series of how-to sheets providing clarifications and recommendations on seven issues at the intersection of AI development and data protection, and (ii) a questionnaire on applying the GDPR to AI models trained with personal data. Below we summarise the main takeaways.
Background
The new consultation follows on from an initial series of recommendations published by the CNIL on 8 April after a first public consultation, during which the CNIL received 43 contributions from stakeholders in the AI ecosystem. That first set of recommendations addressed questions relating to the application of the principles of purpose limitation, data minimisation and storage limitation to the development of AI systems, and clarified certain rules applicable to scientific research, the re-use of databases and the carrying out of data protection impact assessments. In particular, the first series of recommendations provided guidance, in the form of an initial set of how-to sheets and worksheets, on how to:
- determine the applicable legal regime;
- define a purpose;
- determine the legal qualification of the actors;
- define a legal basis;
- perform tests and verifications for data reuse;
- carry out an impact assessment if necessary;
- consider data protection when making system design choices;
- take data protection into account in data collection and management.
How-to sheets on the development of artificial intelligence systems (second series)
As part of the new consultation, the CNIL is seeking feedback on a new series of how-to sheets, which supplements the first set of recommendations described above. This new series focuses on the following seven issues:

Legal basis of legitimate interest and the development of AI systems. The CNIL recognises that legitimate interest is the most common legal basis for the development of AI systems, but notes that relying on it requires an assessment of the risks to individuals and may call for specific safeguards to protect individuals and their data. In the how-to sheet, the CNIL provides helpful examples of interests that could be presumed legitimate in the development of AI systems (e.g., carrying out scientific research, developing new systems and functionalities for the users of a service, or improving a product or service to increase its performance) and notes that the commercial purpose of developing an AI system is “not in itself contradictory to the use of the legal basis of legitimate interests”. The CNIL also proposes concrete steps for data controllers to satisfy the legitimate-interest condition that the objective pursued does not override individuals’ rights, including when using web-scraping techniques or publishing AI models in open source (see sheets 2 and 3 below). The CNIL’s acknowledgment that legitimate interests can be relied on as a lawful basis for the development/deployment of AI models/systems is helpful, as it demonstrates that the CNIL – unlike some other data protection authorities – does not interpret the notion of “legitimate interests” narrowly. It also confirms that controllers can rely on this basis even where the purpose of the processing is commercial (see, for example, the EC’s letter criticising the Dutch DPA’s position that legitimate interests cannot be relied on for purely commercial interests, here).

Legitimate interest: focus on open-sourcing models. In this how-to sheet, the CNIL recognises that open-source AI models can bring significant benefits both to the controller and to individuals – whether their data are used in the development/deployment phase or they are users – including by increasing transparency about how AI models/systems function and enabling their discussion and peer review. However, open-sourcing also accentuates certain risks, such as malicious use and security risks. The CNIL therefore recommends measures that should be implemented so that open-source distribution can be taken into account in the legitimate-interest assessment, including a sufficient level of transparency, legal measures (e.g., restrictive licences) to limit model re-use, technical data-security measures, and measures to ensure that data subjects are effectively informed and can exercise their rights.

Legitimate interest: focus on web scraping. In the absence of a specific legal framework for web scraping, the CNIL restates in this how-to sheet the obligations and conditions controllers must satisfy to process data collected through web scraping for the development of AI systems. In particular, the how-to sheet notes that (i) certain measures are mandatory under the principle of data minimisation, including defining precise collection criteria in advance and applying filters to exclude the collection of unnecessary data categories, and (ii) additional safeguards are generally required to achieve the balance necessary under the legitimate-interest basis, including excluding data collection from certain websites that contain particularly intrusive data or that clearly object to web scraping/re-use for AI training, creating a “push-back list”, offering data subjects an option to object, and applying anonymisation or pseudonymisation processes. In what could be seen as an important addition to the GDPR’s requirements, the CNIL also proposes setting up a centralised voluntary register of companies that use scraping tools for data collection, in order to inform data subjects and enable them to exercise their GDPR rights against the controller. However, the fact that the CNIL does not prohibit the use of web-scraping techniques for GenAI purposes is helpful and resonates with other recent guidelines/opinions issued across Europe; see in particular the EDPB Report of the work undertaken by the ChatGPT Taskforce (here) and the first chapter of the UK ICO’s GenAI consultation paper (here).

Informing data subjects. The CNIL provides guidance in this how-to sheet on (i) what information companies that use personal data to develop AI systems should provide to data subjects (both generally and in cases of indirect data collection, in particular where re-using a dataset or an AI model subject to the GDPR and where scraping websites), including model-specific information, (ii) when such information should be provided, generally inviting organisations, as a matter of good practice, to observe a “reasonable period of time” between informing individuals that their data are contained in a training dataset and the training of a model, (iii) how to provide the information, including measures to ensure its accessibility and intelligibility, (iv) derogations under which providing the information is not mandatory (e.g., where the data subject already has the information or where providing it would require disproportionate effort), and (v) best practices for increased transparency around the development-stage processing (e.g., publishing any DPIA and documentation concerning the creation of the dataset and the development process). These recommendations do not materially depart from previous European guidance and case law on transparency requirements under the GDPR, including in the context of GenAI.

Respecting and facilitating the exercise of data subjects’ rights. In this how-to sheet, the CNIL provides practical guidance on (i) how to respond to data subject rights requests with respect to both the training dataset and the AI model/system – e.g., the right of access, the right to receive a copy of the training data, the right to supplement or rectify training data, and the rights to rectification, objection or erasure with respect to the model – including how to address difficulties in identifying data subjects and issues relating to re-training, and (ii) cases in which derogations from the exercise of rights over datasets or the AI model could apply.

Annotating data. The CNIL also provides guidance on data annotation (i.e., the process of assigning labels to each data point to serve as the “ground truth” from which the model learns to process, classify or discriminate between data). This how-to sheet focuses on (i) the implications of annotation for individuals’ rights and freedoms, including the principles of data minimisation and accuracy, (ii) measures to ensure the quality of the annotation (e.g., defining an annotation protocol and involving an ethics officer), (iii) how to inform individuals of the annotation operations and enable them to exercise their rights over annotations, and (iv) additional considerations where annotation involves sensitive data.

Ensuring the secure development of an AI system. With respect to the GDPR obligation to ensure the security of personal data processing, the CNIL sets out in this how-to sheet (i) a methodological approach for managing the security of AI system development, (ii) the main security objectives to pursue when developing an AI system, (iii) the risk factors to take into account, some of which are AI-specific, and (iv) recommended measures to bring the residual risk to an acceptable level.
Questionnaire on the application of the GDPR to AI models
The CNIL also published, as part of the consultation, a questionnaire on the application of the GDPR to AI models. The questionnaire focuses in particular on the risks of memorisation, regurgitation and extraction of personal data from AI models, which may make it possible to (re-)identify the individuals whose data were used to train the model. Through the questionnaire, the CNIL invites stakeholders to provide their insights on the conditions under which AI models can be considered anonymous or must instead remain subject to the GDPR, and on the consequences of that qualification.
Next steps
The new consultation will remain open until 1 September 2024. The contributions will then be analysed at the close of the public consultation, with a view to publishing the final recommendations on the CNIL website in the course of 2024.