In recent years, artificial intelligence has taken an increasingly important place in our lives. With the popularisation of generative models that create images from text, or of conversational AI such as ChatGPT, BERT or LLaMA, and their integration into everyday devices like our smartphones and computers, it is hard to avoid. Nevertheless, multiple works, starting as far back as 2013, have pointed out deficiencies and fragilities in the various AI architectures in use. The most common is the adversarial attack, i.e. a small perturbation of the input, imperceptible to the human eye, that drastically changes the decision of the network. Such attacks have proven possible on numerous tasks, from image classification to voice and text recognition (see Figure 1). Large language models also suffer from hallucinations, where the model produces confident but unjustified assertions.
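As an illustration, below is a minimal sketch of such an attack (the classic fast gradient sign method), assuming a generic PyTorch classifier named model and a labelled input (x, y); it is not tied to any specific tool mentioned in this article.

# Minimal sketch of an adversarial attack (FGSM), assuming an arbitrary
# PyTorch image classifier `model` and a labelled input (x, y).
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.01):
    """Return x perturbed by epsilon in the direction that increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # A perturbation bounded by epsilon, often imperceptible, can flip the prediction.
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0, 1).detach()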
Over the last few years, considerable academic work has been devoted to assessing the trustworthiness of AI systems and preventing these faulty behaviours. Indeed, as AI is applied in many domains (medical, energy, transport…), sometimes in critical or high-risk systems, these deficiencies may lead to environmental, material, economic or even human damage. The European AI Act further reinforces this need for evaluation and assessment of AI systems in high-risk environments and paves the way for future certification standards for trustworthy AI.
How can we assess AI systems?
A first approach to evaluating an AI system is to test it extensively. This empirical approach allows one to detect various defects and to measure the reaction of the system to different perturbations. With these methods, we can, for example, evaluate the robustness of an AI decision against noise, blur, changes in luminosity or rotations of the input. Adversarial attacks, as mentioned before, also fit in this framework, and tools like AIMOS can generate a large spectrum of tests to uncover the various biases and fragilities of an AI system.
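As a rough illustration of this kind of empirical testing (a simple sketch, not the AIMOS API), one can apply a few perturbations and measure how often the model keeps its decision; model and images are assumed to be given.

# Hedged sketch of empirical robustness testing: apply simple perturbations
# (noise, brightness change, rotation) and measure how often the model's
# decision stays the same. `model` and `images` are assumed to be provided.
import torch
import torchvision.transforms.functional as TF

perturbations = {
    "noise":      lambda x: (x + 0.05 * torch.randn_like(x)).clamp(0, 1),
    "brightness": lambda x: TF.adjust_brightness(x, 1.3),
    "rotation":   lambda x: TF.rotate(x, angle=10),
}

def stability_rate(model, images):
    """Fraction of inputs whose predicted class survives each perturbation."""
    base = model(images).argmax(dim=1)
    return {
        name: (model(perturb(images)).argmax(dim=1) == base).float().mean().item()
        for name, perturb in perturbations.items()
    }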
While this first approach makes it possible to test a large variety of perturbations of the inputs of an AI system, it gives no guarantee that an untested point will not lead to undesired behaviour. To address this, formal approaches have been developed, with tools like PyRAT, to provide mathematical guarantees of the safety and robustness of an AI system. These methods can formally prove that, for a given set of inputs, the AI will return a given decision, e.g. for an autonomous vehicle: if the speed is between 10 km/h and 30 km/h and the distance to the obstacle is under 500 m, the AI system will answer “Brake”. They can also prove that no adversarial example or perturbation below a certain size can make the AI system change its decision.
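To give an intuition of how such a proof can work (a simplified sketch, not how PyRAT actually operates), interval arithmetic can propagate input bounds through a small ReLU network and check that the “Brake” output always dominates the others; the layers and bounds here are purely illustrative.

# Simplified sketch of formal verification by interval bound propagation:
# propagate input intervals through a small ReLU network and check that the
# "Brake" output is always the largest. Weights and bounds are illustrative.
import numpy as np

def interval_affine(low, high, W, b):
    """Sound bounds of W @ x + b when x lies anywhere in [low, high]."""
    W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
    return W_pos @ low + W_neg @ high + b, W_pos @ high + W_neg @ low + b

def verify_brake(low, high, layers, brake_index=0):
    """Prove that the 'Brake' output dominates for every input in the box."""
    for W, b in layers[:-1]:
        low, high = interval_affine(low, high, W, b)
        low, high = np.maximum(low, 0), np.maximum(high, 0)   # ReLU
    low, high = interval_affine(low, high, *layers[-1])
    # "Brake" is proven if its lower bound beats every other output's upper bound.
    others = [i for i in range(len(low)) if i != brake_index]
    return all(low[brake_index] > high[i] for i in others)

Here, layers would hold the (weight, bias) pairs of the network, and low/high the bounds on the inputs, e.g. speed and distance after normalisation.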
Finally, the assessment of an AI system can also come through explainability, i.e. the ability to explain and interpret the decisions taken by an AI system. Numerous methods exist to make the decision of an AI explicit, among which two approaches stand out: post-hoc explanations and explainability by design. While the former extracts an explanation from an already trained AI model, the latter builds an AI model that is explainable by design, i.e. with built-in explanations that are directly exposed to the user and provided alongside the decision. Such models can be based, for example, on decision trees or logical reasoning. Tools like CaBRNet provide an easy way to build such explainable-by-design models, increasing the trust one can place in them.
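As a very simple illustration of explainability by design (using a decision tree here, rather than the case-based models built with CaBRNet), the learned rules can be shown to the user alongside each prediction.

# Hedged sketch of an explainable-by-design model: a shallow decision tree
# whose decision rules are themselves the explanation given to the user.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

# The learned rules are human-readable and can accompany any decision.
print(export_text(tree, feature_names=load_iris().feature_names))
print("Prediction for first sample:", tree.predict(X[:1]))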
Used in conjunction, all these methods allow a more trustworthy use of an AI system, giving the user some guarantees that the system will work as intended. In the TRUMPET project, we aim to use these methods to add trust on top of the privacy-preserving techniques, providing evidence of the robustness of the decisions taken by the TRUMPET tools and ensuring that the gain in privacy does not result in fragile models.
Augustin Lemesle (CEA)
Photo by Google DeepMind