Deepfake: threat or opportunity?

An interview with Nicolas Obin.

Published on 29/02/2024 - Updated on 20/03/2024

Reading time 7 min.

First appearing on Reddit in 2017, the deepfake phenomenon has grown every year. Nicolas Obin, associate professor at Sorbonne University and researcher at IRCAM, takes stock of these digital manipulations, which swap faces and bodies, transform voices and even resurrect the voices of the dead.

Could you explain to us what a deepfake is?

Nicolas Obin: The term “deepfake” is a contraction of “deep learning”, which refers to the training of the deep neural networks used in artificial intelligence, and “fake”. The first deepfakes appeared less than ten years ago: they were digital manipulations that superimposed the identity of a person, generally a public figure, onto scenes from pornographic films.

Since then, numerous hyper-faked videos using artificial intelligence techniques have flourished on the web, for example, a fake Barack Obama presenting a warning about the risks of deepfakes, or a fake Donald Trump announcing the eradication of AIDS.

You work at IRCAM, a center created under the leadership of Pierre Boulez, dedicated to scientific research and technological innovation for musical creation. How did you become interested in deepfakes?

N. O.: My main work doesn't revolve around deepfakes. IRCAM's missions are essentially linked to the development of new technological means of expression for sound and musical creation. In my research, I am interested in the generative modeling of human behavior, particularly sound and voice. Historically, we built algorithms that could, for example, transform a male voice into a female voice in real time, make it younger, older, etc. With the rise of deep neural networks around 2015, realism took on another dimension.

In 2018, Google researchers produced a synthetic voice considered as natural as a human voice. But at the time, they had to rely on nearly 20 hours of voice recording. Today, we are able to clone a voice identity with even greater realism from an extract of just 5 to 10 seconds. This shift towards ultra-realism means that these new sound and audiovisual generations potentially pose security problems.

Our expertise in this area is of interest to cybersecurity specialists who want to improve the AI systems that detect such fakes. Generation systems and detection systems improve in tandem: the better the counterfeiter becomes, the better the detector must become, and vice versa. This is why we have recently developed collaborations on this subject.

What are the potential dangers of deepfakes, apart from obvious cases of information manipulation?

N. O.: As a semantic manipulation of audiovisual content, deepfakes can be used for malicious purposes such as identity theft. Two properties of modern AI make deepfakes particularly dangerous: on the one hand, the realism of the generated content, made possible by the combination of efficient learning algorithms and the mass of data available for training; on the other hand, the democratization of these tools through shared resources (voice models of singers, for example, are freely accessible on communication channels).

Today everyone constantly publishes personal data on networks that are largely publicly accessible. As a result, anyone is susceptible to being a victim of a deepfake. However, public figures are much more exposed because of the quantity of freely accessible data. These malicious attacks can then prove critical in the case of government representatives, as we recently saw with the false declarations of Volodymyr Zelensky or Joe Biden.

But there are other, more pernicious manipulations, such as the manipulation of emotions, which targets our mental states. For example, a voice assistant could interact with us emotionally or expressively and influence our behavior by playing on our emotions, for instance by encouraging us to buy something. In politics, the same speech could be addressed to each citizen with variations of tone adapted to obtain an optimal persuasive effect.

Do you think deepfakes could be used for more positive creative uses?

N. O.: Without a doubt. An artist doesn't limit themselves to the realistic (or ultra-realistic) imitation of reality; on the contrary, they seek to sublimate it to create new, singular, unheard-of or unseen worlds. For this, artists have always used every means at their disposal (natural or artificial). In that respect, the possibilities offered by AI and the spectacular renderings of deepfakes constitute a tremendous opportunity for creation.

At IRCAM, we have already produced various applications for the recreation of historical figures, from the poetic film Marilyn by Philippe Parreno, to Dalida's interview by Thierry Ardisson, or even General de Gaulle's speech reinterpreted with François Morel. Of course, we do not claim to have recreated a historical archive. In the case of General de Gaulle's speech, for example, we transferred timbres from recordings of de Gaulle's voice, but it was François Morel who performed the interpretation. Even when we say the same words, the way we say them, with a breath or a sigh, is highly significant and can change the interpretation of the message.

How can we spot deepfakes and protect against their spread?

N. O.: Due to their ultra-realism, it's becoming more and more difficult, if not impossible, to distinguish the real from the fake. There may still be clues, such as distortions or inconsistencies in lip synchronization or between facial expressions, but they are becoming more and more subtle. That said, any manipulation leaves a characteristic trace, even if it is imperceptible to a human being. Detecting these traces with AI requires finding and identifying them. The problem is that there is a wide variety of generation algorithms, which greatly increases the complexity of identifying them. And since the algorithm used for generation is unknown when we try to identify a deepfake, it becomes extremely difficult to provide a universal detection solution that is robust to all forms of attack.

What do your DeTOX and BRUEL projects propose in terms of fighting against video and audio deepfakes?

N. O.: The DeTOX project is carried out in partnership with Eurecom (an engineering school and digital sciences research center). Instead of trying to develop a universal solution that is too complex to implement and ultimately unreliable, our goal is to offer personalized deepfake detection for famous people who are likely to be targeted.

The BRUEL project, carried out in collaboration with the Avignon computer laboratory, Eurecom, the Atomic Energy Commission and the national judicial police center, is interested in the detection of audio deepfakes. It explores the possibilities of combining manipulation detection with speaker authentication.
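As a purely illustrative sketch (not the BRUEL project's method, which the interview does not describe), one simple way to combine the two checks is to accept a voice only if it both matches the claimed speaker and does not look manipulated. The function name, the score ranges and both thresholds below are assumptions made for the example.

```python
def accept_claim(speaker_score: float, spoof_score: float,
                 speaker_threshold: float = 0.7,
                 spoof_threshold: float = 0.5) -> bool:
    """Hypothetical decision rule combining two independent scores.

    speaker_score: similarity to the claimed speaker, assumed in [0, 1]
                   (higher means a closer match).
    spoof_score:   estimated likelihood that the audio was manipulated,
                   assumed in [0, 1] (higher means more suspicious).
    Both thresholds are arbitrary illustrative values.
    """
    matches_speaker = speaker_score >= speaker_threshold
    looks_genuine = spoof_score < spoof_threshold
    # A cloned voice may match the speaker well, so both checks must pass.
    return matches_speaker and looks_genuine


# A good clone: strong speaker match, but flagged as manipulated.
clone_accepted = accept_claim(speaker_score=0.9, spoof_score=0.8)
# A genuine recording: strong match and low manipulation score.
genuine_accepted = accept_claim(speaker_score=0.9, spoof_score=0.1)
```

The point of requiring both conditions is that a successful voice clone is, by design, close to the target speaker, so speaker authentication alone would not reject it.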

Furthermore, we are trying to define a grid that ranks attacks according to how complex they are to carry out (in terms of expertise, equipment and logistical resources). This makes it possible, for example, to distinguish an attack carried out by an individual without particular expertise, using freely accessible and easy-to-use tools, from attacks carried out by states, which require strong expertise and substantial resources. We then want to evaluate the reliability of the detection algorithms depending on the level of attack.
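To make the idea of evaluating detectors per attack level concrete, here is a minimal sketch under stated assumptions. The tier names, the `Trial` record and the sample outcomes are all hypothetical and invented for illustration; the interview does not specify the actual grid or any results.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical attack tiers, ordered from least to most resources required.
LEVELS = ("individual", "organized_group", "state")


@dataclass(frozen=True)
class Trial:
    level: str      # attack tier at which the fake sample was produced
    detected: bool  # True if the detector flagged the sample as fake


def detection_rate_by_level(trials):
    """Return {level: fraction of fake samples detected} for tiers with data."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t in trials:
        if t.level not in LEVELS:
            raise ValueError(f"unknown attack level: {t.level}")
        totals[t.level] += 1
        hits[t.level] += int(t.detected)
    return {lvl: hits[lvl] / totals[lvl] for lvl in LEVELS if totals[lvl]}


# Made-up outcomes for illustration only, not real measurements.
trials = [
    Trial("individual", True), Trial("individual", True),
    Trial("individual", True), Trial("individual", False),
    Trial("state", True), Trial("state", False),
]
rates = detection_rate_by_level(trials)
print(rates)
```

Reporting one rate per tier, rather than a single overall figure, shows whether a detector that handles low-effort attacks well still holds up against better-resourced ones.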

What are your projections for the evolution of deepfake technology in the coming years?

N. O.: Although proposals have been made to block deepfakes at the source, from the moment content is captured, by creating a sort of seal of authenticity, the main response is, in my opinion, not technological: a broad education campaign on the risks of digital technology must be considered.

Faced with the massive surge of false information on our social networks, their speed of propagation and their ever-increasing realism, any data mediated by digital technology must be subject to caution and the exercise of systematic doubt. We must learn to verify information, for example by cross-checking it with other sources.

Nicolas Obin

Associate professor at the Faculty of Science and Engineering at Sorbonne University and researcher in the Analysis and Synthesis of Sounds team in the Music and Sound Sciences and Technologies laboratory (co-supervised by IRCAM, CNRS, Sorbonne University, Ministry of Culture), Nicolas Obin is interested in communication between humans, animals, and robots, especially vocal communication, and in the modeling of human behavior. He specializes in the generative modeling of sound signals, in particular for the simulation of complex human productions (singing, speech, music), with various applications in the synthesis and transformation of speech, such as vocal assistants, animation of virtual agents, humanoid robotics and deepfakes.

Responsible for the Intelligent Systems Engineering (ISI) master's degree and co-responsible for the "Deep Learning through practice" professional training program, he teaches digital audio signal processing, deep learning, and biometrics.

In 2020, he founded DeepVoice, a Parisian event centered on voice technologies and artificial intelligence; in 2021, SophIA, the Sorbonne University student association for Artificial Intelligence, in collaboration with SCAI; and, in 2022, "Fast-Forward," informal and experimental meetings on science, technology and sound design in cinema, which bring together the community of cinema sound designers to imagine the sound practices of the future.

Involved in the promotion of digital sciences and technologies for the arts, culture and heritage, he has collaborated with musicians and artists such as Eric Rohmer, Philippe Parreno, Roman Polanski, Leos Carax, Georges Aperghis and Alexander Schubert.