People are unable to detect over one in four “deepfake” speech samples that can be employed by criminals, according to new research.
The British study is the first to assess human ability to detect artificially generated speech in a language other than English. Deepfakes are synthetic media intended to resemble a real person’s voice or appearance. They have been used by criminal gangs to con people out of large sums of cash.
Deepfakes fall under the category of generative artificial intelligence (AI), a type of machine learning (ML) that trains an algorithm to learn the patterns and characteristics of a dataset, such as video or audio of a real person, so that it can generate new sound or imagery resembling the original.
While early deepfake speech algorithms may have required thousands of samples of a person’s voice to be able to generate original audio, the latest pre-trained algorithms can recreate a person’s voice using just a three-second clip of them speaking.
Tech giant Apple recently announced software for iPhone and iPad that allows a user to create a copy of their voice using 15 minutes of recordings.
Researchers at University College London (UCL) used a text-to-speech (TTS) algorithm trained on two publicly available datasets, one in English and one in Mandarin, to generate 50 deepfake speech samples in each language.
The samples were different from the ones used to train the algorithm to avoid the possibility of it reproducing the original input.
The artificially generated samples and genuine samples were played for 529 participants to see whether they could distinguish real speech from fake.
Participants were only able to identify fake speech 73 percent of the time, according to the findings published in the journal PLOS ONE.
The research team said that the figure improved only slightly after participants received training to recognize aspects of deepfake speech.
English and Mandarin speakers showed similar detection rates, although when asked to describe the speech features they used for detection, English speakers more often referenced breathing, while Mandarin speakers more often referenced cadence, pacing between words, and fluency.
“Our findings confirm that humans are unable to reliably detect deepfake speech, whether or not they have received training to help them spot artificial content,” said the study’s first author, Kimberly Mai.
“It’s also worth noting that the samples that we used in this study were created with algorithms that are relatively old, which raises the question of whether humans would be less able to detect deepfake speech created using the most sophisticated technology available now and in the future,” said Ms. Mai, a machine learning PhD student at UCL.
The research team now plans to develop better automated speech detectors.
Though there are benefits from generative AI audio technology, such as greater accessibility for those whose speech may be limited or who may lose their voice due to illness, there are growing fears that such technology could be used by criminals and nation-states to cause significant harm to individuals and societies.
Documented cases of deepfake speech being used by criminals include one 2019 incident where the CEO of a British energy company was convinced to transfer hundreds of thousands of pounds to a false supplier by a deepfake recording of his boss’s voice.
“With generative artificial intelligence technology getting more sophisticated and many of these tools openly available, we’re on the verge of seeing numerous benefits as well as risks. It would be prudent for governments and organizations to develop strategies to deal with abuse of these tools, certainly, but we should also recognize the positive possibilities that are on the horizon,” said the study’s senior author, Professor Lewis Griffin of UCL.
Produced in association with SWNS Talker
Edited by Eunice Anyango Oyule and Judy J. Rotich