At Barcelona’s Sónar festival last fall, artist and researcher Mat Dryhurst stepped up to the microphone and began to sing, but the voice of his wife—the electronic musician and technologist Holly Herndon—came out instead. When Dryhurst giggled, the sound was unmistakably hers, high and clear like a bell—and not, as far as anyone could hear, some kind of electronic trick, but as seemingly real as the sound of any human larynx can be.
The performance was part of a demonstration of Holly+, Herndon’s latest experiment in artificial intelligence, which takes one sound and, through the magic of a neural net, turns it into another. Imagine Nicolas Cage and John Travolta swapping visages in Face/Off, only this time it’s their voices that trade places.
The effect—watching Herndon’s voice emit from Dryhurst’s mouth—was uncanny. It was also a likely sign of things to come, of a world of shapeshifting forms looming on the horizon: identity play, digital ventriloquism, categories of art and artifice we don’t even have names for yet. The audiovisual forgeries known as deepfakes have been around since the late ’10s, and the technique is becoming increasingly common in pop culture; just this month, Kendrick Lamar’s “The Heart Part 5” video eerily morphed the rapper’s likeness into the faces of O.J. Simpson, Will Smith, and Kanye West.
But the real-time aspect of Holly+ feels new. Created in collaboration with AI researchers and instrument builders Never Before Heard Sounds, Holly+ is a vocal model, a species of deep neural network, that has been trained on Herndon’s voice. To make it, she recorded hours and hours of speaking and singing, from which the system learned to synthesize her vocal timbre. Feed the system a line of text or a snippet of audio, and it regurgitates the sound back in Herndon’s voice.
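To make the idea concrete, here is a minimal sketch of what running audio through a trained vocal model might look like, assuming an audio-to-audio setup where a waveform goes in and a re-synthesized waveform in the target timbre comes out. This is not the Holly+ implementation or its API; the checkpoint name, sample rate, and model interface below are hypothetical, for illustration only.

```python
# Hypothetical sketch of timbre transfer with a trained vocal model.
# Not the Holly+ code: the checkpoint path, sample rate, and the model's
# input/output convention (mono waveform in, mono waveform out) are assumptions.
import torch
import torchaudio

SAMPLE_RATE = 22050  # assumed sample rate the vocal model was trained at

# Load a scripted voice-conversion model (hypothetical checkpoint file).
model = torch.jit.load("vocal_model.pt")
model.eval()

# Load the source performance and match the model's expected sample rate.
waveform, sr = torchaudio.load("source_performance.wav")
if sr != SAMPLE_RATE:
    waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)
waveform = waveform.mean(dim=0, keepdim=True)  # mix down to mono

# Run the conversion: the model re-synthesizes the input in the target voice.
with torch.no_grad():
    converted = model(waveform)

torchaudio.save("converted_performance.wav", converted, SAMPLE_RATE)
```

A real-time version, like the one demonstrated at Sónar, would process short buffers of microphone input in a loop rather than a whole file at once, which is where most of the engineering difficulty lies.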
You can try it out right now using the web interface. There are caveats; the online tool doesn’t have the fidelity of what was heard in Herndon’s Sónar presentation. The sound is tinny, crinkly like cellophane, haunting. When the outgoing signal is garbled, it sounds like electronic voice phenomena, or EVP: unintelligible recordings of audio interference ostensibly emanating from the spirit world. Just for fun, I fed it a snippet of Alvin Lucier’s “I Am Sitting in a Room,” a landmark 1969 tape-music composition in which a recorded voice slowly dissolves into the resonant frequencies of the room, and what came back sounded like it might have come from a horror film.