Researchers at Microsoft have recently published a report on a text-to-speech system named VALL-E. The system itself does not come as a surprise, given that examples like “Sister Google” have already gained significant popularity among internet users.
However, upon reading the report, we can uncover some surprising and even chilling details. The scientists assert that VALL-E “can be used to synthesize a person’s voice with high quality, using only a 3-second audio clip of an unidentified speaker.”
VALL-E voice generation software has potential but also brings many risks. (Image: Internet)
In other words, Microsoft’s system only needs to hear us speak for 3 seconds to synthesize a voice that closely resembles the original. According to the report, the dataset used to train VALL-E was compiled by Meta (the parent company of Facebook) and consists of 60,000 hours of speech recorded by 7,000 individuals.
Freelance technology journalist Chris Matyszczyk listened to some audio clips and shared his impressions on ZDNet. He heard a male voice speaking for 3 seconds, followed by an 8-second audio clip produced by VALL-E, and remarked that it was difficult to discern which was the human speaker and which was the AI-generated sound.
Although VALL-E’s choice of words does not yet fully mimic human speech, he still found it “scary.”
Most of us are accustomed to automated calls, where a pre-recorded or machine-generated voice speaks on the other end. With a system like VALL-E, machine-generated voices can achieve an unprecedented level of polish.
It is hard to predict what the future holds once malicious actors can use a single phone call to record your voice and then impersonate you to deceive others. The concern is heightened by the researchers’ claim that they can recreate “emotions and sound environments” from just a 3-second recording.
The researchers who created VALL-E do not offer any novel safeguards, suggesting only that the best current approach is to develop a system that detects machine-synthesized voices. There is little point in asking why they are doing this, since in the tech industry the usual answer is “if it can be done, it should be done.”