May 16, 2024

News Collective

Complete New Zealand News World

Microsoft’s new tool simulates the human voice by listening to it for just three seconds

VALL-E can preserve the emotional tone of the original speaker and even simulate its acoustic environment.

Microsoft engineers have developed VALL-E, a new artificial intelligence (AI) tool that can simulate a person’s voice after listening to it for only 3 seconds. The tool builds on an audio compression technology called “EnCodec,” which was developed by Meta (classified in Russia as an extremist organization), its authors reported in a paper that is still pending review.

Microsoft used EnCodec to make text-to-speech (TTS) synthesis sound realistic from a very limited source sample. During the AI training phase, the researchers used 60,000 hours of English speech, which is hundreds of times more than existing systems use.

Advantages

According to its creators, VALL-E shows in-context learning capabilities and can synthesize personalized, high-quality speech from just a 3-second recording. The experimental results show that VALL-E is vastly superior to the latest TTS systems (those not trained on the voice they simulate) in terms of speech naturalness and speaker similarity. Furthermore, they say that VALL-E can preserve the speaker’s emotion and acoustic environment in the synthesized speech.

Shortcomings

Despite its notable achievements, Microsoft researchers have drawn attention to some issues with the tool. In particular, they noted that some words may be unclear, missing, or repeated in the synthesized speech. They also pointed out that it still cannot cover everyone’s voice, especially that of speakers with accents. Finally, they said the variety of speaking styles is insufficient, because LibriLight (the dataset they used for training) is an audiobook dataset in which most utterances are in a reading style.

Risks

Microsoft engineers warned that because VALL-E can synthesize speech that preserves the speaker’s identity, it may carry potential risks of misuse. Examples include spoofing voice identification systems or impersonating a specific speaker to produce deepfakes.

Deepfakes are video, image, or audio files created with artificial intelligence software that very realistically replace the people who appear in the content with the likeness of others.

(With info from RT)