Microsoft's new tool simulates the human voice by listening to it for just three seconds

VALL-E can preserve the emotional tone of the original speaker and even simulate its acoustic environment.

Microsoft engineers have developed VALL-E, a new artificial intelligence (AI) tool that can simulate a person’s voice after listening to it for only 3 seconds. The tool is based on an audio compression technology called “EnCodec,” which was developed by Meta (classified in Russia as an extremist organization), its authors reported in a paper that is pending review.

Microsoft used EnCodec technology as a way to make text-to-speech (TTS) synthesis sound realistic, based on a very limited source sample. During the AI training phase, they used 60,000 hours of English speech, which is hundreds of times more than existing systems.

Advantages

According to its creators, VALL-E displays in-context learning capabilities and can be used to synthesize personalized, high-quality speech from just 3 seconds of recording. The results of the experiments show that VALL-E significantly outperforms the latest TTS systems (those not trained on the voice they simulate) in terms of the naturalness of speech and similarity to the speaker. Furthermore, they argue that VALL-E can preserve the speaker’s emotion and acoustic environment in the synthesized speech.

Shortcomings

Despite its notable achievements, Microsoft researchers have drawn attention to some issues with the tool. In particular, they noted that some words may be unclear, missing, or repeated in the synthesized speech. They also pointed out that it still cannot cover everyone’s voice, especially that of speakers with accents. Finally, they argued that the variety of speaking styles is not sufficient, because LibriLight (the dataset they used for training) is an audiobook dataset, in which most utterances are in a reading style.

Risks

Microsoft engineers warned that, because VALL-E can synthesize speech that preserves the speaker’s identity, it may carry potential risks of misuse. Examples include spoofing voice identification or impersonating a specific speaker to produce deepfakes.

Deepfakes are video, image, or audio files created using artificial intelligence software to very realistically replace the people who appear in the content with the likeness of other people.

