FB AI distinguishes multiple speakers simultaneously

Category: IT Technology · Published: 3 years ago

What the research is

We’re introducing a new method to separate up to five voices speaking simultaneously on a single microphone. Our method surpasses previous state-of-the-art performance on several speech source separation benchmarks, including ones with challenging noise and reverberation. On the WSJ0-2mix and WSJ0-3mix data sets, along with newly created variations with four and five simultaneous speakers, our model achieved a scale-invariant signal-to-noise ratio (SI-SNR, a common measure of separation quality) improvement of more than 1.5 dB (decibels) over the current state-of-the-art models.
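To make the SI-SNR metric mentioned above concrete, here is a minimal sketch of how it is commonly computed. It uses only the Python standard library; the function name `si_snr`, the `eps` constant, and the toy signals are illustrative assumptions, not code from the paper.

```python
import math

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences."""
    n = len(target)
    # Remove the mean so a constant offset does not affect the score.
    e_mean = sum(estimate) / n
    t_mean = sum(target) / n
    e = [x - e_mean for x in estimate]
    t = [x - t_mean for x in target]
    # Project the estimate onto the target: the "signal" part of the estimate.
    scale = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    s_target = [scale * b for b in t]
    # Whatever is left after the projection counts as noise.
    noise = [a - s for a, s in zip(e, s_target)]
    signal_power = sum(s * s for s in s_target)
    noise_power = sum(x * x for x in noise)
    return 10 * math.log10((signal_power + eps) / (noise_power + eps))

# Toy data for illustration only (assumed, not from the benchmarks).
target = [1.0, -1.0, 0.5, -0.5]
estimate = [1.1, -0.9, 0.4, -0.6]
louder = [3 * x for x in estimate]  # the same estimate at a different gain
print(si_snr(estimate, target), si_snr(louder, target))
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is what "scale-invariant" refers to.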

To build our model, we use a novel recurrent neural network architecture that works directly on the raw audio waveform. The previous best models use a mask and a decoder to sort out each speaker’s voice. The performance of these kinds of models degrades rapidly when the number of speakers is high or unknown.

As with standard speech separation systems, our model requires knowledge of the total number of speakers in advance. To handle cases where the number of speakers is unknown, we built a novel system that automatically detects the number of speakers and selects the most relevant model.

How it works

The main goal of a speech separation model is, given an input mixture of speech signals, to estimate the underlying sources and output one isolated channel per speaker.

Our model uses an encoder network that maps the input signal to a latent representation. We then apply a voice separation network composed of several blocks, whose input is the latent representation and whose output is an estimated signal for each speaker. Previous methods typically apply a mask when performing separation, which is problematic when the mask is not well defined, and some signal information may be lost in the process.

We trained the model to directly optimize SI-SNR, using several loss functions combined through permutation invariant training, which scores the outputs under the best matching of output channels to reference speakers. We inserted a loss function after every separation block to further improve the optimization process. Finally, to ensure each speaker is consistently mapped to a particular output channel, we added a perceptual loss function using a pretrained speaker recognition model.
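The training objective described above can be sketched roughly as follows. This is a simplified illustration in plain Python, not the paper's implementation: the function names, the brute-force search over permutations, and the equal weighting of the per-block losses are all assumptions, and the perceptual speaker-identity loss is left out.

```python
import math
from itertools import permutations

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB (compact version; zero-mean, then project)."""
    n = len(target)
    e = [x - sum(estimate) / n for x in estimate]
    t = [x - sum(target) / n for x in target]
    scale = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    signal = sum((scale * b) ** 2 for b in t)
    noise = sum((a - scale * b) ** 2 for a, b in zip(e, t))
    return 10 * math.log10((signal + eps) / (noise + eps))

def pit_si_snr_loss(estimates, targets):
    """Negative mean SI-SNR under the best channel-to-speaker assignment.

    Returns (loss, perm), where perm[ch] is the speaker matched to channel ch.
    """
    best_score, best_perm = -math.inf, None
    for perm in permutations(range(len(targets))):
        score = sum(si_snr(estimates[ch], targets[spk])
                    for ch, spk in enumerate(perm)) / len(targets)
        if score > best_score:
            best_score, best_perm = score, perm
    return -best_score, best_perm

def multi_block_loss(per_block_estimates, targets):
    """Average the PIT loss over the output of every separation block
    (equal weights are an assumption made for this sketch)."""
    losses = [pit_si_snr_loss(est, targets)[0] for est in per_block_estimates]
    return sum(losses) / len(losses)

# Toy example: two speakers, with the output channels in swapped order.
targets = [[1.0, -1.0, 0.5, -0.5], [0.5, 0.5, -0.5, -0.5]]
estimates = [[0.6, 0.4, -0.5, -0.5], [1.1, -0.9, 0.4, -0.6]]
loss, perm = pit_si_snr_loss(estimates, targets)
print(loss, perm)
```

The search over permutations is what makes the loss indifferent to the order in which the network emits the speakers. With at most five speakers there are only 120 permutations, so the exhaustive search stays cheap.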

We also built a new system to handle mixtures with an unknown number of speakers. We trained separate models for two, three, four, and five speakers. We first fed the input mixture to the five-speaker model and counted the active (nonsilent) output channels. We then ran the model trained for that number of speakers and checked whether all of its output channels were active. We repeated this process until either all channels were active or we reached the model with the lowest number of target speakers.
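The selection procedure above can be sketched as a short loop. Everything here is a hypothetical stand-in: `models` is assumed to map each speaker count (consecutive integers) to a callable that returns that many output channels, and the mean-power threshold used to decide whether a channel is active is an assumption, since the post does not say how silence is detected.

```python
def channel_is_active(channel, threshold=1e-3):
    """Treat a channel as active if its mean power exceeds a (hypothetical) threshold."""
    return sum(x * x for x in channel) / len(channel) > threshold

def separate_unknown_count(mixture, models):
    """Start with the largest model and step down until every channel is active.

    Returns (speaker_count, active_channels).
    """
    k = max(models)
    smallest = min(models)
    while True:
        outputs = models[k](mixture)
        active = [ch for ch in outputs if channel_is_active(ch)]
        # Stop when every channel is in use or no smaller model exists.
        if len(active) == len(outputs) or k == smallest:
            return k, active
        # Otherwise retry with the model trained for the detected count.
        k = max(len(active), smallest)

# Toy stand-ins for trained models: each returns the three "true" sources,
# padded with silent channels or truncated to k channels.
sources = [[1.0, -1.0], [0.5, 0.5], [0.2, -0.3]]

def make_fake_model(k):
    def model(mixture):
        return (sources + [[0.0, 0.0]] * k)[:k]
    return model

models = {k: make_fake_model(k) for k in (2, 3, 4, 5)}
count, channels = separate_unknown_count([1.7, -0.8], models)
print(count, len(channels))
```

In this toy run the five-speaker model leaves two channels silent, so the loop moves to the three-speaker model, where every channel is active.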

Why it matters

The ability to separate a single voice from a conversation among many people can improve communication across a wide range of applications that we use in our daily lives, like voice messaging, assistants, and video tools, as well as AR/VR innovations. It can also improve audio quality for people with hearing aids, so it’s easier to hear others clearly in crowded and noisy environments such as parties, restaurants, or large video calls.

Beyond separating different voices, our novel system can also be applied to separate other types of speech signals from a mixture of sounds such as background noise. Our work can also be applied to music recordings, improving our previous work on separating different musical instruments from a single audio file. As a next step, we’ll work on improving the generative properties of the model until it achieves high performance in real-world conditions.

Read the full paper:

Voice separation with an unknown number of multiple speakers

Check out the audio samples here.

