What the research is
We’re introducing a new method to separate up to five voices speaking simultaneously on a single microphone. Our method surpasses previous state-of-the-art performance on several speech source separation benchmarks, including ones with challenging noise and reverberation. On the WSJ0-2mix and WSJ0-3mix data sets, along with newly created variations with four and five simultaneous speakers, our model achieved a scale-invariant signal-to-noise ratio (SI-SNR, a common measure of separation quality) improvement of more than 1.5 dB (decibels) over the current state-of-the-art models.
To build our model, we use a novel recurrent neural network architecture that works directly on the raw audio waveform. The previous best models use a mask and a decoder to sort each speaker’s voice, and their performance degrades rapidly when the number of speakers is high or unknown.
As with standard speech separation systems, our model requires knowledge of the total number of speakers in advance. To handle cases where the number of speakers is unknown, we built a novel system that automatically detects the number of speakers and selects the most relevant model.
How it works
Given an input mixture of speech signals, the goal of a speech separation model is to estimate the individual sources and output one isolated channel per speaker.
Our model uses an encoder network that maps the input signal to a latent representation. A voice separation network composed of several blocks then takes this latent representation as input and outputs an estimated signal for each speaker. Previous methods typically apply a mask when performing separation, which is problematic when the mask is not defined, and some signal information may be lost in the process.
We trained the model to directly optimize SI-SNR, using several loss functions with permutation invariant training. We inserted a loss function after every separation block to further improve the optimization process. Finally, to ensure each speaker is consistently mapped to a particular output channel, we added a perceptual loss function using a pretrained speaker recognition model.
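To make the training objective above more concrete, here is a minimal sketch of SI-SNR combined with permutation invariant scoring. It is not the authors' code: the function names `si_snr` and `pit_si_snr`, the `eps` stabilizer, and the use of plain Python lists are assumptions made for illustration, and a real training setup would use a tensor library and minimize the negative of this score.

```python
import itertools
import math

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length signals."""
    # Remove the mean so a constant offset does not affect the score.
    est_mean = sum(estimate) / len(estimate)
    tgt_mean = sum(target) / len(target)
    est = [e - est_mean for e in estimate]
    tgt = [t - tgt_mean for t in target]
    # Project the estimate onto the target; the remainder counts as noise.
    scale = sum(e * t for e, t in zip(est, tgt)) / (sum(t * t for t in tgt) + eps)
    projection = [scale * t for t in tgt]
    signal_energy = sum(p * p for p in projection)
    noise_energy = sum((e - p) ** 2 for e, p in zip(est, projection))
    return 10 * math.log10((signal_energy + eps) / (noise_energy + eps))

def pit_si_snr(estimates, targets):
    """Return the best mean SI-SNR and the channel order that achieves it."""
    best_score, best_perm = -math.inf, None
    for perm in itertools.permutations(range(len(targets))):
        # perm[i] is the output channel assigned to reference speaker i.
        score = sum(
            si_snr(estimates[p], targets[i]) for i, p in enumerate(perm)
        ) / len(targets)
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm
```

Because the score is taken over the best assignment of output channels to reference speakers, the model is not penalized for emitting the speakers in a different order than the labels. The exhaustive search over permutations grows factorially, which is still manageable for the two to five speakers discussed here.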
We also built a new system to handle separation of an unknown number of speakers. We did this by training different models for separating two, three, four, and five speakers. We first fed the input mixture to the model designed to accommodate up to five simultaneous speakers and detected the number of active (nonsilent) channels in its output. Then, we repeated the same process with the model trained for that number of active speakers and checked whether all of its output channels were active. We repeated this process until either all channels were active or we reached the model with the lowest number of target speakers.
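The selection procedure described above can be sketched as a short loop. This is an illustrative sketch only: `separate_unknown`, `count_active`, the mean-power silence test, and the `threshold` value are all assumptions, and `models` is a hypothetical dictionary mapping a speaker count to any callable that returns that many output channels.

```python
def mean_power(signal):
    """Average energy per sample, used here as a simple activity measure."""
    return sum(x * x for x in signal) / len(signal)

def count_active(channels, threshold):
    """Count output channels whose mean power exceeds the silence threshold."""
    return sum(1 for ch in channels if mean_power(ch) > threshold)

def separate_unknown(mixture, models, threshold=1e-4):
    """Pick a model by repeatedly counting active output channels.

    models: dict mapping speaker count -> callable(mixture) -> list of channels.
    Returns (chosen speaker count, separated channels).
    """
    sizes = sorted(models)
    k = sizes[-1]  # start with the model for the most speakers
    while True:
        channels = models[k](mixture)
        active = count_active(channels, threshold)
        # Stop when every channel is active or no smaller model exists.
        if active == k or k == sizes[0]:
            return k, channels
        # Otherwise retry with the model matching the active count,
        # falling back to the smallest model if none matches.
        k = max((s for s in sizes if s <= active), default=sizes[0])
```

Since a model for k speakers can report at most k active channels, each retry moves to a strictly smaller model, so the loop always terminates.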
Why it matters
The ability to separate a single voice from a conversation among many people can improve communication across a wide range of applications that we use in our daily lives, like voice messaging, assistants, and video tools, as well as AR/VR innovations. It can also improve audio quality for people with hearing aids, so it’s easier to hear others clearly in crowded and noisy environments such as parties, restaurants, or large video calls.
Beyond separating different voices, our novel system can also be applied to separate other types of speech signals from a mixture of sounds such as background noise. Our work can also be applied to music recordings, improving our previous work on separating different musical instruments from a single audio file. As a next step, we’ll work on improving the generative properties of the model until it achieves high performance in real-world conditions.
Read the full paper:
Voice separation with an unknown number of multiple speakers