Accurate Online Speaker Diarization with Supervised Learning

内容简介：In “

Speaker diarization , the process of partitioning an audio stream with multiple people into homogeneous segments associated with each individual, is an important part of speech recognition systems. By solving the problem of “who spoke when”, speaker diarization has applications in many important scenarios, such as understanding medical conversations , video captioning and more. However, training these systems with supervised learning methods is challenging — unlike standard supervised classification tasks, a robust diarization model requires the ability to associate new individuals with distinct speech segments that weren’t involved in training. Importantly, this limits the quality of both online and offline diarization systems. Online systems usually suffer more, since they require diarization results in real time.

Accurate Online Speaker Diarization with Supervised Learning

Online speaker diarization on streaming audio input. Different colors in the bottom axis indicate different speakers.

In “ Fully Supervised Speaker Diarization ”, we describe a new model that seeks to make use of supervised speaker labels in a more effective manner. Here “fully” implies that all components in the speaker diarization system, including the estimation of the number of speakers, are trained in supervised ways, so that they can benefit from increasing the amount of labeled data available. On the NIST SRE 2000 CALLHOME benchmark, our diarization error rate (DER) is as low as 7.6%, compared to 8.8% DER from our previous clustering-based method , and 9.9% from deep neural network embedding methods . Moreover, our method achieves this lower error rate based on online decoding, making it specifically suitable for real-time applications. As such we are open sourcing the core algorithms in our paper to accelerate more research along this direction.

Clustering versus Interleaved-state RNN

Modern speaker diarization systems are usually based on clustering algorithms such as k-means or spectral clustering . Since these clustering methods are unsupervised, they could not make good use of the supervised speaker labels available in data. Moreover, online clustering algorithms usually have worse quality in real-time diarization applications with streaming audio inputs. The key difference between our model and common clustering algorithms is that in our method, all speakers’ embeddings are modeled by a parameter-sharing recurrent neural network (RNN), and we distinguish different speakers using different RNN states, interleaved in the time domain.

To understand how this works, consider the example below in which there are four possible speakers: blue , yellow , pink and green (this is arbitrary, and in fact there may be more — our model uses Chinese restaurant process to accommodate the unknown number of speakers). Each speaker starts with its own RNN instance (with a common initial state shared among all speakers) and keeps updating the RNN state given the new embeddings from this speaker. In the example below, the blue speaker keeps updating its RNN state until a different speaker, yellow , comes in. If blue speaks again later, it resumes updating its RNN state. (This is just one of the possibilities for speech segment y ₇ in the figure below. If new speaker green enters, it will start with a new RNN instance.)

The generative process of our model. Colors indicate labels for speaker segments.

Representing speakers as RNN states enables us to learn the high-level knowledge shared across different speakers and utterances using RNN parameters, and this promises the usefulness of more labeled data. In contrast, common clustering algorithms almost always work with each single utterance independently, making it difficult to benefit from a large amount of labeled data.

The upshot of all this is that given time-stamped speaker labels (i.e. we know who spoke when), we can train the model with standard stochastic gradient descent algorithms. A trained model can be used for speaker diarization on new utterances from unheard speakers. Furthermore, the use of online decoding makes it more suitable for latency-sensitive applications.

Future Work

Although we’ve already achieved impressive diarization performance with this system, there are still many exciting directions we are currently exploring. First, we are refining our model so it can easily integrate contextual information to perform offline decoding. This will likely further reduce the DER, which is more useful for latency-insensitive applications. Second, we would like to model acoustic features directly instead of using d-vectors. In this way, the entire speaker diarization system can be trained in an end-to-end way.

To learn more about this work, please see our paper . To download the core algorithm of this system, please visit the Github page .

Acknowledgments

This work was done as a close collaboration between Google AI and Speech & Assistant teams. Contributors include Aonan Zhang (intern), Quan Wang, Zhengyao Zhu and Chong Wang.

除非特别声明，此文章内容采用知识共享署名 3.0 许可，代码示例采用 Apache 2.0 许可。更多细节请查看我们的服务条款。

以上所述就是小编给大家介绍的《Accurate Online Speaker Diarization with Supervised Learning》，希望对大家有所帮助，如果大家有任何疑问请给我留言，小编会及时回复大家的。在此也非常感谢大家对码农网的支持！

查看所有标签

猜你喜欢:

Accurate Online Speaker Diarization with Supervised Learning

本站部分资源来源于网络，本站转载出于传递更多信息之目的，版权归原作者或者来源机构所有，如转载稿涉及版权问题，请联系我们。

码农书籍

JSP信息系统开发实例精选

赛奎春 / 机械工业出版社 / 2006-1 / 44.00元

本书精选了大学生就业求职网、物流短信平台、化奥汽车销售集团网站、佳美网络购物中心、科研成果申报管理系统、安瑞奥国际商务不院招生网、明日宽带影院、雄霸天下游戏网等8个综合的网络信息系统工程作为案例，深入剖析了实际的网络信息系统的开发思路、方法和技巧。帮助读者透彻掌握JSP开发网络信息系统的方法和步骤，从而设计出具有实用价值和商用价值的信息系统。　　本书产例具有很强的实用性和工程实践性，在讲解......一起来看看《JSP信息系统开发实例精选》这本书的介绍吧!

码农工具

Accurate Online Speaker Diarization with Supervised Learning

JSP信息系统开发实例精选

UNIX 时间戳转换

RGB HSV 转换

HEX CMYK 转换工具